Git Repo - linux.git/log

zram: propagate error to user

When zcomp is initialized in single-stream mode, max_comp_streams cannot
be changed without a zram reset.  However, the current interface reports
no error to the user and even updates the value of max_comp_streams
without any effect, which is very confusing.

This patch prevents max_comp_streams from being changed when zcomp was
initialized as a single-stream zcomp and returns an error to the user
(e.g., to echo).
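
A minimal user-space sketch of the idea follows.  It is not the kernel
code: the demo_* names, the bool-returning callback and the choice of
-EINVAL as the error code are assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the compression backend. */
struct demo_zcomp {
	int max_strm;
	/* returns false when the stream limit cannot be changed */
	bool (*set_max_streams)(struct demo_zcomp *comp, int num);
};

/* Single-stream backend: supports exactly one stream, so always refuse. */
static bool demo_single_set_max_streams(struct demo_zcomp *comp, int num)
{
	(void)comp;
	(void)num;
	return false;
}

/* Multi-stream backend: accept the new limit. */
static bool demo_multi_set_max_streams(struct demo_zcomp *comp, int num)
{
	comp->max_strm = num;
	return true;
}

/* Store handler: report a failure instead of silently ignoring it. */
static int demo_max_comp_streams_store(struct demo_zcomp *comp, int num)
{
	if (num < 1)
		return -EINVAL;		/* assumption: reject values below 1 */
	if (!comp->set_max_streams(comp, num))
		return -EINVAL;		/* surfaces as a failed write, e.g. from echo */
	return 0;
}
```

With this shape, a write to a single-stream device fails visibly and
leaves the stored value untouched.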

[[email protected]: don't return with the lock held, per Sergey]
[[email protected]: fix coccinelle warnings]
Signed-off-by: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Jerome Marchand <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: return error-valued pointer from zcomp_create()

Instead of returning just NULL, return ERR_PTR from zcomp_create() if
compressing backend creation has failed. ERR_PTR(-EINVAL) for unsupported
compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or
compression stream) error.

Perform an IS_ERR() check on the value returned from zcomp_create() in
disksize_store() and set the return code to PTR_ERR().
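
The error-pointer convention can be sketched in user space as below.
This is only an illustration: ERR_PTR(), PTR_ERR() and IS_ERR() are
re-implemented here in simplified form, and demo_create()/demo_store()
with their "lzo"-only check are hypothetical stand-ins for
zcomp_create() and disksize_store().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified user-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* the top MAX_ERRNO addresses are reserved for encoded error codes */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct demo_comp { const char *name; };

/* Hypothetical backend constructor: only "lzo" is accepted in this sketch. */
static struct demo_comp *demo_create(const char *name)
{
	struct demo_comp *comp;

	if (strcmp(name, "lzo") != 0)
		return ERR_PTR(-EINVAL);	/* unsupported algorithm */

	comp = malloc(sizeof(*comp));
	if (!comp)
		return ERR_PTR(-ENOMEM);	/* allocation failure */

	comp->name = "lzo";
	return comp;
}

/* Caller side: turn an error pointer back into a return code. */
static long demo_store(const char *name)
{
	struct demo_comp *comp = demo_create(name);

	if (IS_ERR(comp))
		return PTR_ERR(comp);

	free(comp);
	return 0;
}
```

The caller can now tell an unsupported algorithm apart from an
allocation failure, which a bare NULL return could not express.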

Change suggested by Jerome Marchand.

[[email protected]: clean up error recovery flow]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Reported-by: Jerome Marchand <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: move comp allocation out of init_lock

While fixing lockdep spew of ->init_lock reported by Sasha Levin [1],
Minchan Kim noted [2] that it's better to move compression backend
allocation (using GFP_KERNEL) out of the ->init_lock lock, same way as
with zram_meta_alloc(), in order to prevent the same lockdep spew.

[1] https://lkml.org/lkml/2014/2/27/337
[2] https://lkml.org/lkml/2014/3/3/32

Signed-off-by: Sergey Senozhatsky <[email protected]>
Reported-by: Minchan Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sasha Levin <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: add lz4 algorithm backend

Introduce LZ4 compression backend and make it available for selection.
LZ4 support is optional and requires user to set ZRAM_LZ4_COMPRESS config
option.  The default compression backend is LZO.

TEST

(x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
ext4 file system, 3 compression streams)

iozone -t 3 -R -r 16K -s 60M -I +Z

       Test           LZO           LZ4
----------------------------------------------
  Initial write   1642744.62    1317005.09
        Rewrite   2498980.88    1800645.16
           Read   3957026.38    5877043.75
        Re-read   3950997.38    5861847.00
   Reverse Read   2937114.56    5047384.00
    Stride read   2948163.19    4929587.38
    Random read   3292692.69    4880793.62
 Mixed workload   1545602.62    3502940.38
   Random write   2448039.75    1758786.25
         Pwrite   1670051.03    1338329.69
          Pread   2530682.00    5097177.62
         Fwrite   3232085.62    3275942.56
          Fread   6306880.25    6645271.12

So on my system LZ4 is slower in write-only tests, while it performs
better in read-only and mixed (reads + writes) tests.

Official LZ4 benchmarks available here http://code.google.com/p/lz4/
(linux kernel uses revision r90).

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: make compression algorithm selection possible

Add and document the `comp_algorithm' device attribute.  This attribute
makes it possible to show the supported and the currently selected
compression algorithms:

cat /sys/block/zram0/comp_algorithm
[lzo] lz4

and change selected compression algorithm:
echo lzo > /sys/block/zram0/comp_algorithm

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: add set_max_streams knob

This patch allows max_comp_streams to be changed on an initialised zcomp.

Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams()
and zcomp_strm_single_set_max_streams() callbacks to change streams limit
for zcomp_strm_multi and zcomp_strm_single, respectively.  set_max_streams
for single stream zcomp does nothing.

If user has lowered the limit, then zcomp_strm_multi_set_max_streams()
attempts to immediately free extra streams (as much as it can, depending
on idle streams availability).

Note, this patch does not allow changing the stream 'policy' from single to
multi stream (or vice versa) on an already initialised compression backend.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: add multi stream functionality

Existing zram (zcomp) implementation has only one compression stream
(buffer and algorithm private part), so in order to prevent data
corruption only one write (compress operation) can use this compression
stream, forcing all concurrent write operations to wait for stream lock
to be released.  This patch changes zcomp to keep a compression streams
list of user-defined size (via sysfs device attr).  Each write operation
still exclusively holds compression stream, the difference is that we
can have N write operations (depending on size of streams list)
executing in parallel.  See TEST section later in commit message for
performance data.

Introduce struct zcomp_strm_multi and a set of functions to manage
zcomp_strm stream access.  zcomp_strm_multi has a list of idle
zcomp_strm structs, spinlock to protect idle list and wait queue, making
it possible to perform parallel compressions.

The following set of functions is added:
- zcomp_strm_multi_find()/zcomp_strm_multi_release()
  find and release a compression stream, implement required locking
- zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
  create and destroy zcomp_strm_multi

zcomp ->strm_find() and ->strm_release() callbacks are set during
initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
correspondingly.

Each time zcomp issues a zcomp_strm_multi_find() call, the following set
of operations is performed:

- spin lock strm_lock
- if idle list is not empty, remove zcomp_strm from idle list, spin
  unlock and return zcomp stream pointer to caller
- if idle list is empty, current adds itself to wait queue.  It will be
  awakened by zcomp_strm_multi_release() caller.

zcomp_strm_multi_release():
- spin lock strm_lock
- add zcomp stream to idle list
- spin unlock, wake up sleeper

Minchan Kim reported that spinlock-based locking scheme has demonstrated
a severe performance regression for single compression stream case,
compared to mutex-based (see https://lkml.org/lkml/2014/2/18/16)

base                      spinlock                    mutex

==Initial write           ==Initial write             ==Initial  write
records:  5               records:  5                 records:   5
avg:      1642424.35      avg:      699610.40         avg:       1655583.71
std:      39890.95(2.43%) std:      232014.19(33.16%) std:       52293.96
max:      1690170.94      max:      1163473.45        max:       1697164.75
min:      1568669.52      min:      573429.88         min:       1553410.23
==Rewrite                 ==Rewrite                   ==Rewrite
records:  5               records:  5                 records:   5
avg:      1611775.39      avg:      501406.64         avg:       1684419.11
std:      17144.58(1.06%) std:      15354.41(3.06%)   std:       18367.42
max:      1641800.95      max:      531356.78         max:       1706445.84
min:      1593515.27      min:      488817.78         min:       1655335.73

When only one compression stream is available, mutex with spin on owner
tends to perform much better than frequent wait_event()/wake_up().  This
is why single stream is implemented as a special case with mutex locking.

Introduce and document zram device attribute max_comp_streams.  This
attr shows and stores current zcomp's max number of zcomp streams
(max_strm).  Extend zcomp's zcomp_create() with `max_strm' parameter.
`max_strm' limits the number of zcomp_strm structs in compression
backend's idle list (max_comp_streams).

max_comp_streams is used during initialisation as follows:
-- passing max_strm equal to 1 to zcomp_create() will initialise zcomp
using single compression stream zcomp_strm_single (mutex-based locking).
-- passing max_strm greater than 1 to zcomp_create() will initialise zcomp
using multi compression stream zcomp_strm_multi (spinlock-based locking).

The default max_comp_streams value is 1, meaning that zram with a single
stream will be initialised.
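
The selection rule can be sketched as below.  This is a simplified
user-space illustration with hypothetical demo_* names, and rejecting
values below 1 is an assumption of the sketch rather than something
stated in this patch.

```c
#include <stddef.h>

/* Which stream-management scheme the backend was initialised with. */
enum demo_policy {
	DEMO_STRM_SINGLE,	/* one stream, mutex-based locking */
	DEMO_STRM_MULTI,	/* idle list of streams, spinlock-based locking */
};

struct demo_zcomp {
	enum demo_policy policy;
	int max_strm;
};

/* Pick the policy from max_strm; returns 0 on success, -1 on a bad value. */
static int demo_zcomp_create(struct demo_zcomp *comp, int max_strm)
{
	if (comp == NULL || max_strm < 1)
		return -1;	/* assumption: values below 1 are rejected */

	comp->max_strm = max_strm;
	comp->policy = (max_strm > 1) ? DEMO_STRM_MULTI : DEMO_STRM_SINGLE;
	return 0;
}
```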

Later patch will introduce configuration knob to change max_comp_streams
on already initialised and used zcomp.

TEST
iozone -t 3 -R -r 16K -s 60M -I +Z

       test           base       1 strm (mutex)     3 strm (spinlock)
-----------------------------------------------------------------------
 Initial write      589286.78       583518.39          718011.05
       Rewrite      604837.97       596776.38         1515125.72
  Random write      584120.11       595714.58         1388850.25
        Pwrite      535731.17       541117.38          739295.27
        Fwrite     1418083.88      1478612.72         1484927.06

Usage example:
set max_comp_streams to 4
        echo 4 > /sys/block/zram0/max_comp_streams

show current max_comp_streams (default value is 1).
        cat /sys/block/zram0/max_comp_streams

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: factor out single stream compression

This is preparation patch to add multi stream support to zcomp.

Introduce struct zcomp_strm_single and a set of functions to manage
zcomp_strm stream access.  zcomp_strm_single implements single compression
stream, same way as current zcomp implementation.  This moves zcomp_strm
stream control and locking from zcomp, so compressing backend zcomp is not
aware of required locking.

Single and multi streams require different locking schemes.  Minchan Kim
reported that spinlock-based locking scheme (which is used in multi stream
implementation) has demonstrated a severe performance regression for single
compression stream case, compared to mutex-based.  See
https://lkml.org/lkml/2014/2/18/16

The following set of functions is added:
- zcomp_strm_single_find()/zcomp_strm_single_release()
  find and release a compression stream, implement required locking
- zcomp_strm_single_create()/zcomp_strm_single_destroy()
  create and destroy zcomp_strm_single

New ->strm_find() and ->strm_release() callbacks are added to zcomp and
set to zcomp_strm_single_find() and zcomp_strm_single_release() during
initialisation.  Instead of direct locking and zcomp_strm access from
zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
and ->strm_release(), respectively.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: use zcomp compressing backends

Do not perform direct LZO compress/decompress calls, initialise
and use zcomp LZO backend (single compression stream) instead.

[[email protected]: resolve conflicts with zram-delete-zram_init_device-fix.patch]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: introduce compressing backend abstraction

ZRAM performs direct LZO compression algorithm calls, making it the one
and only option.  While LZO generally performs well, the LZ4 algorithm
tends to have faster decompression (see http://code.google.com/p/lz4/
for the full report)

Name            Ratio  C.speed D.speed
                        MB/s    MB/s
LZ4 (r101)      2.084    422    1820
LZO 2.06        2.106    414     600

Thus, users who have mostly read (decompress) usage scenarios or mixed
workloads (writes with a relatively high number of read ops) will benefit
from using the LZ4 compression backend.

Introduce compressing backend abstraction zcomp in order to support
multiple compression algorithms with the following set of operations:

        .create
        .destroy
        .compress
        .decompress
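
One way to picture such an operations table is the user-space sketch
below.  It is an assumption-laden illustration, not the zcomp code: the
demo_* and store_* names, the callback signatures and the toy "store"
backend (which copies data unchanged) are all hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Per-algorithm operations, mirroring the four callbacks listed above. */
struct demo_backend {
	const char *name;
	void *(*create)(void);
	void (*destroy)(void *private);
	int (*compress)(const unsigned char *src, size_t src_len,
			unsigned char *dst, size_t *dst_len, void *private);
	int (*decompress)(const unsigned char *src, size_t src_len,
			  unsigned char *dst, size_t *dst_len);
};

/* Toy "store" backend: copies data unchanged, purely for illustration. */
static void *store_create(void) { return malloc(1); }
static void store_destroy(void *private) { free(private); }

static int store_compress(const unsigned char *src, size_t src_len,
			  unsigned char *dst, size_t *dst_len, void *private)
{
	(void)private;
	memcpy(dst, src, src_len);
	*dst_len = src_len;
	return 0;
}

static int store_decompress(const unsigned char *src, size_t src_len,
			    unsigned char *dst, size_t *dst_len)
{
	memcpy(dst, src, src_len);
	*dst_len = src_len;
	return 0;
}

static const struct demo_backend demo_backends[] = {
	{ "store", store_create, store_destroy, store_compress, store_decompress },
};

/* Look a backend up by name; NULL means the algorithm is not supported. */
static const struct demo_backend *demo_find_backend(const char *name)
{
	size_t n = sizeof(demo_backends) / sizeof(demo_backends[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(demo_backends[i].name, name) == 0)
			return &demo_backends[i];
	return NULL;
}
```

Adding another algorithm then amounts to adding one more table entry,
without touching the callers.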

Schematically zram write() usually contains the following steps:
0) preparation (decompression of partial IO, etc.)
1) lock buffer_lock mutex (protects meta compress buffers)
2) compress (using meta compress buffers)
3) alloc and map zs_pool object
4) copy compressed data (from meta compress buffers) to object allocated by 3)
5) free previous pool page, assign a new one
6) unlock buffer_lock mutex

As we can see, compressing buffers must remain untouched from 1) to 4),
because, otherwise, concurrent write() can overwrite data.  At the same
time, zram_meta must be aware of a) specific compression algorithm memory
requirements and b) necessary locking to protect compression buffers.  To
remove requirement a), a new struct zcomp_strm is introduced, which
contains a compress/decompress `buffer' and compression algorithm
`private' part.  Struct zcomp, in turn, implements zcomp_strm stream
handling and locking, which removes requirement b) from zram meta.  zcomp
->create() and ->destroy(), respectively, allocate and deallocate the
algorithm specific zcomp_strm `private' part.

Every zcomp has a zcomp stream and a mutex to protect its compression
stream.  Stream usage semantics remain the same -- only one write can hold
the stream lock and use its buffers.  zcomp_strm_find() turns the caller
into the exclusive user of a stream (holding the stream mutex until zram
releases the stream), and zcomp_strm_release() makes the zcomp stream
available (unlocks the stream mutex).  Hence no concurrent write
(compression) operations are possible at the moment.

iozone -t 3 -R -r 16K -s 60M -I +Z

       test            base           patched
--------------------------------------------------
  Initial write      597992.91       591660.58
        Rewrite      609674.34       616054.97
           Read     2404771.75      2452909.12
        Re-read     2459216.81      2470074.44
   Reverse Read     1652769.66      1589128.66
    Stride read     2202441.81      2202173.31
    Random read     2236311.47      2276565.31
 Mixed workload     1423760.41      1709760.06
   Random write      579584.08       615933.86
         Pwrite      597550.02       594933.70
          Pread     1703672.53      1718126.72
         Fwrite     1330497.06      1461054.00
          Fread     3922851.00      3957242.62

Usage examples:

comp = zcomp_create(NAME) /* NAME e.g. "lzo" */

which initialises compressing backend if requested algorithm is supported.

Compress:
zstrm = zcomp_strm_find(comp)
zcomp_compress(comp, zstrm, src, &dst_len)
[..] /* copy compressed data */
zcomp_strm_release(comp, zstrm)

Decompress:
zcomp_decompress(comp, src, src_len, dst);

Free compressing backend and its zcomp stream:
zcomp_destroy(comp)

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: delete zram_init_device()

Allocate new `zram_meta' in disksize_store() only for an uninitialised
zram device, saving a number of allocations and deallocations in case
disksize_store() was called on a currently used device.  At the same time,
the zram_meta stack variable is not necessary, because we can set ->meta
directly.  There is also no need to set the QUEUE_FLAG_NONROT queue flag
on every disksize_store(); set it once during device creation.

[[email protected]: handle zram->meta alloc fail case]
[[email protected]: prevent lockdep spew of init_lock]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Sasha Levin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: document failed_reads, failed_writes stats

Document `failed_reads' and `failed_writes' device attributes.
Remove info about `discard' - there is no such zram attr.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: move zram size warning to documentation

Move zram warning about disksize and size of memory correlation to zram
documentation.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: drop not used table `count' member

struct table `count' member is not used.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: report failed read and write stats

zram accounted but did not report the numbers of failed read and write
queries.  Make these stats available as failed_reads and failed_writes
attrs.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: remove zram stats code duplication

Introduce ZRAM_ATTR_RO macro that generates device_attribute and default
ATTR show() function for existing atomic64_t zram stats.
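
The macro idea can be sketched in user space as below.  This is not the
ZRAM_ATTR_RO definition: DEMO_ATTR_RO, struct demo_stats and the show()
signature are hypothetical, and plain uint64_t fields stand in for
atomic64_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the stats structure; the kernel uses atomic64_t members. */
struct demo_stats {
	uint64_t num_reads;
	uint64_t num_writes;
};

/*
 * Generate one read-only show() function per stat, so that the
 * formatting code is written once instead of once per attribute.
 */
#define DEMO_ATTR_RO(name)						\
static int demo_##name##_show(const struct demo_stats *s,		\
			      char *buf, size_t len)			\
{									\
	return snprintf(buf, len, "%llu\n",				\
			(unsigned long long)s->name);			\
}

DEMO_ATTR_RO(num_reads)
DEMO_ATTR_RO(num_writes)
```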

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: use atomic64_t for all zram stats

This is a preparation patch for stats code duplication removal.

1) use atomic64_t for `pages_zero' and `pages_stored' zram stats.

2) `compr_size' and `pages_zero' struct zram_stats members did not
   follow the existing device attr naming scheme: zram_stats.ATTR has
   ATTR_show() function.  rename them:

   -- compr_size -> compr_data_size
   -- pages_zero -> zero_pages

Minchan Kim's note:
If we really have trouble with atomic stat operation, we could
change it with percpu_counter so that it could solve atomic overhead and
unnecessary memory space by introducing unsigned long instead of 64bit
atomic_t.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: remove good and bad compress stats

Remove `good' and `bad' compressed sub-requests stats.  RW request may
cause a number of RW sub-requests.  zram used to account `good' compressed
sub-queries (with compressed size less than 50% of original size), `bad'
compressed sub-queries (with compressed size greater than 75% of original
size), leaving sub-requests with compression size between 50% and 75% of
original size not accounted and not reported.  zram already accounts each
sub-request's compression size so we can calculate real device compression
ratio.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: do not pass rw argument to __zram_make_request()

Do not pass rw argument down the __zram_make_request() -> zram_bvec_rw()
chain, decode it in zram_bvec_rw() instead. Besides, this is the place
where we distinguish READ and WRITE bio data directions, so account zram
RW stats here, instead of __zram_make_request(). This also makes it possible
to account the real number of zram READ/WRITE operations, not just requests
(single RW request may cause a number of zram RW ops with separate
locking, compression/decompression, etc).

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: drop `init_done' struct zram member

Introduce init_done() helper function which allows us to drop `init_done'
struct zram member. init_done() uses the fact that ->init_done == 1
is equivalent to ->meta != NULL.
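
A minimal sketch of such a helper, using hypothetical demo_* types in
place of struct zram and struct zram_meta:

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_meta { int placeholder; };

struct demo_zram {
	struct demo_meta *meta;	/* non-NULL once the device is initialised */
};

/* The device counts as initialised exactly when its meta is allocated. */
static inline bool demo_init_done(const struct demo_zram *zram)
{
	return zram->meta != NULL;
}
```

Deriving the state from ->meta removes the risk of the flag and the
pointer getting out of sync.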

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL

A new dump_page() routine was recently added, and marked
EXPORT_SYMBOL_GPL.  dump_page() was also added to the VM_BUG_ON_PAGE()
macro, and so the end result is that non-GPL code can no longer call
get_page() and a few other routines.

This only happens if the kernel was compiled with CONFIG_DEBUG_VM.

Change dump_page() to be EXPORT_SYMBOL.

Longer explanation:

Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON
using VM_BUG_ON_PAGE"), it was possible to build MIT-licensed (non-GPL)
drivers on Fedora.  Fedora is semi-unique, in that it sets
CONFIG_DEBUG_VM.

Because Fedora sets CONFIG_DEBUG_VM, they end up pulling in dump_page(),
via VM_BUG_ON_PAGE, via get_page().  As one of the authors of NVIDIA's
new, open source, "UVM-Lite" kernel module, I originally chose to use
the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
because get_page() has always seemed to be very clearly intended for use
by non-GPL, driver code.

So I'm hoping that making get_page() widely accessible again will not be
too controversial.  We did check with Fedora first, and they responded
(https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
try to get upstream changed, before asking Fedora to change.  Their
reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
Fedora to help catch mm bugs.

Signed-off-by: John Hubbard <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Josh Boyer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

numa: use LAST_CPUPID_SHIFT to calculate LAST_CPUPID_MASK

LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH.  However,
LAST_CPUPID_WIDTH itself can be 0 (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
set).  In such a case LAST_CPUPID_MASK turns out to be 0.

But with recent commit 1ae71d0319 ("mm: numa: bugfix for
LAST_CPUPID_NOT_IN_PAGE_FLAGS"), if LAST_CPUPID_MASK is 0,
page_cpupid_xchg_last() and page_cpupid_reset_last() cause
page->_last_cpupid to be set to 0.

This causes a performance regression.  It's almost as if numa_balancing is
off.

Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
LAST_CPUPID_WIDTH.
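
The effect of the two choices can be sketched as below.  The shift
values are assumptions chosen for illustration (the real ones depend on
the kernel configuration), and the DEMO_*/demo_* names are hypothetical.

```c
/* Illustrative values only; the real ones depend on the kernel config. */
#define DEMO_LAST_PID_SHIFT	8	/* assumed bits for the pid part */
#define DEMO_LAST_CPU_SHIFT	8	/* assumed bits for the cpu part */
#define DEMO_LAST_CPUPID_SHIFT	(DEMO_LAST_PID_SHIFT + DEMO_LAST_CPU_SHIFT)

/* Width is 0 when the cpupid is not stored in page flags. */
#define DEMO_LAST_CPUPID_WIDTH	0

/* Build a mask with the low `bits' bits set. */
static unsigned long demo_mask(unsigned int bits)
{
	return (1UL << bits) - 1;
}
```

Here demo_mask(DEMO_LAST_CPUPID_WIDTH) is 0, so every masked cpupid
collapses to 0, while demo_mask(DEMO_LAST_CPUPID_SHIFT) keeps all of the
cpupid bits.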

Some performance numbers and perf stats with and without the fix.

(3.14-rc6)
----------
numa01

Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

         12,27,462 cs                                                           [100.00%]
          2,41,957 migrations                                                   [100.00%]
       1,68,01,713 faults                                                       [100.00%]
    7,99,35,29,041 cache-misses
            98,808 migrate:mm_migrate_pages                                     [100.00%]

    1407.690148814 seconds time elapsed

numa02

Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

            63,065 cs                                                           [100.00%]
            14,364 migrations                                                   [100.00%]
          2,08,118 faults                                                       [100.00%]
      25,32,59,404 cache-misses
                12 migrate:mm_migrate_pages                                     [100.00%]

      63.840827219 seconds time elapsed

(3.14-rc6 with fix)
-------------------
numa01

Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

          9,68,911 cs                                                           [100.00%]
          1,01,414 migrations                                                   [100.00%]
         88,38,697 faults                                                       [100.00%]
    4,42,92,51,042 cache-misses
          4,25,060 migrate:mm_migrate_pages                                     [100.00%]

     685.965331189 seconds time elapsed

numa02

Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

            17,543 cs                                                           [100.00%]
             2,962 migrations                                                   [100.00%]
          1,17,843 faults                                                       [100.00%]
      11,80,61,644 cache-misses
            12,358 migrate:mm_migrate_pages                                     [100.00%]

      20.380132343 seconds time elapsed

Signed-off-by: Srikar Dronamraju <[email protected]>
Cc: Liu Ping Fan <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

madvise: correct the comment of MADV_DODUMP flag

s/MADV_NODUMP/MADV_DONTDUMP/

Signed-off-by: Zhang Yanfei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/readahead.c: inline ra_submit

Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
ra_submit with a single function call.

Move ra_submit to internal.h and inline it to save some stack. Thanks
to Andrew Morton for commenting different versions.

Signed-off-by: Fabian Frederick <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: hugetlb: fix softlockup when a large number of hugepages are freed.

When I decrease the value of nr_hugepages in procfs a lot, a softlockup
happens.  It is because there is no chance of a context switch during this
process.

On the other hand, when I allocate a large number of hugepages, there is
some chance of a context switch.  Hence a softlockup doesn't happen during
this process.  So it's necessary to add a context switch to the freeing
process, the same as in the allocating process, to avoid the softlockup.

When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
process occupied a CPU for over 150 seconds and the following softlockup
message appeared twice or more.

$ echo 6000000 > /proc/sys/vm/nr_hugepages
$ cat /proc/sys/vm/nr_hugepages
6000000
$ grep ^Huge /proc/meminfo
HugePages_Total:   6000000
HugePages_Free:    6000000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
$ echo 0 > /proc/sys/vm/nr_hugepages

BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
  free_pool_huge_page+0xb8/0xd0
  set_max_huge_pages+0x128/0x190
  hugetlb_sysctl_handler_common+0x113/0x140
  hugetlb_sysctl_handler+0x1e/0x20
  proc_sys_call_handler+0x97/0xd0
  proc_sys_write+0x14/0x20
  vfs_write+0xb8/0x1a0
  sys_write+0x51/0x90
  __audit_syscall_exit+0x265/0x290
  system_call_fastpath+0x16/0x1b

I have not confirmed this problem with upstream kernels because I am not
able to prepare a machine equipped with 12TB of memory now.  However, I
confirmed that the time required was directly proportional to the number
of hugepages being freed.

I measured required times on a smaller machine.  It showed 130-145
hugepages decreased in a millisecond.

  Amount of decreasing     Required time      Decreasing rate
  hugepages                     (msec)         (pages/msec)
  ------------------------------------------------------------
  10,000 pages == 20GB         70 -  74          135-142
  30,000 pages == 60GB        208 - 229          131-144

This means that freeing 6TB of hugepages will trigger a softlockup with the
default threshold of 20 seconds, at this rate.
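
The estimate can be reproduced with the small sketch below.  The helper
names are hypothetical; the rates are the measured 130-145 pages/msec
from the table and the page size is the 2048 kB reported above.

```c
/*
 * Estimated time, in milliseconds, to free `pages' hugepages at a given
 * rate (hugepages freed per millisecond).  Integer division rounds down.
 */
static unsigned long demo_free_time_ms(unsigned long pages,
				       unsigned long pages_per_ms)
{
	return pages / pages_per_ms;
}

/* Number of 2048 kB hugepages in `tb' terabytes (1 TB = 1024 * 1024 MB). */
static unsigned long demo_pages_in_tb(unsigned long tb)
{
	return tb * 1024UL * 1024UL / 2UL;
}
```

Even at the fastest measured rate of 145 pages/msec, the 3,145,728
hugepages that make up 6TB take roughly 21.7 seconds to free, which is
above the 20 second threshold.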

Signed-off-by: Masayoshi Mizuma <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memblock.c: use PFN_PHYS()

Replace ((phys_addr_t)(x) << PAGE_SHIFT) with the PFN_PHYS() macro.
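
For illustration, a user-space sketch of what such a macro does.  The
DEMO_* names, the 64-bit address type and the page shift of 12 are
assumptions of this sketch, not taken from the patch.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12		/* assumed: 4096-byte pages */

typedef uint64_t demo_phys_addr_t;	/* assumed: 64-bit physical addresses */

/*
 * Convert a page frame number to a physical address.  The cast happens
 * before the shift so that a 32-bit pfn cannot overflow.
 */
#define DEMO_PFN_PHYS(x)	((demo_phys_addr_t)(x) << DEMO_PAGE_SHIFT)
```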

Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memblock: use for_each_memblock()

This is a small cleanup.

Signed-off-by: Emil Medve <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove unused arg of set_page_dirty_balance()

There's only one caller of set_page_dirty_balance() and that will call it
with page_mkwrite == 0.

The page_mkwrite argument has been unused since commit b827e496c893 "mm:
close page_mkwrite races".

Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: try_to_unmap_cluster() should lock_page() before mlocking

A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
the pages other than its check_page parameter (which is already locked).

The BUG_ON in mlock_vma_page() is not documented and its purpose is
somewhat unclear, but apparently it serializes against page migration,
which could otherwise fail to transfer the PG_mlocked flag.  This would
not be fatal, as the page would be eventually encountered again, but
NR_MLOCK accounting would become distorted nevertheless.  This patch adds
a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
effect.

The call site try_to_unmap_cluster() is fixed so that for page !=
check_page, trylock_page() is attempted (to avoid possible deadlocks as we
already have check_page locked) and mlock_vma_page() is performed only
upon success.  If the page lock cannot be obtained, the page is left
without PG_mlocked, which is again not a problem in the whole unevictable
memory design.
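
The trylock idea can be sketched in userspace with a pthread mutex standing
in for the page lock.  Everything named demo_* is hypothetical and only
models the pattern described above; it is not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: a lock and the mlocked flag. */
struct demo_page {
	pthread_mutex_t lock;
	bool mlocked;
};

void demo_page_init(struct demo_page *page)
{
	pthread_mutex_init(&page->lock, NULL);
	page->mlocked = false;
}

/*
 * Mark a page other than check_page as mlocked, but only if its lock can
 * be taken without waiting.  Returns true if the flag was set.
 */
bool demo_try_mlock(struct demo_page *page, struct demo_page *check_page)
{
	assert(page != check_page);	/* check_page is already locked by the caller */

	if (pthread_mutex_trylock(&page->lock) != 0)
		return false;		/* contended: skip the page, never block */

	page->mlocked = true;
	pthread_mutex_unlock(&page->lock);
	return true;
}
```

Skipping a contended page instead of sleeping on its lock is what avoids the
deadlock: the caller already holds one page lock and never waits for a
second one.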

Signed-off-by: Vlastimil Babka <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: page_alloc: spill to remote nodes before waking kswapd

On NUMA systems, a node may start thrashing cache or even swap anonymous
pages while there are still free pages on remote nodes.

This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
of fair allocation policy").

Before those changes, the allocator would first try all allowed zones,
including those on remote nodes, before waking any kswapds.  But now,
the allocator fastpath doubles as the fairness pass, which in turn can
only consider the local node to prevent remote spilling based on
exhausted fairness batches alone.  Remote nodes are only considered in
the slowpath, after the kswapds are woken up.  But if remote nodes still
have free memory, kswapd should not be woken to rebalance the local node
or it may thrash cache or swap prematurely.

Fix this by adding one more unfair pass over the zonelist that is
allowed to spill to remote nodes after the local fairness pass fails but
before entering the slowpath and waking the kswapds.

This also gets rid of the GFP_THISNODE exemption from the fairness
protocol because the unfair pass is no longer tied to kswapd, which
GFP_THISNODE is not allowed to wake up.

However, because remote spills can be more frequent now - we prefer them
over local kswapd reclaim - the allocation batches on remote nodes could
underflow more heavily.  When resetting the batches, use
atomic_long_read() directly instead of zone_page_state() to calculate the
delta as the latter filters negative counter values.
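
A small model of why the raw read matters when the batch has underflowed.
The struct and function names are hypothetical; the only behaviour taken
from the text is that the filtered read reports negative values as zero.

```c
#include <assert.h>

/* Hypothetical model of one per-zone batch counter that may go negative. */
struct demo_zone {
	long alloc_batch;	/* remaining fairness batch, may underflow */
	long batch_target;	/* value the batch is reset to */
};

/* Filtered read: negative values are reported as 0. */
long demo_filtered_read(const struct demo_zone *zone)
{
	return zone->alloc_batch < 0 ? 0 : zone->alloc_batch;
}

/* Reset with the raw value: an underflowed batch is fully refilled. */
void demo_reset_batch(struct demo_zone *zone)
{
	long delta = zone->batch_target - zone->alloc_batch;

	zone->alloc_batch += delta;
	assert(zone->alloc_batch == zone->batch_target);
}

/* Reset with the filtered value: an underflow of N leaves the batch N short. */
void demo_reset_batch_filtered(struct demo_zone *zone)
{
	long delta = zone->batch_target - demo_filtered_read(zone);

	zone->alloc_batch += delta;
}
```

With a batch of -5 and a target of 32, the filtered variant computes a delta
of 32 and ends at 27, while the raw variant computes 37 and ends at 32.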

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: <[email protected]> [3.12+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: rename high level charging functions

mem_cgroup_newpage_charge is used only for charging anonymous memory so
it is better to rename it to mem_cgroup_charge_anon.

mem_cgroup_cache_charge is used for file backed memory so rename it to
mem_cgroup_charge_file.

Signed-off-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: sanitize __mem_cgroup_try_charge() call protocol

Some callsites pass a memcg directly, some callsites pass an mm that
then has to be translated to a memcg. This makes for a terrible
function interface.

Just push the mm-to-memcg translation into the respective callsites and
always pass a memcg to mem_cgroup_try_charge().

[[email protected]: add charge mm helper]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: do not replicate get_mem_cgroup_from_mm in __mem_cgroup_try_charge

__mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
which came without a memcg. The only reason seems to be a tiny
optimization when css_tryget is not called if the charge can be consumed
from the stock. Nevertheless css_tryget is very cheap since it has been
reworked to use per-cpu counting so this optimization doesn't give us
anything these days.

So let's drop the code duplication so that the code is more readable.

Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: get_mem_cgroup_from_mm()

Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
owner is exiting, just return root_mem_cgroup. This makes sense for all
callsites and gets rid of some of them having to fallback manually.

[[email protected]: fix warnings]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()

Users pass either a mm that has been established under task lock, or use
a verified current->mm, which means the task can't be exiting.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg: push !mm handling out to page cache charge function

Only page cache charges can happen without an mm context, so push this
special case out of the inner core and into the cache charge function.

An ancient comment explains that the mm can also be NULL in case the
task is currently being migrated, but that is not actually true with the
current case, so just remove it.

Signed-off-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg: inline mem_cgroup_charge_common()

mem_cgroup_charge_common() is used by both cache and anon pages, but
most of its body only applies to anon pages and the remainder is not
worth having in a separate function.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg: remove mem_cgroup_move_account_page_stat()

It used to disable preemption and run sanity checks but now it's only
taking a number out of one percpu counter and putting it into another.
Do this directly in the callsite and save the indirection.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg: remove unnecessary preemption disabling

lock_page_cgroup() disables preemption, remove explicit preemption
disabling for code paths holding this lock.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: use 'const char *' instead of 'char *' for reason in dump_page()

I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
warning:

warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]

Let's convert 'reason' to 'const char *' in dump_page() and friends: we
shouldn't modify it anyway.
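
For illustration, a userspace sketch of the same situation: __func__ is a
const character array, so a function that only reads its string argument
should take 'const char *'.  demo_dump() and demo_caller() are hypothetical
stand-ins and not the kernel functions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for dump_page(): only 'reason' matters here. */
size_t demo_dump(const char *reason)
{
	assert(reason != NULL);
	return strlen(reason);	/* read-only use, so const is the honest type */
}

/* __func__ is a const char array; a 'char *' parameter would drop the const. */
size_t demo_caller(void)
{
	return demo_dump(__func__);
}
```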

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc.c: enhance vm_map_ram() comment

vm_map_ram() has a fragmentation problem when it cannot purge a
chunk (i.e., 4M of address space) if there is a pinning object in that
address space.  So it could consume all VMALLOC address space easily.

We can fix the fragmentation problem by using vmap() instead of
vm_map_ram(), but vmap() is known to be slow compared to vm_map_ram().
Minchan said vm_map_ram() is 5 times faster than vmap() in his tests.  So I
thought we should fix the fragmentation problem of vm_map_ram() because our
proprietary GPU driver has used it heavily.

On second thought, it's not easy because we would have to reuse freed space
to solve the problem, and that could require more IPIs and bitmap operations
to search for a hole.  It could undermine the API's goal, which is very fast
mapping.  And the fragmentation problem wouldn't even show up on 64-bit
machines.

Another option is for the user to separate long-lived and short-lived
objects and use vmap() for the long-lived ones but vm_map_ram() for the
short-lived ones.  If we inform the user about the characteristics of
vm_map_ram(), the user can choose one according to the page lifetime.

Let's add a note about this for users.

[[email protected]: tweak comment text]
Signed-off-by: Gioh Kim <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix 'ERROR: do not initialise globals to 0 or NULL' and coding style

Signed-off-by: Choi Gi-yong <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mempool: add unlikely and likely hints

Add unlikely and likely hints to the function mempool_free. It lays out
the code in such a way that the common path is executed straight through
and saves a cache line.
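
A sketch of how such hints are typically written for GCC-compatible
compilers.  The likely()/unlikely() definitions follow the usual
__builtin_expect() pattern; struct demo_pool and demo_pool_free() are a
hypothetical simplification and not the mempool code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints built on the GCC/Clang __builtin_expect() builtin. */
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define DEMO_POOL_SLOTS 4

/* Hypothetical pool: a small stack of reserved elements. */
struct demo_pool {
	void *elements[DEMO_POOL_SLOTS];
	int curr_nr;	/* elements currently held in reserve */
	int min_nr;	/* reserve size to maintain */
};

/*
 * Returns 1 if the element was kept to refill the reserve, 0 if the caller
 * should hand it back to the underlying allocator (the common case).
 */
int demo_pool_free(struct demo_pool *pool, void *element)
{
	assert(pool->min_nr <= DEMO_POOL_SLOTS);

	if (unlikely(element == NULL))
		return 0;

	/* Refilling is the rare path; the hint keeps it off the fall-through. */
	if (unlikely(pool->curr_nr < pool->min_nr)) {
		pool->elements[pool->curr_nr++] = element;
		return 1;
	}
	return 0;
}
```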

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: determine isolation mode only once

The conditions that control the isolation mode in
isolate_migratepages_range() do not change during the iteration, so
extract them out and only define the value once.

This actually does have an effect, gcc doesn't optimize it itself because
of cc->sync.
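
The shape of the change, as a simplified sketch: the mode depends only on
values that stay fixed for the whole scan, so it is computed once before
the loop.  The flag values, struct demo_control and demo_isolate_range()
are assumptions for this example; the real loop body is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen for this sketch only. */
#define DEMO_MODE_ASYNC 0x1u
#define DEMO_MODE_UNEVICTABLE 0x2u

struct demo_control {
	bool sync;	/* read through a pointer inside the scan */
};

/* Visit [low_pfn, end_pfn) and report the mode that was used throughout. */
unsigned long demo_isolate_range(const struct demo_control *cc, bool unevictable,
				 unsigned long low_pfn, unsigned long end_pfn,
				 unsigned int *mode_out)
{
	/* Loop-invariant: computed once instead of on every iteration. */
	const unsigned int mode = (!cc->sync ? DEMO_MODE_ASYNC : 0) |
				  (unevictable ? DEMO_MODE_UNEVICTABLE : 0);
	unsigned long visited = 0;

	assert(low_pfn <= end_pfn);
	for (unsigned long pfn = low_pfn; pfn < end_pfn; pfn++)
		visited++;	/* a real loop would isolate the page using 'mode' */

	*mode_out = mode;
	return visited;
}
```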

Signed-off-by: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

res_counter: remove interface for locked charging and uncharging

The res_counter_{charge,uncharge}_locked() variants are not used in the
kernel outside of the resource counter code itself, so remove the
interface.

Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Tim Hockin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, mempolicy: remove per-process flag

PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
There's no significant performance degradation to checking
current->mempolicy rather than current->flags & PF_MEMPOLICY in the
allocation path, especially since this is considered unlikely().

Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
64GB of memory and without a mempolicy:

threads before after
16 1249409 1244487
32 1281786 1246783
48 1239175 1239138
64 1244642 1241841
80 1244346 1248918
96 1266436 1254316
112 1307398 1312135
128 1327607 1326502

Per-process flags are a scarce resource so we should free them up whenever
possible and make them available. We'll be using it shortly for memcg oom
reserves.

Signed-off-by: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Tim Hockin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, mempolicy: rename slab_node for clarity

slab_node() is actually a mempolicy function, so rename it to
mempolicy_slab_node() to make it clearer that it is used for processes with
mempolicies.

At the same time, clean up its code by saving numa_mem_id() in a local
variable (since we require a node with memory, not just any node) and
remove an obsolete comment that assumes the mempolicy is actually passed
into the function.

Signed-off-by: David Rientjes <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Tim Hockin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fork: collapse copy_flags into copy_process

copy_flags() does not use the clone_flags formal and can be collapsed
into copy_process() for cleaner code.

Signed-off-by: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Tim Hockin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: use macros from compiler.h instead of __attribute__((...))

To increase compiler portability there is <linux/compiler.h>, which
provides convenience macros for various gcc constructs, e.g. __weak for
__attribute__((weak)).  I've replaced all instances of gcc attributes with
the right macro in the memory management (/mm) subsystem.

[[email protected]: while-we're-there consistency tweaks]
Signed-off-by: Gideon Israel Dsouza <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: per-thread vma caching

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
so further comparison with other approaches was needed.  There are
two things to consider when dealing with this: the cache hit rate and
the latency of find_vma().  Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching scheme can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme improves on the ~50% hit rate just by adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while the baseline hit
   rate is just about non-existent.  The cycle count can fluctuate
   anywhere from ~60 to ~116 billion for the baseline scheme, but this
   approach reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+
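
The per-thread cache described above can be modelled in a few lines of
userspace C.  This is a sketch under stated assumptions: four slots, 4 KiB
pages, and hypothetical demo_* names.  It is not the vmacache code, and
sequence number wraparound handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define DEMO_CACHE_SIZE 4	/* assumed slot count, must be a power of two */

struct demo_vma { unsigned long start, end; };	/* covers [start, end) */
struct demo_mm { uint32_t seqnum; };		/* bumped to invalidate all caches */

struct demo_thread_cache {
	uint32_t seqnum;	/* mm seqnum the slots were filled under */
	struct demo_vma *slots[DEMO_CACHE_SIZE];
};

/* Replacement policy: the slot is chosen from the page number of the address. */
static unsigned int demo_slot(unsigned long addr)
{
	return (addr >> DEMO_PAGE_SHIFT) & (DEMO_CACHE_SIZE - 1);
}

/* Lazy flush: a stale sequence number means every slot is invalid. */
static void demo_validate(struct demo_thread_cache *tc, const struct demo_mm *mm)
{
	if (tc->seqnum != mm->seqnum) {
		for (int i = 0; i < DEMO_CACHE_SIZE; i++)
			tc->slots[i] = NULL;
		tc->seqnum = mm->seqnum;
	}
}

struct demo_vma *demo_cache_find(struct demo_thread_cache *tc,
				 const struct demo_mm *mm, unsigned long addr)
{
	demo_validate(tc, mm);
	for (int i = 0; i < DEMO_CACHE_SIZE; i++) {
		struct demo_vma *vma = tc->slots[i];

		if (vma && vma->start <= addr && addr < vma->end)
			return vma;
	}
	return NULL;	/* miss: the caller falls back to the full lookup */
}

void demo_cache_update(struct demo_thread_cache *tc, const struct demo_mm *mm,
		       unsigned long addr, struct demo_vma *vma)
{
	assert(vma->start <= addr && addr < vma->end);
	demo_validate(tc, mm);
	tc->slots[demo_slot(addr)] = vma;
}

/* Invalidation is a single increment; wraparound handling is omitted here. */
void demo_cache_invalidate(struct demo_mm *mm)
{
	mm->seqnum++;
}
```

Invalidation costs one increment regardless of how many threads share the
address space, because each thread flushes its own slots the next time it
notices a stale sequence number.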

[[email protected]: fix nommu build, per Davidlohr]
[[email protected]: document vmacache_valid() logic]
[[email protected]: attempt to untangle header files]
[[email protected]: add vmacache_find() BUG_ON]
[[email protected]: add vmacache_valid_mm() (from Oleg)]
[[email protected]: coding-style fixes]
[[email protected]: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Tested-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: implement ->map_pages for shmem/tmpfs

In shmem/tmpfs we also use the generic filemap_map_pages(); the additional
checking does not seem to justify a separate version of map_pages for it.

Signed-off-by: Ning Qu <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: add debugfs tunable for fault_around_order

Let's allow people to tweak faultaround at runtime.

[[email protected]: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ning Qu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cleanup size checks in filemap_fault() and filemap_map_pages()

Minor cleanups:
- 'size' variable is now in bytes, not pages;
- use round_up(): it should be easier to read.
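
A sketch of the rounding itself, assuming a 4096-byte page and using
hypothetical demo_* names.  The bit trick only works when the second
argument is a power of two, which the assert checks.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed page size */

/* Round x up to the next multiple of y, where y is a power of two. */
unsigned long demo_round_up(unsigned long x, unsigned long y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return ((x - 1) | (y - 1)) + 1;
}

/* File size rounded up to whole pages, still expressed in bytes. */
unsigned long demo_size_in_bytes(unsigned long i_size)
{
	return demo_round_up(i_size, DEMO_PAGE_SIZE);
}
```

For example, an 89065-byte file rounds up to 90112 bytes, which is 22 pages
of 4096 bytes.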

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ning Qu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: implement ->map_pages for page cache

filemap_map_pages() is a generic implementation of ->map_pages() for
filesystems that use the page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if the
filesystem uses filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ning Qu <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: introduce vm_ops->map_pages()

Here's a new version of the faultaround patchset.  It took a while to tune
it and collect performance data.

The first patch adds a new callback, ->map_pages, to vm_operations_struct.

->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from
"pgoff" to "max_pgoff".  ->map_pages() is called with the page table
locked and must not block.  If it's not possible to reach a page without
blocking, the filesystem should skip it.  The filesystem should use
do_set_pte() to set up the page table entry.  A pointer to the entry
associated with offset "pgoff" is passed in the "pte" field of the vm_fault
structure.  Pointers to entries for other offsets should be calculated
relative to "pte".

Currently the VM uses ->map_pages only on the read page fault path.  We try
to map FAULT_AROUND_PAGES at a time.  FAULT_AROUND_PAGES is 16 for now.
Performance data for different FAULT_AROUND_ORDER values is below.
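
One plausible way to pick the range of pages to map around a fault, shown
as a sketch only: take the aligned block of FAULT_AROUND_PAGES pages that
contains the fault address and clip it to the mapping.  The 4 KiB page size
and the demo_* names are assumptions, and the actual patch may choose and
clip its window differently.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define DEMO_PAGE_SIZE (1UL << DEMO_PAGE_SHIFT)
#define FAULT_AROUND_ORDER 4
#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)	/* 16, as in the text */

struct demo_window { unsigned long start, end; };	/* byte addresses, [start, end) */

/*
 * Pick the aligned FAULT_AROUND_PAGES-sized block containing 'address',
 * clipped to the mapping [vm_start, vm_end).
 */
struct demo_window demo_fault_window(unsigned long address,
				     unsigned long vm_start, unsigned long vm_end)
{
	const unsigned long span = FAULT_AROUND_PAGES * DEMO_PAGE_SIZE;
	struct demo_window w;

	assert(vm_start <= address && address < vm_end);
	w.start = address & ~(span - 1);
	w.end = w.start + span;
	if (w.start < vm_start)
		w.start = vm_start;
	if (w.end > vm_end)
		w.end = vm_end;
	return w;
}
```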

TODO:
- implement ->map_pages() for shmem/tmpfs;
- modify get_user_pages() to be able to use ->map_pages() and implement
   mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

=========================================================================
Tested on 4-socket machine (120 threads) with 128GiB of RAM.

A few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
somewhere between 3 and 5. Let's say 4 :)

Linux build (make -j60)
FAULT_AROUND_ORDER  Baseline       1              3              4              5              7              9
  minor-faults      283,301,572    247,151,987    212,215,789    204,772,882    199,568,944    194,703,779    193,381,485
  time, seconds     151.227629483  153.920996480  151.356125472  150.863792049  150.879207877  151.150764954  151.450962358

Linux rebuild (make -j60)
FAULT_AROUND_ORDER  Baseline       1              3              4              5              7              9
  minor-faults      5,396,854      4,148,444      2,855,286      2,577,282      2,361,957      2,169,573      2,112,643
  time, seconds     27.404543757   27.559725591   27.030057426   26.855045126   26.678618635   26.974523490   26.761320095

Git test suite (make -j60 test)
FAULT_AROUND_ORDER  Baseline       1              3              4              5              7              9
  minor-faults      129,591,823    99,200,751     66,106,718     57,606,410     51,510,808     45,776,813     44,085,515
  time, seconds     66.087215026   64.784546905   64.401156567   65.282708668   66.034016829   66.793780811   67.237810413

Two synthetic tests: access every word in file in sequential/random order.
It doesn't improve much after FAULT_AROUND_ORDER == 4.

Sequential access 16GiB file
FAULT_AROUND_ORDER  Baseline       1              3              4              5              7              9
1 thread
  minor-faults      4,195,437      2,098,275      525,068        262,251        131,170        32,856         8,282
  time, seconds     7.250461742    6.461711074    5.493859139    5.488488147    5.707213983    5.898510832    5.109232856
8 threads
  minor-faults      33,557,540     16,892,728     4,515,848      2,366,999      1,423,382      442,732        142,339
  time, seconds     16.649304881   9.312555263    6.612490639    6.394316732    6.669827501    6.75078944     6.371900528
32 threads
  minor-faults      134,228,222    67,526,810     17,725,386     9,716,537      4,763,731      1,668,921      537,200
  time, seconds     49.164430543   29.712060103   12.938649729   10.175151004   11.840094583   9.594081325    9.928461797
60 threads
  minor-faults      251,687,988    126,146,952    32,919,406     18,208,804     10,458,947     2,733,907      928,217
  time, seconds     86.260656897   49.626551828   22.335007632   17.608243696   16.523119035   16.339489186   16.326390902
120 threads
  minor-faults      503,352,863    252,939,677    67,039,168     35,191,827     19,170,091     4,688,357      1,471,862
  time, seconds     124.589206333  79.757867787   39.508707872   32.167281632   29.972989292   28.729834575   28.042251622

Random access 1GiB file
FAULT_AROUND_ORDER  Baseline       1              3              4              5              7              9
1 thread
  minor-faults      262,636        132,743        34,369         17,299         8,527          3,451          1,222
  time, seconds     15.351890914   16.613802482   16.569227308   15.179220992   16.557356122   16.578247824   15.365266994
8 threads
  minor-faults      2,098,948      1,061,871      273,690        154,501        87,110         25,663         7,384
  time, seconds     15.040026343   15.096933500   14.474757288   14.289129964   14.411537468   14.296316837   14.395635804
32 threads
  minor-faults      8,390,734      4,231,023      1,054,432      528,847        269,242        97,746         26,881
  time, seconds     20.430433109   21.585235358   22.115062928   14.872878951   14.880856305   14.883370649   14.821261690
60 threads
  minor-faults      15,733,258     7,892,809      1,973,393      988,266        594,789        164,994        51,691
  time, seconds     26.577302548   25.692397770   18.728863715   20.153026398   21.619101933   17.745086260   17.613215273
120 threads
  minor-faults      31,471,111     15,816,616     3,959,209      1,978,685      1,008,299      264,635        96,010
  time, seconds     41.835322703   40.459786095   36.085306105   35.313894834   35.814445675   36.552633793   34.289210594

Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER  Baseline       1              3              4              5              7              9
1 thread
  minor-faults      8,372          8,324          8,270          8,260          8,249          8,239          8,237
  time, seconds     0.039892712    0.045369149    0.051846126    0.063681685    0.079095975    0.17652406     0.541213386
8 threads
  minor-faults      65,731         65,681         65,628         65,620         65,608         65,599         65,596
  time, seconds     0.124159196    0.488600638    0.156854426    0.191901957    0.242631486    0.543569456    1.677303984
32 threads
  minor-faults      262,388        262,341        262,285        262,276        262,266        262,257        263,183
  time, seconds     0.452421421    0.488600638    0.565020946    0.648229739    0.789850823    1.651584361    5.000361559
60 threads
  minor-faults      491,822        491,792        491,723        491,711        491,701        491,691        491,825
  time, seconds     0.763288616    0.869620515    0.980727360    1.161732354    1.466915814    3.04041448     9.308612938
120 threads
  minor-faults      983,466        983,655        983,366        983,372        983,363        984,083        984,164
  time, seconds     1.595846553    1.667902182    2.008959376    2.425380942    2.941368804    5.977807890    18.401846125

This patch (of 2):

Introduce a new vm_ops callback, ->map_pages(), and use it for mapping
easily accessible pages around the fault address.

On a read page fault, if the filesystem provides ->map_pages(), we try to
map up to FAULT_AROUND_PAGES pages around the page fault address in the
hope of reducing the number of minor page faults.

We call ->map_pages first and use ->fault() as a fallback if the page at
the offset is not ready to be mapped (cold page cache or something).

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ning Qu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/lguest/page_tables.c: rename do_set_pte()

"mm: introduce vm_ops->map_pages()" wants to export a do_set_pte() from core
kernel. Rename lguest's do_set_pte() to something more lguest-specific.

Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Rusty Russell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tools/vm/page-types.c: page-cache sniffing feature

After this patch 'page-types' can walk over a file's mappings and
analyze populated page cache pages mostly without disturbing its state.

It maps a chunk of the file, marks the VMA as MADV_RANDOM to turn off
readahead, pokes the VMA via mincore() to determine cached pages, triggers
page faults only for them, and finally gathers information via
pagemap/kpageflags.  Before unmapping, it marks the VMA as MADV_SEQUENTIAL
so that reference bits are ignored.
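
The first steps of that sequence can be sketched with the standard
mmap()/madvise()/mincore() calls.  demo_count_cached_pages() is a
hypothetical helper written for this example, not code from page-types; it
maps the whole file at once and leaves out the pagemap/kpageflags and
MADV_SEQUENTIAL steps.

```c
#define _DEFAULT_SOURCE	/* for madvise() and mincore() declarations on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of 'path' are resident in the page cache.
 * Returns -1 on error or for an empty file.
 */
long demo_count_cached_pages(const char *path)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long cached = -1;
	struct stat st;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fstat(fd, &st) == 0 && st.st_size > 0) {
		size_t len = (size_t)st.st_size;
		size_t pages = (len + (size_t)page_size - 1) / (size_t)page_size;
		void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		unsigned char *vec = malloc(pages);

		if (map != MAP_FAILED && vec != NULL) {
			madvise(map, len, MADV_RANDOM);	/* turn off readahead */
			if (mincore(map, len, vec) == 0) {
				cached = 0;
				for (size_t i = 0; i < pages; i++)
					cached += vec[i] & 1;	/* bit 0: page is resident */
			}
		}
		free(vec);
		if (map != MAP_FAILED)
			munmap(map, len);
	}
	close(fd);
	return cached;
}
```

mincore() only reports residency and does not fault anything in, which is
why it can be used to decide which pages to touch afterwards.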

usage: page-types -f <path>

If <path> is a directory it will analyse all files in all subdirectories.

Neither symlinks nor mount points are followed.  Hardlinks aren't
handled; they'll be dumped as many times as they are found.  The recursive
walk brings all dentries into the dcache and populates the page cache of
block devices, aka 'Buffers'.

It's probably worth adding an ioctl for dumping a file's page cache as an
array of PFNs as a replacement for this hackish juggling with
mmap/madvise/mincore/pagemap.  The recursive walk could also be replaced
with dumping cached inodes via some ioctl or debugfs interface followed
by opening them via open_by_handle_at; this would fix hardlink
handling and the unneeded population of dcache and buffers.  This interface
might be used as a data source for constructing readahead plans and for
background optimizations of actively used files.

collateral changes:
+ fix 64-bit LFS: define _FILE_OFFSET_BITS instead of _LARGEFILE64_SOURCE
+ replace lseek + read with single pread
+ make show_page_range() reusable after flush

usage example:

  ~/src/linux/tools/vm$ sudo ./page-types -L -f page-types
  foffset offset    flags
  page-types       Inode: 2229277       Size: 89065 (22 pages)
  Modify: Tue Feb 25 12:00:59 2014 (162 seconds ago)
  Access: Tue Feb 25 12:01:00 2014 (161 seconds ago)
  0       3cbf3b     __RU_lA____M________________________
  1       38946a     __RU_lA____M________________________
  2       1a3cec     __RU_lA____M________________________
  3       1a8321     __RU_lA____M________________________
  4       3af7cc     __RU_lA____M________________________
  5       1ed532     __RU_lA_____________________________
  6       2e436a     __RU_lA_____________________________
  7       29a35e     ___U_lA_____________________________
  8       2de86e     ___U_lA_____________________________
  9       3bdfb4     ___U_lA_____________________________
  10      3cd8a3     ___U_lA_____________________________
  11      2afa50     ___U_lA_____________________________
  12      2534c2     ___U_lA_____________________________
  13      1b7a40     ___U_lA_____________________________
  14      17b0be     ___U_lA_____________________________
  15      392b0c     ___U_lA_____________________________
  16      3ba46a     __RU_lA_____________________________
  17      397dc8     ___U_lA_____________________________
  18      1f2a36     ___U_lA_____________________________
  19      21fd30     __RU_lA_____________________________
  20      2c35ba     __RU_l______________________________
  21      20f181     __RU_l______________________________

               flags page-count   MB  symbolic-flags                        long-symbolic-flags
  0x000000000000002c          2    0  __RU_l______________________________  referenced,uptodate,lru
  0x0000000000000068         11    0  ___U_lA_____________________________  uptodate,lru,active
  0x000000000000006c          4    0  __RU_lA_____________________________  referenced,uptodate,lru,active
  0x000000000000086c          5    0  __RU_lA____M________________________  referenced,uptodate,lru,active,mmap
               total         22    0

  ~/src/linux/tools/vm$ sudo ./page-types -f /
               flags page-count     MB  symbolic-flags                        long-symbolic-flags
  0x0000000000000028      21761     85  ___U_l______________________________  uptodate,lru
  0x000000000000002c     127279    497  __RU_l______________________________  referenced,uptodate,lru
  0x0000000000000068      74160    289  ___U_lA_____________________________  uptodate,lru,active
  0x000000000000006c      84469    329  __RU_lA_____________________________  referenced,uptodate,lru,active
  0x000000000000007c          1      0  __RUDlA_____________________________  referenced,uptodate,dirty,lru,active
  0x0000000000000228        370      1  ___U_l___I__________________________  uptodate,lru,reclaim
  0x0000000000000828         49      0  ___U_l_____M________________________  uptodate,lru,mmap
  0x000000000000082c        126      0  __RU_l_____M________________________  referenced,uptodate,lru,mmap
  0x0000000000000868        137      0  ___U_lA____M________________________  uptodate,lru,active,mmap
  0x000000000000086c      12890     50  __RU_lA____M________________________  referenced,uptodate,lru,active,mmap
               total     321242   1254

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: disable split page table lock for !MMU

There's no reason to enable split page table lock if we don't have page
tables.

It also triggers a build error, at least on ARM, since we don't define
pmd_page() for !MMU.

  In file included from arch/arm/kernel/asm-offsets.c:14:0:
  include/linux/mm.h: In function 'pte_lockptr':
  include/linux/mm.h:1392:2: error: implicit declaration of function 'pmd_page' [-Werror=implicit-function-declaration]
  include/linux/mm.h:1392:2: warning: passing argument 1 of 'ptlock_ptr' makes pointer from integer without a cast [enabled by default]
  include/linux/mm.h:1384:27: note: expected 'struct page *' but argument is of type 'int'

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
is always zero. Not only does this look strange, it is also unnecessary
because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMAs. It
also adds a prctl control which allows us to set the THP disable bit in
mm->def_flags so that VMAs will pick up the setting as they are created.

Signed-off-by: Alex Thorlton <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (i.e.  statically allocated data).  This patch
allows us to do just that, without affecting other jobs running on the
system.

We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from.  If many remote nodes need to do work on
that same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.

This patch is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov.  The main goal here is to
add a prctl switch to allow us to disable THP on a per mm_struct
basis.
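
A minimal user-space sketch of such a switch follows. It assumes the
option names PR_SET_THP_DISABLE and PR_GET_THP_DISABLE (values 41 and
42), which is what mainline's <linux/prctl.h> is understood to call them;
check your own headers. The helper names thp_disable_set() and
thp_disable_get() are made up for illustration.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Assumed values, defined here only for headers that lack them. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/* Set (1) or clear (0) the THP disable flag of the calling mm. */
static int thp_disable_set(int disable)
{
	/* Unused arguments are passed as zero. */
	return prctl(PR_SET_THP_DISABLE, disable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/* Returns the current flag value, or -1 on error. */
static int thp_disable_get(void)
{
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

A wrapper in the spirit of the one used for the test data would call
thp_disable_set(1) and then exec the real job, relying on the flag being
carried over to the new program. Kernels without this support return -1
from both helpers.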

Here's a bit of test data with the new patch in place...

First with the flag unset:

  # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
  Setting thp_disabled for this task...
  thp_disable: 0
  Set thp_disabled state to 0
  Process pid = 18027

                                                                                                                       PF/
                                  MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
  TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
   512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23

   Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
           1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
               7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
           1,698,932 page-faults               #    0.000 M/sec
  355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
  536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
  409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
  148,286,797,266,411 instructions             #    0.42  insns per cycle
                                               #    3.62  stalled cycles per insn [100.00%]
  27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
       1,188,655,196 branch-misses             #    0.00% of all branches

       427.001706337 seconds time elapsed

Now with the flag set:

  # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
  Setting thp_disabled for this task...
  thp_disable: 1
  Set thp_disabled state to 1
  Process pid = 144957

                                                                                                                       PF/
                                  MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
  TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
   512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23

   Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
             534,205 context-switches          #    0.000 M/sec                   [100.00%]
               4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
          63,133,119 page-faults               #    0.000 M/sec
  147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
  200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
  105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
  180,916,213,503,160 instructions             #    1.22  insns per cycle
                                               #    1.11  stalled cycles per insn [100.00%]
  26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
         714,066,351 branch-misses             #    0.00% of all branches

       216.196778807 seconds time elapsed

As with previous versions of the patch, we're getting about a 2x
performance increase here.  Here's a link to the test case I used, along
with the little wrapper to activate the flag:

  http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

This patch (of 3):

Revert commit 8e72033f2a48 and add in code to fix up any issues caused
by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of
this code, and providing justification for the revert:

  "The intent of commit 8e72033f2a48 was to guard against any future
   programming errors that may result in an madvise(MADV_HUGEPAGE) on
   guest mappings, which would crash the kernel.

   Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
   8e72033f2a48 was to be reverted, because that check will also prevent
   a kernel crash in the case described above, it will now send a
   SIGSEGV instead.

   This would now also allow to do the madvise on other parts, if
   needed, so it is a more flexible approach.  One could also say that
   it would have been better to do it this way right from the
   beginning..."

Signed-off-by: Alex Thorlton <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Tested-by: Christian Borntraeger <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: clean-up code on success of balloon isolation

It is just for clean-up to reduce code size and improve readability.
There is no functional change.

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: check pageblock suitability once per pageblock

isolation_suitable() and migrate_async_suitable() are used to make sure
that a pageblock range is fine to be migrated.  There is no need to call
them on every page.  The current code does well when the pageblock is not
suitable, but not when it is suitable.

1) It re-checks isolation_suitable() on each page of a pageblock that was
   already established as suitable.
2) It re-checks migrate_async_suitable() on each page of a pageblock that
   was not entered through the next_pageblock: label, because
   last_pageblock_nr is not otherwise updated.

This patch fixes the situation by 1) calling isolation_suitable() only once
per pageblock and 2) always updating last_pageblock_nr to the pageblock
that was just checked.
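
The resulting control flow can be modelled in user space as below. This
is only a sketch: pageblock_suitable() is a hypothetical stand-in for the
two kernel checks, and the pageblock size of 512 pages is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* assumed size, for the model only */

static unsigned long suitability_checks;	/* counts calls, to show the effect */

/* Hypothetical stand-in for isolation_suitable()/migrate_async_suitable(). */
static bool pageblock_suitable(unsigned long pageblock_nr)
{
	suitability_checks++;
	return (pageblock_nr % 2) == 0;	/* arbitrary rule: even blocks are fine */
}

/* Scan [start_pfn, end_pfn) and return how many pages were considered. */
static unsigned long scan_range(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long last_pageblock_nr = (unsigned long)-1;	/* no block yet */
	unsigned long considered = 0;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		unsigned long pageblock_nr = pfn / PAGEBLOCK_NR_PAGES;

		if (pageblock_nr != last_pageblock_nr) {
			/* Always record the block just checked, suitable or not. */
			last_pageblock_nr = pageblock_nr;
			if (!pageblock_suitable(pageblock_nr)) {
				/* Jump to the block's last pfn; pfn++ moves on. */
				pfn = (pageblock_nr + 1) * PAGEBLOCK_NR_PAGES - 1;
				continue;
			}
		}

		considered++;	/* per-page checks such as PageBuddy() go here */
	}

	return considered;
}
```

Scanning four pageblocks with this model calls the suitability check
four times, once per block, no matter how many pages each block holds.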

Additionally, move the PageBuddy() check after the pageblock unit check,
since the pageblock check is the first thing we should do and this makes
things simpler.

[[email protected]: rephrase commit description]
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: change the timing to check to drop the spinlock

It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
pfn page.  This may result in the situation below while isolating
pages for migration.

1. Try to isolate pfn pages 0x0 ~ 0x200.
2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
   the spinlock.
3. Then, to complete isolating, retry to acquire the lock.

I think that it is better to use the SWAP_CLUSTER_MAX th pfn as the
criterion for dropping the lock.  This does no harm at pfn 0x0 because,
at that time, the locked variable would be false.
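
The difference can be counted with a small user-space model. It is a
sketch under two assumptions: SWAP_CLUSTER_MAX is 32, and every pfn in
the range needs the lock. count_lock_acquisitions() is a hypothetical
name.

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_CLUSTER_MAX 32UL	/* assumed value */

/*
 * Count lock acquisitions while scanning [start, end).
 * offset == 1 models the old test ((pfn + 1) % SWAP_CLUSTER_MAX == 0),
 * offset == 0 models the new test (pfn % SWAP_CLUSTER_MAX == 0).
 */
static unsigned int count_lock_acquisitions(unsigned long start,
					    unsigned long end,
					    unsigned long offset)
{
	unsigned int acquisitions = 0;
	bool locked = false;
	unsigned long pfn;

	for (pfn = start; pfn < end; pfn++) {
		/* Periodically give the lock up; only possible if it is held. */
		if (locked && (pfn + offset) % SWAP_CLUSTER_MAX == 0)
			locked = false;

		/* Simplification: every pfn needs the lock to be isolated. */
		if (!locked) {
			locked = true;
			acquisitions++;
		}
	}

	return acquisitions;
}
```

For the range 0x0 ~ 0x200 the old test takes the lock 17 times and the
new one 16 times: the drop at pfn 0x1ff, followed by an immediate
re-acquire for the final page, goes away.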

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: do not call suitable_migration_target() on every page

suitable_migration_target() checks that a pageblock is suitable as a
migration target.  In isolate_freepages_block(), it is called on every
page and this is inefficient.  So make it called once per pageblock.

suitable_migration_target() also checks whether the page is high-order or
not, but its criterion for high-order is pageblock order.  So calling it
once within a pageblock range is not a problem.

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: disallow high-order page for migration target

The purpose of compaction is to get a high-order page.  Currently, if we
find a high-order page while searching for a migration target page, we
break it into order-0 pages and use them as migration targets.  This is
contrary to the purpose of compaction, so disallow high-order pages from
being used as migration targets.

Additionally, clean up the logic in suitable_migration_target() to
simplify the code.  There are no functional changes from this clean-up.
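
As a rough user-space model of the rule described above (not the kernel
function itself): struct page_model, the migratetype enum and the
pageblock order of 9 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK_ORDER 9U	/* assumed; depends on architecture and config */

enum migratetype_model { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE, MT_CMA };

struct page_model {
	bool buddy;		/* free page held by the buddy allocator */
	unsigned int order;	/* only meaningful when buddy is true */
	enum migratetype_model block_type;
};

static bool suitable_migration_target_model(const struct page_model *page)
{
	/* A free high-order page is what compaction wants to keep intact. */
	if (page->buddy && page->order >= PAGEBLOCK_ORDER)
		return false;

	/* Otherwise only movable-style blocks are acceptable targets. */
	return page->block_type == MT_MOVABLE || page->block_type == MT_CMA;
}
```

With this rule a free pageblock-order page in a movable block is
rejected, while smaller free pages in the same kind of block remain
usable as targets.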

Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: exclude memoryless nodes from zone_reclaim

We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of swap free and tons of anonymous memory
which could be swapped out.  In the end it turned out that the OOM was a
side effect of zone reclaim which wasn't unmapping and swapping out and
so the system was pushed to the OOM.  Although this sounds like a bug
somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
interaction, numactl on the said hardware suggests that zone reclaim
should not have been set in the first place:

  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 2 cpus:
  node 2 size: 7168 MB
  node 2 free: 6019 MB
  node distances:
  node   0   2
  0:  10  40
  2:  40  10

So all the CPUs are associated with Node0 which doesn't have any memory
while Node2 contains all the available memory.  The node distances cause
zone_reclaim_mode to be enabled automatically.

Zone reclaim is intended to keep the allocations local but this doesn't
make any sense on the memoryless nodes.  So let's exclude such nodes from
init_zone_allows_reclaim, which evaluates zone reclaim behavior and
suitable reclaim_nodes.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Nishanth Aravamudan <[email protected]>
Tested-by: Nishanth Aravamudan <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory.c: update comment in unmap_single_vma()

The described issue now occurs inside mmap_region().  Unfortunately, it
is still valid.

Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: do not check compaction_ready on promoted zones

We abort direct reclaim if we find the zone is ready for compaction.
Sometimes the zone is just a promoted highmem zone to force a scan of
highmem, which is not the intended zone the caller wants to allocate a
page from. In this situation, setting aborted_reclaim to tell the
caller to go back and retry the allocation is a waste of time and could
cause a loop in __alloc_pages_slowpath().

This patch does not check compaction_ready() on promoted zones to avoid
the above situation. Only set aborted_reclaim if the caller's intended
zone is ready for compaction.

Signed-off-by: Weijie Yang <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: restore sc->gfp_mask after promoting it to __GFP_HIGHMEM

We promote sc->gfp_mask to __GFP_HIGHMEM to forcibly scan highmem if
there are too many buffer_heads pinning highmem. See cc715d99e5 ("mm:
vmscan: forcibly scan highmem if there are too many buffer_heads pinning
highmem").

This patch restores sc->gfp_mask to the caller's original value after
finishing the scan job, to avoid affecting other invocations from
its upper caller, such as vmpressure_prio() and shrink_slab().
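
The save/restore pattern looks roughly like this in a user-space model.
The struct, the function name and the flag value are placeholders, not
the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

#define MODEL_GFP_HIGHMEM 0x02u	/* placeholder bit for the model */

struct scan_control_model {
	gfp_t gfp_mask;
};

/* Hypothetical stand-in for the zone scanning step. */
static void shrink_zones_model(struct scan_control_model *sc,
			       bool buffer_heads_over_limit)
{
	gfp_t orig_mask = sc->gfp_mask;	/* remember the caller's mask */

	if (buffer_heads_over_limit)
		sc->gfp_mask |= MODEL_GFP_HIGHMEM;	/* force a highmem scan */

	/* ... scan zones using the possibly promoted sc->gfp_mask ... */

	sc->gfp_mask = orig_mask;	/* later users see the original mask */
}
```

Whatever happens during the scan, code that runs afterwards with the
same scan control sees the mask the caller passed in.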

Signed-off-by: Weijie Yang <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: move mmu notifier call from change_protection to change_pmd_range

The NUMA scanning code can end up iterating over many gigabytes of
unpopulated memory, especially in the case of a freshly started KVM
guest with lots of memory.

This results in the mmu notifier code being called even when there are
no mapped pages in a virtual address range. The amount of time wasted
can be enough to trigger soft lockup warnings with very large KVM
guests.

This patch moves the mmu notifier call to the pmd level, which
represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
code is only called from the address in the PMD where present mappings
are first encountered.

The hugetlbfs code is left alone for now; hugetlb mappings are not
relocatable, and as such are left alone by the NUMA code, and should
never trigger this problem to begin with.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reported-by: Xing Gang <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Cc: Sasha Levin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: numa: recheck for transhuge pages under lock during protection changes

Sasha reported the following bug using trinity:

  kernel BUG at mm/mprotect.c:149!
  invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G        W    3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
  task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
  RIP: change_protection_range+0x3b3/0x500
  Call Trace:
    change_protection+0x25/0x30
    change_prot_numa+0x1b/0x30
    task_numa_work+0x279/0x360
    task_work_run+0xae/0xf0
    do_notify_resume+0x8e/0xe0
    retint_signal+0x4d/0x92

The VM_BUG_ON was added in -mm by the patch "mm,numa: reorganize
change_pmd_range".  The race existed without the patch but was just
harder to hit.

The problem is that a transhuge check is made without holding the PTL.
It's possible at the time of the check that a parallel fault clears the
pmd and inserts a new one which then triggers the VM_BUG_ON check.  This
patch removes the VM_BUG_ON but fixes the race by rechecking transhuge
under the PTL when marking page tables for NUMA hinting and bailing if a
race occurred.  It is not a problem for calls to mprotect() as they hold
mmap_sem for write.
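
The check, lock, recheck pattern can be sketched in user space as
follows. A pthread mutex stands in for the PTL and an atomic flag for
the pmd state; struct pmd_model and mark_huge_pmd_numa() are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pmd_model {
	pthread_mutex_t ptl;	/* stands in for the page table lock */
	atomic_bool trans_huge;	/* may be changed by a parallel fault */
	int numa_marked;	/* protected by ptl */
};

/* Returns true if the huge entry was marked, false if we bailed out. */
static bool mark_huge_pmd_numa(struct pmd_model *pmd)
{
	bool marked = false;

	/* Unlocked check: cheap, but only a hint. */
	if (!atomic_load(&pmd->trans_huge))
		return false;

	pthread_mutex_lock(&pmd->ptl);
	/* Recheck under the lock; bail out quietly if we lost the race. */
	if (atomic_load(&pmd->trans_huge)) {
		pmd->numa_marked = 1;
		marked = true;
	}
	pthread_mutex_unlock(&pmd->ptl);

	return marked;
}
```

The point is that losing the race is an expected outcome handled by
returning early, rather than a condition to assert on.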

Signed-off-by: Mel Gorman <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,numa: reorganize change_pmd_range()

Reorganize the order of ifs in change_pmd_range a little, in preparation
for the next patch.

[[email protected]: fix indenting, per David]
Signed-off-by: Rik van Riel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reported-by: Xing Gang <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Sasha Levin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb.c: add NULL check of return value of huge_pte_offset

huge_pte_offset() could return NULL, so we need NULL check to avoid
potential NULL pointer dereferences.

Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ntfs: logging clean-up

- Convert spinlock/static array to va_format (inspired by Joe Perches'
  help on previous logging patches).

- Convert printk(KERN_ERR to pr_warn in __ntfs_warning.

- Convert printk(KERN_ERR to pr_err in __ntfs_error.

- Convert printk(KERN_DEBUG to pr_debug in __ntfs_debug.  (Note that
  __ntfs_debug is still guarded by #if DEBUG)

- Improve !DEBUG to parse all arguments (Joe Perches).

- Sparse pr_foo() conversions in super.c

NTFS, NTFS-fs prefixes as well as 'warning' and 'error' were removed :
pr_foo() automatically adds module name and error level is already
specified.

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Anton Altaparmakov <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
"Two patches to fix fallouts from the kernfs conversion:

  Li's patch to stop leaking cgroup_root refs across multiple mounts and
  the other fixes the 90s hang during shutdown caused by always using
  root's uid/gid for new cgroup dirs and files."

* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: newly created dirs and files should be owned by the creator
  cgroup: fix top cgroup refcnt leak

Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
"The purpose of this single series of commits from Srivatsa S Bhat
  (with a small piece from Gautham R Shenoy) touching multiple
  subsystems that use CPU hotplug notifiers is to provide a way to
  register them that will not lead to deadlocks with CPU online/offline
  operations as described in the changelog of commit 93ae4f978ca7f ("CPU
  hotplug: Provide lockless versions of callback registration
  functions").

  The first three commits in the series introduce the API and document
  it and the rest simply goes through the users of CPU hotplug notifiers
  and converts them to using the new method"

* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
  net/iucv/iucv.c: Fix CPU hotplug callback registration
  net/core/flow.c: Fix CPU hotplug callback registration
  mm, zswap: Fix CPU hotplug callback registration
  mm, vmstat: Fix CPU hotplug callback registration
  profile: Fix CPU hotplug callback registration
  trace, ring-buffer: Fix CPU hotplug callback registration
  xen, balloon: Fix CPU hotplug callback registration
  hwmon, via-cputemp: Fix CPU hotplug callback registration
  hwmon, coretemp: Fix CPU hotplug callback registration
  thermal, x86-pkg-temp: Fix CPU hotplug callback registration
  octeon, watchdog: Fix CPU hotplug callback registration
  oprofile, nmi-timer: Fix CPU hotplug callback registration
  intel-idle: Fix CPU hotplug callback registration
  clocksource, dummy-timer: Fix CPU hotplug callback registration
  drivers/base/topology.c: Fix CPU hotplug callback registration
  acpi-cpufreq: Fix CPU hotplug callback registration
  zsmalloc: Fix CPU hotplug callback registration
  scsi, fcoe: Fix CPU hotplug callback registration
  scsi, bnx2fc: Fix CPU hotplug callback registration
  scsi, bnx2i: Fix CPU hotplug callback registration
  ...

cgroup: newly created dirs and files should be owned by the creator

While converting cgroup to kernfs, 2bd59d48ebfb ("cgroup: convert to
kernfs") accidentally dropped the logic which makes newly created
cgroup dirs and files owned by the current uid / gid.  This broke
cases where cgroup subtree management is delegated to !root as the sub
manager wouldn't be able to create more than a single level of hierarchy
or put tasks into child cgroups it created.

Among other things, this breaks user session management in systemd and
one of the symptoms was a 90s hang during shutdown.  User session
systemd running as the user creates a sub-service to initiate shutdown
and tries to put kill(1) into it but fails because cgroup.procs is
owned by root.  This leads to the 90s hang during shutdown.

Implement cgroup_kn_set_ugid() which sets a kn's uid and gid to those
of the caller and use it from file and dir creation paths.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Linus Torvalds <[email protected]>

staging: rtl8723au: The 8723 only has two paths

The conversion of the driver from the original RTL-provided version
mistakenly changed the code to use four paths, which caused all sorts of
issues. The confusion was caused by the RTL driver having support for
both two and four paths, and in some places having RF_PATH_MAX = 3. At
the same time it kept the data structures hard coded for two paths, in
particular the ones matching the efuse data.

Signed-off-by: Jes Sorensen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

netdev: remove potentially harmful checks

Currently we're checking a variable for != NULL after actually
dereferencing it, in netdev_lower_get_next_private*().

It's counter-intuitive at best, and can lead to faulty usage (as it implies
that the variable can be NULL), so fix it by removing the useless checks.

Reported-by: Daniel Borkmann <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Nicolas Dichtel <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: stephen hemminger <[email protected]>
CC: Jerry Chu <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Staging: unisys: mark drivers as BROKEN

Turns out these drivers like to mess around with the system even if the
hardware they control isn't present. That's not good, and people are
starting to report lots of issues with this in their build/boot testing.

So for now, let's just mark them as BROKEN, until the code gets
converted to use the proper driver model interaction (i.e. don't do
anything until the hardware is actually found in the system.)

Reported-by: Fengguang Wu <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Cc: Benjamin Romer <[email protected]>
Cc: David Kershner <[email protected]>
Cc: someone <[email protected]>
Cc: Ken Cox <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Staging: unisys: verify that a control channel exists

The code didn't verify that a control channel exists before trying to
use it. This led to NULL ptr derefs which an unprivileged user could
easily trigger simply by reading the proc file, causing:

[   68.161404] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   68.162442] IP: visorchannel_read (drivers/staging/unisys/visorchannel/visorchannel_funcs.c:225)
[   68.163165] PGD 5ca21067 PUD 5ca20067 PMD 0
[   68.163712] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   68.164390] Dumping ftrace buffer:
[   68.164793]    (ftrace buffer empty)
[   68.165220] Modules linked in:
[   68.165601] CPU: 0 PID: 7915 Comm: cat Tainted: G        W     3.14.0-next-20140403-sasha-00012-gef5fa7d-dirty #373
[   68.166821] task: ffff88006e8c3000 ti: ffff88005ca30000 task.ti: ffff88005ca30000
[   68.167689] RIP: visorchannel_read (drivers/staging/unisys/visorchannel/visorchannel_funcs.c:225)
[   68.168683] RSP: 0018:ffff88005ca31e58  EFLAGS: 00010282
[   68.169302] RAX: ffff88005ca10000 RBX: ffff88005ca31e97 RCX: 0000000000000001
[   68.170019] RDX: ffff88005ca31e97 RSI: 0000000000000bd6 RDI: 0000000000000000
[   68.170019] RBP: ffff88005ca31e78 R08: 0000000000000000 R09: 0000000000000000
[   68.170019] R10: ffff880000000000 R11: 0000000000000001 R12: 0000000000000001
[   68.170019] R13: 0000000000000bd6 R14: 0000000000000000 R15: 0000000000008000
[   68.170019] FS:  00007f0e8c041700(0000) GS:ffff88007be00000(0000) knlGS:0000000000000000
[   68.170019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   68.170019] CR2: 0000000000000000 CR3: 000000006efe9000 CR4: 00000000000006b0
[   68.170019] Stack:
[   68.170019]  ffff88005ca31f50 ffff88005ca10000 000000000060e000 ffff88005ca31f50
[   68.170019]  ffff88005ca31ec8 ffffffff83e6f983 ffff8800780db810 0000000000008000
[   68.170019]  ffff88005ca31ec8 ffff88006da5f908 ffff8800780db800 000000000060e000
[   68.170019] Call Trace:
[   68.170019] proc_read_toolaction (drivers/staging/unisys/visorchipset/visorchipset_main.c:2541)
[   68.170019] proc_reg_read (fs/proc/inode.c:211)
[   68.170019] vfs_read (fs/read_write.c:408)
[   68.170019] SyS_read (fs/read_write.c:519 fs/read_write.c:511)
[   68.170019] tracesys (arch/x86/kernel/entry_64.S:749)
[   68.170019] Code: 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e0 48 89 d3 4c 89 65 e8 49 89 cc 4c 89 6d f0 49 89 f5 4c 89 75 f8 49 89 fe <48> 8b 3f e8 4f f9 ff ff 85 c0 0f 88 97 00 00 00 4d 85 ed 0f 85
[   68.170019] RIP visorchannel_read (drivers/staging/unisys/visorchannel/visorchannel_funcs.c:225)
[   68.170019]  RSP <ffff88005ca31e58>
[   68.170019] CR2: 0000000000000000

Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

staging: unisys: Add missing close parentheses in filexfer.c

Add missing close parentheses in filexfer.c

Signed-off-by: Masanari Iida <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

pktgen: fix xmit test for BQL enabled devices

As in commit 8e2f1a63f221 ("packet: fix packet_direct_xmit
for BQL enabled drivers"), we test for the __QUEUE_STATE_STACK_XOFF bit
in pktgen's xmit, which would not fully fill the device's TX ring for
BQL drivers that use netdev_tx_sent_queue(). The fix is to use the
netif_xmit_frozen_or_drv_stopped() test, as we do in packet sockets.

Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/at91_ether: avoid NULL pointer dereference

The at91_ether driver calls macb_mii_init passing a 'struct macb'
structure whose tx_clk member is initialized to 0. However,
macb_handle_link_change() expects tx_clk to be the result of
a call to clk_get, and so IS_ERR(tx_clk) to be true if the clock
is invalid. This causes an oops when booting Linux 3.14 on the
csb637 board. The following change avoids this.
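
The mismatch described above can be reproduced in user space. The sketch
below re-implements the error-pointer check for illustration only (the real
helpers live in include/linux/err.h); err_ptr(), is_err() and
clock_looks_valid() are hypothetical names, and marking the clock with
-ENODEV is an assumption, not necessarily what the patch does. It shows
that a zero-initialized pointer passes an IS_ERR()-style test and is
therefore treated as a usable clock.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest errno value that can be encoded in a pointer (as in the kernel). */
#define MAX_ERRNO 4095

/* Encode a negative errno as a pointer value (user-space stand-in). */
static inline void *err_ptr(long error)
{
	return (void *)error;
}

/* True only for pointers in the top MAX_ERRNO bytes of the address space. */
static inline bool is_err(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical caller-side check: "valid" means "not an error pointer". */
static inline bool clock_looks_valid(const void *tx_clk)
{
	return !is_err(tx_clk);
}
```

With these helpers, clock_looks_valid(NULL) is true, while
clock_looks_valid(err_ptr(-ENODEV)) is false, so a caller that has no clock
needs to store an error pointer rather than 0 for the check to work.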

Signed-off-by: Gilles Chanteperdrix <[email protected]>
Acked-by: Nicolas Ferre <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tipc: Let tipc_release() return 0

net/tipc/socket.c: In function ‘tipc_release’:
net/tipc/socket.c:352: warning: ‘res’ is used uninitialized in this function

Introduced by commit 24be34b5a0c9114541891d29dff1152bb1a8df34 ("tipc:
eliminate upcall function pointers between port and socket"), which
removed the sole initializer of "res".

Just return 0 to fix it.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

at86rf230: fix MAX_CSMA_RETRIES parameter

This patch fixes a copy&paste error in setting the MAX_CSMA_RETRIES
value of the at86rf212 chip, which was introduced by commit
f2fdd67c6bc89de0100410efb37de69b1c98ac03 ("ieee802154: enable
smart transmitter features of RF212").

Signed-off-by: Alexander Aring <[email protected]>
Cc: Phoebe Buckheister <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sched: remove sleep_on() and friends

This is the final piece in the puzzle, as all patches to remove the
last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
into the 3.15 merge window. The work was long overdue, and this
interface in particular should not have survived the BKL removal
that was done a couple of years ago.

Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3:

"[...] it was suggested that the janitors look for and fix all code
  that calls sleep_on() [...] since (1) almost all such code is
  incorrect, and (2) Linus has agreed that those functions should
  be removed in the 2.5 development series".

We haven't quite made it for 2.5, but maybe we can merge this for 3.15.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
  for new Ceph osd and crush map features, including some new tunables,
  primary affinity, and the new encoding that is needed for erasure
  coding support.  This brings things into parity with the server side
  and the looming firefly release.  There is also support for allocation
  hints in RBD that help limit fragmentation on the server side.

  There is also a series of patches from Zheng fixing NFS reexport,
  directory fragmentation support, flock vs fcntl behavior, and some
  issues with clustered MDS.

  Finally, there are some miscellaneous fixes from Yunchuan Wen for
  fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
  behavior"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
  ceph: skip invalid dentry during dcache readdir
  libceph: dump pool {read,write}_tier to debugfs
  libceph: output primary affinity values on osdmap updates
  ceph: flush cap release queue when trimming session caps
  ceph: don't grabs open file reference for aborted request
  ceph: drop extra open file reference in ceph_atomic_open()
  ceph: preallocate buffer for readdir reply
  libceph: enable PRIMARY_AFFINITY feature bit
  libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
  libceph: add support for osd primary affinity
  libceph: add support for primary_temp mappings
  libceph: return primary from ceph_calc_pg_acting()
  libceph: switch ceph_calc_pg_acting() to new helpers
  libceph: introduce apply_temps() helper
  libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
  libceph: ceph_can_shift_osds(pool) and pool type defines
  libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
  libceph: enable OSDMAP_ENC feature bit
  libceph: primary_affinity decode bits
  libceph: primary_affinity infrastructure
  ...

Merge tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes the following major enhancement patches.
   - introduce large directory support
   - introduce f2fs_issue_flush to merge redundant flush commands
   - merge write IOs as much as possible aligned to the segment
   - add sysfs entries to tune the f2fs configuration
   - use radix_tree for the free_nid_list to reduce in-memory operations
   - remove costly bit operations in f2fs_find_entry
   - enhance the readahead flow for CP/NAT/SIT/SSA blocks

  The other bug fixes are as follows:
   - recover xattr node blocks correctly after sudden-power-cut
   - fix to calculate the maximum number of node ids
   - enhance to handle many error cases

  And, there are a bunch of cleanups"

* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
  f2fs: fix wrong statistics of inline data
  f2fs: check the acl's validity before setting
  f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
  f2fs: fix to cover io->bio with io_rwsem
  f2fs: fix error path when fail to read inline data
  f2fs: use list_for_each_entry{_safe} for simplyfying code
  f2fs: avoid free slab cache under spinlock
  f2fs: avoid unneeded lookup when xattr name length is too long
  f2fs: avoid unnecessary bio submit when wait page writeback
  f2fs: return -EIO when node id is not matched
  f2fs: avoid RECLAIM_FS-ON-W warning
  f2fs: skip unnecessary node writes during fsync
  f2fs: introduce fi->i_sem to protect fi's info
  f2fs: change reclaim rate in percentage
  f2fs: add missing documentation for dir_level
  f2fs: remove unnecessary threshold
  f2fs: throttle the memory footprint with a sysfs entry
  f2fs: avoid to drop nat entries due to the negative nr_shrink
  f2fs: call f2fs_wait_on_page_writeback instead of native function
  f2fs: introduce nr_pages_to_write for segment alignment
  ...

Merge tag 'fbdev-omap-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux

Pull OMAP fbdev changes from Tomi Valkeinen:
"This is based on the already pulled fbdev-main changes, and this also
  merges .dts branch from Tony Lindgren (which has also been pulled), so
  that I was able to add the display related .dts changes.

  This contains OMAP related fbdev changes for 3.15.  The bulk of the
  patches are for adding Device Tree support for OMAP Display Subsystem:

   - SoCs: OMAP2/3/4

   - Boards: OMAP4 Panda, OMAP4 SDP, OMAP3 Beagle, OMAP3 Beagle-xM,
     OMAP3 IGEP0020, OMAP3 N900

   - Devices: TFP410 Encoder, tpd12s015 HDMI companion chip, Sony
     acx565akm panel, MIPI DSI Command mode panel and HDMI, DVI and
     Analog TV connectors"

* tag 'fbdev-omap-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (45 commits)
  OMAPDSS: HDMI: fix interlace output
  OMAPDSS: add missing __init for dss_init_ports
  ARM: OMAP2+: remove pdata quirks for displays
  OMAPDSS: remove DT hacks for regulators
  Doc/DT: Add DT binding documentation for tpd12s015 encoder
  Doc/DT: Add DT binding documentation for TFP410 encoder
  Doc/DT: Add DT binding documentation for Sony acx565akm panel
  Doc/DT: Add DT binding documentation for MIPI DSI CM Panel
  Doc/DT: Add DT binding documentation for HDMI Connector
  Doc/DT: Add DT binding documentation for DVI Connector
  Doc/DT: Add DT binding documentation for Analog TV Connector
  ARM: omap3-n900.dts: add display information
  ARM: omap3-igep0020.dts: add display information
  ARM: omap3-beagle-xm.dts: add display information
  ARM: omap3-beagle.dts: add display information
  ARM: omap4-sdp.dts: add display information
  Doc/DT: Add DT binding documentation for OMAP DSS
  OMAPDSS: acx565akm: Add DT support
  OMAPDSS: connector-analog-tv: Add DT support
  OMAPDSS: hdmi-connector: Add DT support
  ...

Merge tag 'mfd-for-linus-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
"Changes to existing drivers:
   - Use of managed resources - omap, twl4030, ti_am335x_tscadc
   - Advanced error handling - omap
   - Rework clk management - omap
   - Device Tree (re-)work - tc3589x, pm8921, da9055, sec
   - IRQ management overhaul and !BROKEN - pm8921
   - Convert to regmap - ssbi, pm8921
   - Use simple power-management ops - ucb1x00
   - Include file clean-up - adp5520, cs5535, janz, lpc_ich,
       lpc_sch, max14577, mcp-sa11x0, pcf50633-adc, rc5t583,
       rdc321x-southbridge, retu, smsc-ece1099, ti-ssp, ti_am335x_tscadc,
       tps65912, vexpress-config, wm8350, ywm8350
   - Various bug fixes across the subsystem
      - NULL/invalid pointer dereference prevention
      - Resource leak mitigation
      - Variable used uninitialised
      - Staticise various containers
      - Enforce return value checks

  New drivers/supported devices:
   - Add support for s2mps14 and s2mpa01 to sec
   - Add support for da9063 (v5) to da9063
   - Add support for atom-c2000 to gpio-ich
   - Add support for come-{mbt10,cbt6,chl6} to kempld
   - Add support for da9053 to da9052
   - Add support for itco-wdt (v3) and baytrail to lpc_ich
   - Add new drivers for tps65218, rtsx_usb, bcm590xx

  (Re-)moved drivers:
   - twl4030 ==> drivers/iio
   - ti-ssp  ==> /dev/null"

* tag 'mfd-for-linus-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (103 commits)
  mfd: wm5110: Correct default for HEADPHONE_DETECT_1
  mfd: arizona: Correct small errors in the DT binding documentation
  mfd: arizona: Mark DSP clocking register as volatile
  mfd: devicetree: bindings: Add pm8xxx RTC description
  mfd: kempld-core: Fix potential hang-up during boot
  mfd: sec-core: Fix uninitialized 'regmap_rtc' on S2MPA01
  mfd: tps65910: Fix regmap_irq_chip_data leak on mfd_add_devices fail
  mfd: tps65910: Fix possible invalid pointer dereference on regmap_add_irq_chip fail
  mfd: sec-core: Fix I2C dummy device resource leak on probe failure
  mfd: sec-core: Add of_compatible strings for clock MFD cells
  mfd: Remove obsolete ti-ssp driver
  Documentation: mfd: s2mps11: Describe S5M8767 and S2MPS14 clocks
  mfd: bcm590xx: Fix type argument for module device table
  mfd: lpc_ich: Add support for Intel Bay Trail SoC
  mfd: lpc_ich: Add support for NM10 GPIO
  mfd: lpc_ich: Change Avoton to iTCO v3
  watchdog: iTCO_wdt: Add support for v3 silicon
  mfd: lpc_ich: Add support for iTCO v3
  mfd: lpc_ich: Remove lpc_ich_cfg struct use
  mfd: lpc_ich: Only configure watchdog or GPIO when present
  ...

mac802154: fix duplicate #include headers

The commit e6278d92005e ("mac802154: use header operations to
create/parse headers") included the header

net/ieee802154_netdev.h

which had been included by the commit b70ab2e87f17 ("ieee802154:
enforce consistent endianness in the 802.15.4 stack"). Fix this
duplicate #include by deleting the latter one as the required header
has already been in place.

Signed-off-by: Jean Sacren <[email protected]>
Cc: Alexander Smirnov <[email protected]>
Cc: Dmitry Eremin-Solenikov <[email protected]>
Cc: Phoebe Buckheister <[email protected]>
Cc: [email protected]
Signed-off-by: David S. Miller <[email protected]>

sxgbe: fix duplicate #include headers

The commit 1edb9ca69e8a ("net: sxgbe: add basic framework for
Samsung 10Gb ethernet driver") added support for the Samsung 10Gb
ethernet driver (sxgbe) with a minor issue: the linux/io.h header is
included twice in sxgbe_dma.c. Fix the duplicate #include by deleting
the first one so that the remaining #include headers stay in
alphabetical order.

Signed-off-by: Jean Sacren <[email protected]>
Cc: Byungho An <[email protected]>
Cc: Girish K S <[email protected]>
Cc: Siva Reddy Kallam <[email protected]>
Cc: Vipul Pandya <[email protected]>
Acked-by: Byungho An <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
- A few SPI NOR ID definitions
- Kill the NAND "max pagesize" restriction
- Fix some x16 bus-width NAND support
- Add NAND JEDEC parameter page support
- DT bindings for NAND ECC
- GPMI NAND updates (subpage reads)
- More OMAP NAND refactoring
- New STMicro SPI NOR driver (now in 40 patches!)
- A few other random bugfixes

* tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd: (120 commits)
  Fix index regression in nand_read_subpage
  mtd: diskonchip: mem resource name is not optional
  mtd: nand: fix mention to CONFIG_MTD_NAND_ECC_BCH
  mtd: nand: fix GET/SET_FEATURES address on 16-bit devices
  mtd: omap2: Use devm_ioremap_resource()
  mtd: denali_dt: Use devm_ioremap_resource()
  mtd: devices: elm: update DRIVER_NAME as "omap-elm"
  mtd: devices: elm: configure parallel channels based on ecc_steps
  mtd: devices: elm: clean elm_load_syndrome
  mtd: devices: elm: check for hardware engine's design constraints
  mtd: st_spi_fsm: Succinctly reorganise .remove()
  mtd: st_spi_fsm: Allow loop to run at least once before giving up CPU
  mtd: st_spi_fsm: Correct vendor name spelling issue - missing "M"
  mtd: st_spi_fsm: Avoid duplicating MTD core code
  mtd: st_spi_fsm: Remove useless consts from function arguments
  mtd: st_spi_fsm: Convert ST SPI FSM (NOR) Flash driver to new DT partitions
  mtd: st_spi_fsm: Move runtime configurable msg sequences into device's struct
  mtd: st_spi_fsm: Supply the W25Qxxx chip specific configuration call-back
  mtd: st_spi_fsm: Supply the S25FLxxx chip specific configuration call-back
  mtd: st_spi_fsm: Supply the MX25xxx chip specific configuration call-back
  ...

net: filter: be more defensive on div/mod by X==0

The old interpreter behaviour was that we returned with 0
whenever we found a division by 0 would take place. In the new
interpreter we would currently just skip that instead and
continue execution.

It's true that a value of 0 as return might not be appropriate
in all cases, but current users (socket filters -> drop
packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified,
etc) seem fine with that behaviour. Better this than undefined
BPF program behaviour as it's expected that A contains the
result of the division. In future, as more use cases open up,
we could further adapt this return value to our needs, if
necessary.

So reintroduce return of 0 for division by 0 as in the old
interpreter. Also in case of K which is guaranteed to be 32bit
wide, sk_chk_filter() already takes care of preventing division
by 0 invoked through K, so we can generally spare ourselves these tests.
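
As an illustration of the behaviour described above, here is a minimal
user-space sketch of a toy interpreter. The instruction set, struct
toy_insn and toy_run() are all hypothetical and do not match the kernel's
BPF encoding; only the rule "division or modulo by X == 0 returns 0 instead
of skipping the instruction" is taken from this commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes, not the kernel's BPF instruction encoding. */
enum toy_op { TOY_LD_A, TOY_LD_X, TOY_DIV_X, TOY_MOD_X, TOY_RET_A };

struct toy_insn {
	enum toy_op op;
	uint32_t k;	/* immediate, used by the load instructions only */
};

/*
 * Run a toy program with accumulator A and index register X.
 * Returns A on TOY_RET_A. Returns 0 when a division or modulo by
 * X == 0 is attempted, or when the program ends without a return.
 */
static uint32_t toy_run(const struct toy_insn *prog, size_t len)
{
	uint32_t a = 0, x = 0;

	for (size_t i = 0; i < len; i++) {
		switch (prog[i].op) {
		case TOY_LD_A:
			a = prog[i].k;
			break;
		case TOY_LD_X:
			x = prog[i].k;
			break;
		case TOY_DIV_X:
			if (x == 0)
				return 0;	/* bail out, do not skip */
			a /= x;
			break;
		case TOY_MOD_X:
			if (x == 0)
				return 0;	/* same rule for modulo */
			a %= x;
			break;
		case TOY_RET_A:
			return a;
		}
	}
	return 0;
}
```

If the interpreter skipped the division instead, a program that loads
A = 10 and X = 0 would return the stale value 10; with the early return it
yields 0, which the users listed above interpret as drop, kill, or
unclassified.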

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

arm64: Add missing Kconfig for CONFIG_STRICT_DEVMEM

The Kconfig for CONFIG_STRICT_DEVMEM is missing despite being
used in mmap.c. Add it.

Signed-off-by: Laura Abbott <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>

Xen: do hv callback accounting only on x86

Patch 99c8b79d3c1 "xen: Add proper irq accounting for HYPERCALL vector"
added a call to inc_irq_stat(irq_hv_callback_count) in common Xen code,
however both the inc_irq_stat function and the irq_hv_callback_count
counter are architecture specific.

This makes the code build again on ARM by moving the call into the
existing #ifdef CONFIG_X86. We may want to later do the same implementation
on ARM that x86 has though.

Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David Vrabel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Xen <[email protected]>

Merge commit '683b6c6f82a60fabf47012581c2cfbf1b037ab95' into stable/for-linus-3.15

This merge of the irq-core-for-linus branch broke the ARM build when
Xen is enabled.

Conflicts:
drivers/xen/events/events_base.c

f2fs: fix wrong statistics of inline data

If we remove a file that has inline data after mount, our statistics become
inaccurate.

cat /sys/kernel/debug/f2fs/status
- Inline_data Inode: 4294967295

Let's add stat_inc_inline_inode() to account for the file's inline data at lookup time.

Change log from v1:
o stat in f2fs_lookup() instead of in do_read_inode() for excluding wrong stat.
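
The reported value 4294967295 equals 2^32 - 1, which is what a 32-bit
counter shows when it is decremented from 0 and printed as unsigned. A
minimal sketch, assuming a 32-bit counter (the actual type is not stated
here); struct toy_stats and the toy_stat_* helpers are hypothetical and are
not the f2fs macros:

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-filesystem statistics. */
struct toy_stats {
	uint32_t inline_inodes;
};

/* Count an inode with inline data (e.g. when it is looked up). */
static void toy_stat_inc(struct toy_stats *s)
{
	s->inline_inodes++;
}

/* Uncount an inode with inline data (e.g. when it is removed). */
static void toy_stat_dec(struct toy_stats *s)
{
	s->inline_inodes--;	/* wraps to 4294967295 if it was 0 */
}
```

A decrement without a matching increment wraps the counter, while a
balanced increment/decrement pair leaves it at 0.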

Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>