Git Repo - linux.git/log

uapi: rename ext2_swab() to swab() and share globally in swab.h

ext2_swab() is defined locally in lib/find_bit.c However it is not
specific to ext2, neither to bitmaps.

There are many potential users of it, so rename it to just swab() and
move to include/uapi/linux/swab.h

ABI guarantees that size of unsigned long corresponds to BITS_PER_LONG,
therefore drop unneeded cast.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Allison Randal <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: William Breathitt Gray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/scatterlist.c: adjust indentation in __sg_alloc_table

Clang warns:

  ../lib/scatterlist.c:314:5: warning: misleading indentation; statement
  is not part of the previous 'if' [-Wmisleading-indentation]
                          return -ENOMEM;
                          ^
  ../lib/scatterlist.c:311:4: note: previous statement is here
                          if (prv)
                          ^
  1 warning generated.

This warning occurs because there is a space before the tab on this
line.  Remove it so that the indentation is consistent with the Linux
kernel coding style and clang no longer warns.

Link: http://lkml.kernel.org/r/[email protected]
Link: https://github.com/ClangBuiltLinux/linux/issues/830
Fixes: edce6820a9fd ("scatterlist: prevent invalid free when alloc fails")
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

btrfs: use larger zlib buffer for s390 hardware compression

In order to benefit from s390 zlib hardware compression support,
increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390
zlib hardware support is enabled on the machine).

This brings up to 60% better performance in hardware on s390 compared to
the PAGE_SIZE buffer and much more compared to the software zlib
processing in btrfs. In case of memory pressure, fall back to a single
page buffer during workspace allocation.

The data compressed with larger input buffers will still conform to zlib
standard and thus can be decompressed also on a systems that uses only
PAGE_SIZE buffer for btrfs zlib.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/zlib: add zlib_deflate_dfltcc_enabled() function

Add a new function to zlib.h checking if s390 Deflate-Conversion
facility is installed and enabled.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/boot: add dfltcc= kernel command line parameter

Add the new kernel command line parameter 'dfltcc=' to configure s390
zlib hardware support.

Format: { on | off | def_only | inf_only | always }
on:       s390 zlib hardware support for compression on
           level 1 and decompression (default)
off:      No s390 zlib hardware support
def_only: s390 zlib hardware support for deflate
           only (compression on level 1)
inf_only: s390 zlib hardware support for inflate
           only (decompression)
always:   Same as 'on' but ignores the selected compression
           level always using hardware support (used for debugging)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/zlib: add s390 hardware support for kernel zlib_inflate

Add decompression functions to zlib_dfltcc library. Update zlib_inflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware inflate
decompression.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Mikhail Zaslonko <[email protected]>
Co-developed-by: Ilya Leoshkevich <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/boot: rename HEAP_SIZE due to name collision

Change the conflicting macro name in preparation for zlib_inflate
hardware support.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/zlib: add s390 hardware support for kernel zlib_deflate

Patch series "S390 hardware support for kernel zlib", v3.

With IBM z15 mainframe the new DFLTCC instruction is available.  It
implements deflate algorithm in hardware (Nest Acceleration Unit - NXU)
with estimated compression and decompression performance orders of
magnitude faster than the current zlib.

This patchset adds s390 hardware compression support to kernel zlib.
The code is based on the userspace zlib implementation:

https://github.com/madler/zlib/pull/410

The coding style is also preserved for future maintainability.  There is
only limited set of userspace zlib functions represented in kernel.
Apart from that, all the memory allocation should be performed in
advance.  Thus, the workarea structures are extended with the parameter
lists required for the DEFLATE CONVENTION CALL instruction.

Since kernel zlib itself does not support gzip headers, only Adler-32
checksum is processed (also can be produced by DFLTCC facility).  Like
it was implemented for userspace, kernel zlib will compress in hardware
on level 1, and in software on all other levels.  Decompression will
always happen in hardware (when enabled).

Two DFLTCC compression calls produce the same results only when they
both are made on machines of the same generation, and when the
respective buffers have the same offset relative to the start of the
page.  Therefore care should be taken when using hardware compression
when reproducible results are desired.  However it does always produce
the standard conform output which can be inflated anyway.

The new kernel command line parameter 'dfltcc' is introduced to
configure s390 zlib hardware support:

    Format: { on | off | def_only | inf_only | always }
     on:       s390 zlib hardware support for compression on
               level 1 and decompression (default)
     off:      No s390 zlib hardware support
     def_only: s390 zlib hardware support for deflate
               only (compression on level 1)
     inf_only: s390 zlib hardware support for inflate
               only (decompression)
     always:   Same as 'on' but ignores the selected compression
               level always using hardware support (used for debugging)

The main purpose of the integration of the NXU support into the kernel
zlib is the use of hardware deflate in btrfs filesystem with on-the-fly
compression enabled.  Apart from that, hardware support can also be used
during boot for decompressing the kernel or the ramdisk image

With the patch for btrfs expanding zlib buffer from 1 to 4 pages (patch
6) the following performance results have been achieved using the
ramdisk with btrfs.  These are relative numbers based on throughput rate
and compression ratio for zlib level 1:

  Input data              Deflate rate   Inflate rate   Compression ratio
                          NXU/Software   NXU/Software   NXU/Software
  stream of zeroes        1.46           1.02           1.00
  random ASCII data       10.44          3.00           0.96
  ASCII text (dickens)    6,21           3.33           0.94
  binary data (vmlinux)   8,37           3.90           1.02

This means that s390 hardware deflate can provide up to 10 times faster
compression (on level 1) and up to 4 times faster decompression (refers
to all compression levels) for btrfs zlib.

Disclaimer: Performance results are based on IBM internal tests using DD
command-line utility on btrfs on a Fedora 30 based internal driver in
native LPAR on a z15 system.  Results may vary based on individual
workload, configuration and software levels.

This patch (of 9):

Create zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL
implementation and related compression functions.  Update zlib_deflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware deflate.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Mikhail Zaslonko <[email protected]>
Co-developed-by: Ilya Leoshkevich <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iio: adc: qcom-vadc-common: use <linux/units.h> helpers

This switches the qcom-vadc-common to use milli_kelvin_to_millicelsius()
in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Jonathan Cameron <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Zhang Rui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: armada: remove unused TO_MCELSIUS macro

This removes unused TO_MCELSIUS() macro.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Sujith Thomas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iwlwifi: use <linux/units.h> helpers

This switches the iwlwifi driver to use celsius_to_kelvin() and
kelvin_to_celsius() in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Acked-by: Luca Coelho <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Zhang Rui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iwlegacy: use <linux/units.h> helpers

This switches the iwlegacy driver to use celsius_to_kelvin() and
kelvin_to_celsius() in <linux/units.h>.

[[email protected]: fix build warnings with format string]
Link: http://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Acked-by: Kalle Valo <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h>

This removes the kelvin to/from Celsius conversion helper macros in
<linux/thermal.h> which were switched to the inline helper functions in
<linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nvme: hwmon: switch to use <linux/units.h> helpers

This switches the nvme driver to use kelvin_to_millicelsius() and
millicelsius_to_kelvin() in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: intel_pch: switch to use <linux/units.h> helpers

This switches the intel pch thermal driver to use
deci_kelvin_to_millicelsius() in <linux/units.h> instead of helpers in
<linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: int340x: switch to use <linux/units.h> helpers

This switches the int340x thermal zone driver to use
deci_kelvin_to_millicelsius() and millicelsius_to_deci_kelvin() in
<linux/units.h> instead of helpers in <linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

platform/x86: intel_menlow: switch to use <linux/units.h> helpers

This switches the intel_menlow driver to use deci_kelvin_to_celsius()
and celsius_to_deci_kelvin() in <linux/units.h> instead of helpers in
<linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

This also removes a trailing space, while we're at it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

platform/x86: asus-wmi: switch to use <linux/units.h> helpers

The asus-wmi driver doesn't implement the thermal device functionality
directly, so including <linux/thermal.h> just for
DECI_KELVIN_TO_CELSIUS() is a bit odd.

This switches the asus-wmi driver to use deci_kelvin_to_millicelsius()
in <linux/units.h>.

The format string is changed from %d to %ld due to function returned
type.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ACPI: thermal: switch to use <linux/units.h> helpers

This switches the ACPI thermal zone driver to use
celsius_to_deci_kelvin(), deci_kelvin_to_celsius(), and
deci_kelvin_to_millicelsius_with_offset() in <linux/units.h> instead of
helpers in <linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/units.h: add helpers for kelvin to/from Celsius conversion

Patch series "add header file for kelvin to/from Celsius conversion
helpers", v4.

There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers.  These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just
for the helpers.

This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems, and switches all the users of
conversion helpers in <linux/thermal.h> to use <linux/units.h> helpers.

This patch (of 12):

There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers.  These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just
for the helpers.

This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems.  It is intended to replace the
helpers in <linux/thermal.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store

Currently when an error code -EIO or -ENOSPC in the for-loop of
writeback_store the error code is being overwritten by a ret = len
assignment at the end of the function and the error codes are being
lost. Fix this by assigning ret = len at the start of the function and
remove the assignment from the end, hence allowing ret to be preserved
when error codes are assigned to it.

Addresses Coverity ("Unused value")

Link: http://lkml.kernel.org/r/[email protected]
Fixes: a939888ec38b ("zram: support idle/huge page writeback")
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: try to avoid worst-case scenario on same element pages

The worst-case scenario on finding same element pages is that almost all
elements are same at the first glance but only last few elements are
different.

Since the same element tends to be grouped from the beginning of the
pages, if we check the first element with the last element before
looping through all elements, we might have some chances to quickly
detect non-same element pages.

1. Test is done under LG webOS TV (64-bit arch)
2. Dump the swap-out pages (~819200 pages)
3. Analyze the pages with simple test script which counts the iteration
    number and measures the speed at off-line

Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes =
512.  The speed is based on the time to consume page_same_filled()
function only.  The result, on average, is listed as below:

                                     Num of Iter    Speed(MB/s)
  Looping-Forward (Orig)                 38            99265
  Looping-Backward                       36           102725
  Last-element-check (This Patch)        33           125072

The result shows that the average iteration count decreases by 13% and
the speed increases by 25% with this patch.  This patch does not
increase the overall time complexity, though.

I also ran simpler version which uses backward loop.  Just looping
backward also makes some improvement, but less than this patch.

[[email protected]: fix off-by-one]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Taejoon Song <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix comments related to node reclaim

As zone reclaim has been replaced by node reclaim, this patch fixes
related comments.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Lee <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block

memory_block structure elements 'hw' and 'phys_callback' are not getting
used. This was originally added with commit 3947be1969a9 ("[PATCH]
memory hotplug: sysfs and add/remove functions") but never seem to have
been used. Just drop them now.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/mm.h: remove dead code totalram_pages_set()

totalram_pages_set() was introduced in commit ca79b0c211af ("mm: convert
totalram_pages and totalhigh_pages variables to atomic"), but no one
uses it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/mm.h: clean up obsolete check on space in page->flags

The check was intended to make sure we don't overrun page flags. But
it's obsolete because it doesn't include LAST_CPUPID_WIDTH nor
KASAN_TAG_WIDTH.

Just remove check since we already have it covered in
linux/page-flags-layout.h (near the end of the file).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zswap: potential NULL dereference on error in init_zswap()

The "pool" pointer can be NULL at the end of the init_zswap(). (We
would allocate a new pool later in that situation)

So in the error handling then we need to make sure pool is a valid
pointer before calling "zswap_pool_destroy(pool);" because that function
dereferences the argument.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 93d4dfa9fbd0 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Dan Carpenter <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/zswap.c: add allocation hysteresis if pool limit is hit

zswap will always try to shrink pool when zswap is full.  If there is a
high pressure on zswap it will result in flipping pages in and out zswap
pool without any real benefit, and the overall system performance will
drop.  The previous discussion on this subject [1] ended up with a
suggestion to implement a sort of hysteresis to refuse taking pages into
zswap pool until it has sufficient space if the limit has been hit.
This is my take on this.

Hysteresis is controlled with a sysfs-configurable parameter (namely,
/sys/kernel/debug/zswap/accept_threhsold_percent).  It specifies the
threshold at which zswap would start accepting pages again after it
became full.  Setting this parameter to 100 disables the hysteresis and
sets the zswap behavior to pre-hysteresis state.

[1] https://lkml.org/lkml/2019/11/8/949

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Cc: Dan Streetman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_isolation: fix potential warning from user

It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
from start_isolate_page_range(), but should avoid triggering it from
userspace, i.e, from is_mem_section_removable() because it could crash
the system by a non-root user if warn_on_panic is set.

While at it, simplify the code a bit by removing an unnecessary jump
label.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hotplug: silence a lockdep splat with printk()

It is not that hard to trigger lockdep splats by calling printk from
under zone->lock.  Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well).  There are some console drivers which do
allocate from the printk context as well and those should be fixed.  In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.

So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock.  Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.

Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug.  The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.

While at it, remove a similar but unnecessary debug-only printk() as
well.  A sample of one of those lockdep splats is,

  WARNING: possible circular locking dependency detected
  ------------------------------------------------------
  test.sh/8653 is trying to acquire lock:
  ffffffff865a4460 (console_owner){-.-.}, at:
  console_unlock+0x207/0x750

  but task is already holding lock:
  ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
  __offline_isolated_pages+0x179/0x3e0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&(&zone->lock)->rlock){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock+0x2f/0x40
         rmqueue_bulk.constprop.21+0xb6/0x1160
         get_page_from_freelist+0x898/0x22c0
         __alloc_pages_nodemask+0x2f3/0x1cd0
         alloc_pages_current+0x9c/0x110
         allocate_slab+0x4c6/0x19c0
         new_slab+0x46/0x70
         ___slab_alloc+0x58b/0x960
         __slab_alloc+0x43/0x70
         __kmalloc+0x3ad/0x4b0
         __tty_buffer_request_room+0x100/0x250
         tty_insert_flip_string_fixed_flag+0x67/0x110
         pty_write+0xa2/0xf0
         n_tty_write+0x36b/0x7b0
         tty_write+0x284/0x4c0
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         redirected_tty_write+0x6a/0xc0
         do_iter_write+0x248/0x2a0
         vfs_writev+0x106/0x1e0
         do_writev+0xd4/0x180
         __x64_sys_writev+0x45/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #2 (&(&port->lock)->rlock){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock_irqsave+0x3a/0x50
         tty_port_tty_get+0x20/0x60
         tty_port_default_wakeup+0xf/0x30
         tty_port_tty_wakeup+0x39/0x40
         uart_write_wakeup+0x2a/0x40
         serial8250_tx_chars+0x22e/0x440
         serial8250_handle_irq.part.8+0x14a/0x170
         serial8250_default_handle_irq+0x5c/0x90
         serial8250_interrupt+0xa6/0x130
         __handle_irq_event_percpu+0x78/0x4f0
         handle_irq_event_percpu+0x70/0x100
         handle_irq_event+0x5a/0x8b
         handle_edge_irq+0x117/0x370
         do_IRQ+0x9e/0x1e0
         ret_from_intr+0x0/0x2a
         cpuidle_enter_state+0x156/0x8e0
         cpuidle_enter+0x41/0x70
         call_cpuidle+0x5e/0x90
         do_idle+0x333/0x370
         cpu_startup_entry+0x1d/0x1f
         start_secondary+0x290/0x330
         secondary_startup_64+0xb6/0xc0

  -> #1 (&port_lock_key){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock_irqsave+0x3a/0x50
         serial8250_console_write+0x3e4/0x450
         univ8250_console_write+0x4b/0x60
         console_unlock+0x501/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5

  -> #0 (console_owner){-.-.}:
         check_prev_add+0x107/0xea0
         validate_chain+0x8fc/0x1200
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         console_unlock+0x269/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5
         __offline_isolated_pages.cold.52+0x2f/0x30a
         offline_isolated_pages_cb+0x17/0x30
         walk_system_ram_range+0xda/0x160
         __offline_pages+0x79c/0xa10
         offline_pages+0x11/0x20
         memory_subsys_offline+0x7e/0xc0
         device_offline+0xd5/0x110
         state_store+0xc6/0xe0
         dev_attr_store+0x3f/0x60
         sysfs_kf_write+0x89/0xb0
         kernfs_fop_write+0x188/0x240
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         ksys_write+0xc6/0x160
         __x64_sys_write+0x43/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  other info that might help us debug this:

  Chain exists of:
    console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&(&zone->lock)->rlock);
                                 lock(&(&port->lock)->rlock);
                                 lock(&(&zone->lock)->rlock);
    lock(console_owner);

   *** DEADLOCK ***

  9 locks held by test.sh/8653:
   #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
  vfs_write+0x25f/0x290
   #1: ffff888277618880 (&of->mutex){+.+.}, at:
  kernfs_fop_write+0x128/0x240
   #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
  kernfs_fop_write+0x138/0x240
   #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
  lock_device_hotplug_sysfs+0x16/0x50
   #4: ffff8884374f4990 (&dev->mutex){....}, at:
  device_offline+0x70/0x110
   #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
  __offline_pages+0xbf/0xa10
   #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
  percpu_down_write+0x87/0x2f0
   #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
  __offline_isolated_pages+0x179/0x3e0
   #8: ffffffff865a4920 (console_lock){+.+.}, at:
  vprintk_emit+0x100/0x340

  stack backtrace:
  Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
  BIOS U34 05/21/2019
  Call Trace:
   dump_stack+0x86/0xca
   print_circular_bug.cold.31+0x243/0x26e
   check_noncircular+0x29e/0x2e0
   check_prev_add+0x107/0xea0
   validate_chain+0x8fc/0x1200
   __lock_acquire+0x5b3/0xb40
   lock_acquire+0x126/0x280
   console_unlock+0x269/0x750
   vprintk_emit+0x10d/0x340
   vprintk_default+0x1f/0x30
   vprintk_func+0x44/0xd4
   printk+0x9f/0xc5
   __offline_isolated_pages.cold.52+0x2f/0x30a
   offline_isolated_pages_cb+0x17/0x30
   walk_system_ram_range+0xda/0x160
   __offline_pages+0x79c/0xa10
   offline_pages+0x11/0x20
   memory_subsys_offline+0x7e/0xc0
   device_offline+0xd5/0x110
   state_store+0xc6/0xe0
   dev_attr_store+0x3f/0x60
   sysfs_kf_write+0x89/0xb0
   kernfs_fop_write+0x188/0x240
   __vfs_write+0x50/0xa0
   vfs_write+0x105/0x290
   ksys_write+0xc6/0x160
   __x64_sys_write+0x43/0x50
   do_syscall_64+0xcc/0x76c
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: pass in nid to online_pages()

Patch series "mm/memory_hotplug: pass in nid to online_pages()".

Simplify onlining code and get rid of find_memory_block(). Pass in the
nid from the memory block we are trying to online directly, instead of
manually looking it up.

This patch (of 2):

No need to lookup the memory block, we can directly pass in the nid.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma()

The jump labels try_prev and none are not really needed in
find_mergeable_anon_vma(), eliminate them to improve readability.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: fix defrag setting if newline is not used

If thp defrag setting "defer" is used and a newline is *not* used when
writing to the sysfs file, this is interpreted as the "defer+madvise"
option.

This is because we do prefix matching and if five characters are written
without a newline, the current code ends up comparing to the first five
bytes of the "defer+madvise" option and using that instead.

Use the more appropriate sysfs_streq() that handles the trailing newline
for us.  Since this doubles as a nice cleanup, do it in enabled_store()
as well.

The current implementation relies on prefix matching: the number of
bytes compared is either the number of bytes written or the length of
the option being compared.  With a newline, "defer\n" does not match
"defer+"madvise"; without a newline, however, "defer" is considered to
match "defer+madvise" (prefix matching is only comparing the first five
bytes).  End result is that writing "defer" is broken unless it has an
additional trailing character.

This means that writing "madv" in the past would match and set
"madvise".  With strict checking, that no longer is the case but it is
unlikely anybody is currently doing this.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option")
Signed-off-by: David Rientjes <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: add stable check in migrate_vma_insert_page()

migrate_vma_insert_page() closely follows the code in:
  __handle_mm_fault()
    handle_pte_fault()
      do_anonymous_page()

Add a call to check_stable_address_space() after locking the page table
entry before inserting a ZONE_DEVICE private zero page mapping similar
to page faulting a new anonymous page.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: clean up some minor coding style

Fix some comment typos and coding style clean up in preparation for the
next patch. No functional changes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: remove useless mask of start address

Addresses passed to walk_page_range() callback functions are already
page aligned and don't need to be masked with PAGE_MASK.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: reduce critical section protected by split_queue_lock

split_queue_lock protects data in struct deferred_split. We can release
the lock after delete the page from deferred_split_queue.

This patch moves the THP accounting out of the lock protection, which is
introduced in commit 65c453778aea ("mm, rmap: account shmem thp pages").

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: use head to emphasize the purpose of page

During split huge page, it checks the property of the page.  Currently
we do the check on page and head without emphasizing the check is on the
compound page.  In case the page passed to split_huge_page_to_list is a
tail page, audience would take some time to think about whether the
check is on compound page or tail page itself.

To make it explicit, use head instead of page for those checks.  After
this, audience would be more clear about the checks are on compound page
and the page is used to do the split and dump error message if failed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: use head to check huge zero page

The page could be a tail page, if this is the case, this BUG_ON will
never be triggered.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, oom: dump stack of victim when reaping failed

When a process cannot be oom reaped, for whatever reason, currently the
list of locks that are held is currently dumped to the kernel log.

Much more interesting is the stack trace of the victim that cannot be
reaped. If the stack trace is dumped, we have the ability to find
related occurrences in the same kernel code and hopefully solve the
issue that is making it wedged.

Dump the stack trace when a process fails to be oom reaped.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memblock: Use __func__ in remaining memblock_dbg() call sites

Replace open function name strings with %s (__func__) in all remaining
memblock_dbg() call sites.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memblock: define memblock_physmem_add()

On the s390 platform memblock.physmem array is being built by directly
calling into memblock_add_range() which is a low level function not
intended to be used outside of memblock.  Hence lets conditionally add
helper functions for physmem array when HAVE_MEMBLOCK_PHYS_MAP is
enabled.  Also use MAX_NUMNODES instead of 0 as node ID similar to
memblock_add() and memblock_reserve().  Make memblock_add_range() a
static function as it is no longer getting used outside of memblock.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Collin Walling <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Philipp Rudo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tools/vm/slabinfo: fix sanity checks enabling

The sysfs file name for enabling sanity checking is called
'sanity_checks' and not 'sanity'.

The name of the file has never changed since the introduction of the
slub allocator. Obviously, most people turn the checks on via the
command line option and not during runtime using slabinfo.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE

Commit 1b2ffb7896ad ("[PATCH] Zone reclaim: Allow modification of zone
reclaim behavior")' defined RECLAIM_OFF/RECLAIM_ZONE, but never use
them, so better to remove them.

[[email protected]: fix sanity checks enabling]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: renumber the bits for neatness]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove prefetch_prev_lru_page

This macro was never used in git history. So better to remove.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan.c: remove unused return value of shrink_node

The return value of shrink_node is not used, so remove unnecessary
operations.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Song <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove "count" parameter from has_unmovable_pages()

Now that the memory isolate notifier is gone, the parameter is always 0.
Drop it and cleanup has_unmovable_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove the memory isolate notifier

Luckily, we have no users left, so we can get rid of it. Cleanup
set_migratetype_isolate() a little bit.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc: skip non present sections on zone initialization

memmap_init_zone() can be called on the ranges with holes during the
boot.  It will skip any non-valid PFNs one-by-one.  It works fine as
long as holes are not too big.

But huge holes in the memory map causes a problem.  It takes over 20
seconds to walk 32TiB hole.  x86-64 with 5-level paging allows for much
larger holes in the memory map which would practically hang the system.

Deferred struct page init doesn't help here.  It only works on the
present ranges.

Skipping non-present sections would fix the issue.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Jin, Zhi" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/early_ioremap.c: use %pa to print resource_size_t variables

%pa takes into consideration the special types such as resource_size_t.
Use this specifier %instead of explicit casting.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more()

In case memory resources for _ptr2_ were allocated, release them before
return.

Notice that in case _ptr1_ happens to be NULL, krealloc() behaves
exactly like kmalloc().

Addresses-Coverity-ID: 1490594 ("Resource leak")
Link: http://lkml.kernel.org/r/20200123160115.GA4202@embeddedor
Fixes: 3f15801cdc23 ("lib: add kasan test module")
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, tracing: print symbol name for kmem_alloc_node call_site events

Print the call_site ip of kmem_alloc_node using '%pS' to improve the
readability of raw slab trace points.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Junyong Sun <[email protected]>
Acked-by: Steven Rostedt (VMware) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page

When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
The current implementation apply comparison as

- normal 4K page: page_pfn <= pfn < page_pfn + 1
- hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
- THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR

in pfn_in_hpage.  For hugetlbfs page, it should be page_pfn == pfn

Now, change pfn_in_hpage to pfn_is_match to highlight that comparison is
not only for THP and explicitly compare for these cases.

No impact upon current behavior, just make the code clear.  I think it
is important to make the code clear - comparing hugetlbfs page in range
page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Xinhai <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcontrol.c: cleanup some useless code

Compound pages handling in mem_cgroup_migrate is more convoluted than
necessary. The state is duplicated in compound variable and the same
could be achieved by PageTransHuge check which is trivial and
hpage_nr_pages is already PageTransHuge aware.

It is much simpler to just use hpage_nr_pages for nr_pages and replace
the local variable by PageTransHuge check directly

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kaitao Cheng <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swapfile.c: swap_next should increase position index

If seq_file .next fuction does not change position index, read after
some lseek can generate unexpected output.

In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
simplify seq_file iteration code and interface") "Some ->next functions
do not increment *pos when they return NULL...  Note that such ->next
functions are buggy and should be fixed.  A simple demonstration is

  dd if=/proc/swaps bs=1000 skip=1

Choose any block size larger than the size of /proc/swaps.  This will
always show the whole last line of /proc/swaps"

Described problem is still actual.  If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.

  $ dd if=/proc/swaps bs=1  # usual output
  Filename Type Size Used Priority
  /dev/dm-0                               partition 4194812 97536 -2
  104+0 records in
  104+0 records out
  104 bytes copied

  $ dd if=/proc/swaps bs=40 skip=1    # last line was generated twice
  dd: /proc/swaps: cannot skip to specified offset
  v/dm-0                               partition 4194812 97536 -2
  /dev/dm-0                               partition 4194812 97536 -2
  3+1 records in
  3+1 records out
  131 bytes copied

https://bugzilla.kernel.org/show_bug.cgi?id=206283

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, tree-wide: rename put_user_page*() to unpin_user_page*()

In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

Fix the gup benchmark flags to use the symbolic FOLL_WRITE, instead of a
hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing it
once before issuing any of the get_user_pages*() calls. This makes it
harder to overlook, instead of having little "gup_flags & 1" phrases in
the function calls.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

powerpc: book3s64: convert to pin_user_pages() and put_user_page()

1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change vfio from get_user_pages_remote(), to
   pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
   put_user_page(), also convert the put_page() call over to
   put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty().  This is probably
more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640 [email protected]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Tested-by: Alex Williamson <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

net/xdp: set FOLL_PIN via pin_user_pages()

Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already calling
put_user_page() instead of put_page(). Therefore, in order to convert
from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/io_uring: set FOLL_PIN via pin_user_pages()

Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drm/via: set FOLL_PIN via pin_user_pages_fast()

Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required is to
change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

Convert process_vm_access to use the new pin_user_pages_remote() call,
which sets FOLL_PIN. Setting FOLL_PIN is now required for code that
requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for easier reading
of process_vm_rw_single_vec().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

goldish_pipe: convert to pin_user_pages() and put_user_page()

1. Call the new global pin_user_pages_fast(), from
   pin_goldfish_pages().

2. As required by pin_user_pages(), release these pages via
   put_user_page().  In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead of
set_page_dirty().  This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because the
page[] loop is now in gup.c instead of here, so just delete the local
release_user_pages() entirely, and call put_user_pages_dirty_lock()
directly, instead.

[1] https://lore.kernel.org/r/20190723153640 [email protected]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: introduce pin_user_pages*() and FOLL_PIN

Introduce pin_user_pages*() variations of get_user_pages*() calls, and
also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites are
converted to use the correct get_user_pages*() or pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

    pin_user_pages()
    pin_user_pages_remote()
    pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
  set it.  That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN.  These wrappers
  will:

    * Start with "pin_user_pages" instead of "get_user_pages".  That
      makes it easy to find and audit the call sites.

    * Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
this documentation.  (I've reworded it and expanded upon it.)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]> [Documentation]
Reviewed-by: Jérôme Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

media/v4l2-core: set pages dirty upon releasing DMA buffers

After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a bare
put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages were
potentially receiving data from the device.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Hans Verkuil <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

IB/umem: use get_user_pages_fast() to pin DMA pages

And get rid of the mmap_sem calls, as part of that. Note that
get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: allow FOLL_FORCE for get_user_pages_fast()

Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote().  Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnessary pair of calls that were releasing and
reacquiring the mmap_sem.  There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.  So
remove it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Tested-by: Alex Williamson <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals of
get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
to Dan Williams for helping clarify the DAX refactoring.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Tested-by: Alex Williamson <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

goldish_pipe: rename local pin_user_pages() routine

Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages() function.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

An upcoming patch changes and complicates the refcounting and especially
the "put page" aspects of it.  In order to keep everything clean,
refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(), and
  limit the functionality to "read only": return a bool, with no side
  effects.

* Add a new routine, put_devmap_managed_page(), to handle decrementing
  the refcount for ZONE_DEVICE pages.

* Change callers (just release_pages() and put_page()) to check
  page_is_devmap_managed() before calling the new
  put_devmap_managed_page() routine.  This is a performance point:
  put_page() is a hot path, so we need to avoid non- inline function calls
  where possible.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
  limit the functionality to unconditionally freeing a devmap page.

This is originally based on a separate patch by Ira Weiny, which applied
to an early version of the put_user_page() experiments.  Since then,
Jérôme Glisse suggested the refactoring described above.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Suggested-by: Jérôme Glisse <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: Cleanup __put_devmap_managed_page() vs ->page_free()

After the removal of the device-public infrastructure there are only 2
->page_free() call backs in the kernel.  One of those is a
device-private callback in the nouveau driver, the other is a generic
wakeup needed in the DAX case.  In the hopes that all ->page_free()
callbacks can be migrated to common core kernel functionality, move the
device-private specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback.  For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: move try_get_compound_head() to top, fix minor issues

An upcoming patch uses try_get_compound_head() more widely, so move it to
the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: factor out duplicate code from four routines

Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.

Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017.  :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on.  That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup").  In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()

Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

    * io_uring: Some directed tests from liburing exercise this, and
                they pass.
    * process_vm_access.c: A small directed test passes.
    * gup_benchmark: the enhanced version hits the new gup.c code, and
                     passes.
    * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
    * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                      not ready just yet)
    * powerpc: it compiles...
    * drm/via: compiles...
    * goldfish: compiles...
    * net/xdp: compiles...
    * media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

This patch (of 22):

There are four locations in gup.c that have a fair amount of code
duplication.  This means that changing one requires making the same
changes in four places, not to mention reading the same code four times,
and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded by
get_page()/put_page().

Also, further simplify (slightly), by waiting until the the successful
end of each routine, to increment *nr.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge

No functional change, just leverage the helper function to improve
readability as others.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix gup_pud_range

sorry for not processing for a long time.  I met it again.

patch v1   https://lkml.org/lkml/2019/9/20/656

do_machine_check()
  do_memory_failure()
    memory_failure()
      hw_poison_user_mappings()
        try_to_unmap()
          pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));

...and now we have a swap entry that indicates that the page entry
refers to a bad (and poisoned) page of memory, but gup_fast() at this
level of the page table was ignoring swap entries, and incorrectly
assuming that "!pxd_none() == valid and present".

And this was not just a poisoned page problem, but a generaly swap entry
problem.  So, any swap entry type (device memory migration, numa
migration, or just regular swapping) could lead to the same problem.

Fix this by checking for pxd_present(), instead of pxd_none().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qiujun Huang <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap.c: clean up filemap_write_and_wait()

At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range()

Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range()

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/debug.c: always print flags in dump_page()

Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
inadvertently removed printing of page flags for pages that are neither
anon nor ksm nor have a mapping. Fix that.

Using pr_cont() again would be a solution, but the commit explicitly
removed its use. Avoiding the danger of mixing up split lines from
multiple CPUs might be beneficial for near-panic dumps like this, so fix
this without reintroducing pr_cont().

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Anshuman Khandual <[email protected]>
Reported-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Ralph Campbell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t

kmemleak_lock as a rwlock on RT can possibly be acquired in atomic
context which does work.

Since the kmemleak operation is performed in atomic context make it a
raw_spinlock_t so it can also be acquired on RT.  This is used for
debugging and is not enabled by default in a production like environment
(where performance/latency matters) so it makes sense to make it a
raw_spinlock_t instead trying to get rid of the atomic context.  Turn
also the kmemleak_object->lock into raw_spinlock_t which is acquired
(nested) while the kmemleak_lock is held.

The time spent in "echo scan > kmemleak" slightly improved on 64core box
with this patch applied after boot.

[[email protected]: redo the description, update comments. Merge the individual bits:  He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.]
Link: http://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: He Zhe <[email protected]>
Signed-off-by: Liu Haitao <[email protected]>
Signed-off-by: Yongxin Liu <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/slub.c: avoid slub allocation while holding list_lock

If we are already under list_lock, don't call kmalloc().  Otherwise we
will run into a deadlock because kmalloc() also tries to grab the same
lock.

Fix the problem by using a static bitmap instead.

  WARNING: possible recursive locking detected
  --------------------------------------------
  mount-encrypted/4921 is trying to acquire lock:
  (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437

  but task is already holding lock:
  (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&(&n->list_lock)->rlock);
    lock(&(&n->list_lock)->rlock);

   *** DEADLOCK ***

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction

For the uniform format, we use ocfs2_update_inode_fsync_trans() to
access t_tid in handle->h_transaction

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yan Wang <[email protected]>
Reviewed-by: Jun Piao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()

I found a NULL pointer dereference in ocfs2_update_inode_fsync_trans(),
handle->h_transaction may be NULL in this situation:

ocfs2_file_write_iter
  ->__generic_file_write_iter
      ->generic_perform_write
        ->ocfs2_write_begin
          ->ocfs2_write_begin_nolock
            ->ocfs2_write_cluster_by_desc
              ->ocfs2_write_cluster
                ->ocfs2_mark_extent_written
                  ->ocfs2_change_extent_flag
                    ->ocfs2_split_extent
                      ->ocfs2_try_to_merge_extent
                        ->ocfs2_extend_rotate_transaction
                          ->ocfs2_extend_trans
                            ->jbd2_journal_restart
                              ->jbd2__journal_restart
                                // handle->h_transaction is NULL here
                                ->handle->h_transaction = NULL;
                                ->start_this_handle
                                  /* journal aborted due to storage
                                     network disconnection, return error */
                                  ->return -EROFS;
                         /* line 3806 in ocfs2_try_to_merge_extent (),
                            it will ignore ret error. */
                        ->ret = 0;
        ->...
        ->ocfs2_write_end
          ->ocfs2_write_end_nolock
            ->ocfs2_update_inode_fsync_trans
              // NULL pointer dereference
              ->oi->i_sync_tid = handle->h_transaction->t_tid;

The information of NULL pointer dereference as follows:
    JBD2: Detected IO errors while flushing file data on dm-11-45
    Aborting journal on device dm-11-45.
    JBD2: Error -5 detected when updating journal superblock for dm-11-45.
    (dd,22081,3):ocfs2_extend_trans:474 ERROR: status = -30
    (dd,22081,3):ocfs2_try_to_merge_extent:3877 ERROR: status = -30
    Unable to handle kernel NULL pointer dereference at
    virtual address 0000000000000008
    Mem abort info:
      ESR = 0x96000004
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000004
      CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e74e1338
    [0000000000000008] pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    Process dd (pid: 22081, stack limit = 0x00000000584f35a9)
    CPU: 3 PID: 22081 Comm: dd Kdump: loaded
    Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
    pstate: 60400009 (nZCv daif +PAN -UAO)
    pc : ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
    lr : ocfs2_write_end_nolock+0x2a0/0x550 [ocfs2]
    sp : ffff0000459fba70
    x29: ffff0000459fba70 x28: 0000000000000000
    x27: ffff807ccf7f1000 x26: 0000000000000001
    x25: ffff807bdff57970 x24: ffff807caf1d4000
    x23: ffff807cc79e9000 x22: 0000000000001000
    x21: 000000006c6cd000 x20: ffff0000091d9000
    x19: ffff807ccb239db0 x18: ffffffffffffffff
    x17: 000000000000000e x16: 0000000000000007
    x15: ffff807c5e15bd78 x14: 0000000000000000
    x13: 0000000000000000 x12: 0000000000000000
    x11: 0000000000000000 x10: 0000000000000001
    x9 : 0000000000000228 x8 : 000000000000000c
    x7 : 0000000000000fff x6 : ffff807a308ed6b0
    x5 : ffff7e01f10967c0 x4 : 0000000000000018
    x3 : d0bc661572445600 x2 : 0000000000000000
    x1 : 000000001b2e0200 x0 : 0000000000000000
    Call trace:
     ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
     ocfs2_write_end+0x4c/0x80 [ocfs2]
     generic_perform_write+0x108/0x1a8
     __generic_file_write_iter+0x158/0x1c8
     ocfs2_file_write_iter+0x668/0x950 [ocfs2]
     __vfs_write+0x11c/0x190
     vfs_write+0xac/0x1c0
     ksys_write+0x6c/0xd8
     __arm64_sys_write+0x24/0x30
     el0_svc_common+0x78/0x130
     el0_svc_handler+0x38/0x78
     el0_svc+0x8/0xc

To prevent NULL pointer dereference in this situation, we use
is_handle_aborted() before using handle->h_transaction->t_tid.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yan Wang <[email protected]>
Reviewed-by: Jun Piao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use

There are users already and will be more of BITS_TO_BYTES() macro.  Move
it to bitops.h for wider use.

In the case of ocfs2 the replacement is identical.

As for bnx2x, there are two places where floor version is used.  In the
first case to calculate the amount of structures that can fit one memory
page.  In this case obviously the ceiling variant is correct and
original code might have a potential bug, if amount of bits % 8 is not
0.  In the second case the macro is used to calculate bytes transmitted
in one microsecond.  This will work for all speeds which is multiply of
1Gbps without any change, for the rest new code will give ceiling value,
for instance 100Mbps will give 13 bytes, while old code gives 12 bytes
and the arithmetically correct one is 12.5 bytes.  Further the value is
used to setup timer threshold which in any case has its own margins due
to certain resolution.  I don't see here an issue with slightly shifting
thresholds for low speed connections, the card is supposed to utilize
highest available rate, which is usually 10Gbps.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Acked-by: Sudarsana Reddy Kalluru <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2/dlm: remove redundant assignment to ret

The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses Coverity ("Unused value")

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: make local header paths relative to C files

Gang He reports the failure of building fs/ocfs2/ as an external module
of the kernel installed on the system:

$ cd fs/ocfs2
$ make -C /lib/modules/`uname -r`/build M=`pwd` modules

If you want to make it work reliably, I'd recommend to remove ccflags-y
from the Makefiles, and to make header paths relative to the C files. I
think this is the correct usage of the #include "..." directive.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Gang He <[email protected]>
Reported-by: Gang He <[email protected]>
Reviewed-by: Gang He <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: remove unneeded semicolons

Fixes coccicheck warnings:

fs/ocfs2/cluster/quorum.c:76:2-3: Unneeded semicolon
fs/ocfs2/dlmglue.c:573:2-3: Unneeded semicolon

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres

In the only caller of dlm_migrate_lockres() - dlm_empty_lockres(),
target is checked for O2NM_MAX_NODES. Thus, the assertion in
dlm_migrate_lockres() is unnecessary and can be removed. The patch
eliminates such a check.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aditya Pakki <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spelling.txt: add "issus" typo

Add "issus" and correct it as "issues".

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Luca Ceresoli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spelling.txt: add more spellings to spelling.txt

Here are some of the common spelling mistakes and typos that I've found
while fixing up spelling mistakes in the kernel. Most of them still
exist in more than two source files.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiong <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Chris Paterson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: move_pages: report the number of non-attempted pages

Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
semantic of move_pages() has changed to return the number of
non-migrated pages if they were result of a non-fatal reasons (usually a
busy page).

This was an unintentional change that hasn't been noticed except for LTP
tests which checked for the documented behavior.

There are two ways to go around this change.  We can even get back to
the original behavior and return -EAGAIN whenever migrate_pages is not
able to migrate pages due to non-fatal reasons.  Another option would be
to simply continue with the changed semantic and extend move_pages
documentation to clarify that -errno is returned on an invalid input or
when migration simply cannot succeed (e.g.  -ENOMEM, -EBUSY) or the
number of pages that couldn't have been migrated due to ephemeral
reasons (e.g.  page is pinned or locked for other reasons).

This patch implements the second option because this behavior is in
place for some time without anybody complaining and possibly new users
depending on it.  Also it allows to have a slightly easier error
handling as the caller knows that it is worth to retry when err > 0.

But since the new semantic would be aborted immediately if migration is
failed due to ephemeral reasons, need include the number of
non-attempted pages in the return value too.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: <[email protected]> [4.17+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: thp: don't need care deferred split queue in memcg charge move path

If compound is true, this means it is a PMD mapped THP.  Which implies
the page is not linked to any defer list.  So the first code chunk will
not be executed.

Also with this reason, it would not be proper to add this page to a
defer list.  So the second code chunk is not correct.

Based on this, we should remove the defer list related code.

[[email protected]: better patch title]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Wei Yang <[email protected]>
Suggested-by: Kirill A. Shutemov <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]> [5.4+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: fix remove_memory() lockdep splat

The daxctl unit test for the dax_kmem driver currently triggers the
(false positive) lockdep splat below.  It results from the fact that
remove_memory_block_devices() is invoked under the mem_hotplug_lock()
causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
active state tracking).  It is a false positive because the sysfs
attribute path triggering the memory remove is not the same attribute
path associated with memory-block device.

sysfs_break_active_protection() is not applicable since there is no real
deadlock conflict, instead move memory-block device removal outside the
lock.  The mem_hotplug_lock() is not needed to synchronize the
memory-block device removal vs the page online state, that is already
handled by lock_device_hotplug().  Specifically, lock_device_hotplug()
is sufficient to allow try_remove_memory() to check the offline state of
the memblocks and be assured that any in progress online attempts are
flushed / blocked by kernfs_drain() / attribute removal.

The add_memory() path safely creates memblock devices under the
mem_hotplug_lock().  There is no kernfs active state synchronization in
the memblock device_register() path, so nothing to fix there.

This change is only possible thanks to the recent change that refactored
memory block device removal out of arch_remove_memory() (commit
4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before
arch_remove_memory()"), and David's due diligence tracking down the
guarantees afforded by kernfs_drain().  Not flagged for -stable since
this only impacts ongoing development and lockdep validation, not a
runtime issue.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G           OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           get_online_mems+0x3e/0xb0
           kmem_cache_create_usercopy+0x2e/0x260
           kmem_cache_create+0x12/0x20
           ptlock_cache_init+0x20/0x28
           start_kernel+0x243/0x547
           secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           cpus_read_lock+0x3e/0xb0
           online_pages+0x37/0x300
           memory_subsys_online+0x17d/0x1c0
           device_online+0x60/0x80
           state_store+0x65/0xd0
           kernfs_fop_write+0xcf/0x1c0
           vfs_write+0xdb/0x1d0
           ksys_write+0x65/0xe0
           do_syscall_64+0x5c/0xa0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
           check_prev_add+0x98/0xa40
           validate_chain+0x576/0x860
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           __kernfs_remove+0x25f/0x2e0
           kernfs_remove_by_name_ns+0x41/0x80
           remove_files.isra.0+0x30/0x70
           sysfs_remove_group+0x3d/0x80
           sysfs_remove_groups+0x29/0x40
           device_remove_attrs+0x39/0x70
           device_del+0x16a/0x3f0
           device_unregister+0x16/0x60
           remove_memory_block_devices+0x82/0xb0
           try_remove_memory+0xb5/0x130
           remove_memory+0x26/0x40
           dev_dax_kmem_remove+0x44/0x6a [kmem]
           device_release_driver_internal+0xe4/0x1c0
           unbind_store+0xef/0x120
           kernfs_fop_write+0xcf/0x1c0
           vfs_write+0xdb/0x1d0
           ksys_write+0x65/0xe0
           do_syscall_64+0x5c/0xa0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
      kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(mem_hotplug_lock.rw_sem);
                                   lock(cpu_hotplug_lock.rw_sem);
                                   lock(mem_hotplug_lock.rw_sem);
      lock(kn->count#241);

     *** DEADLOCK ***

No fixes tag as this has been a long standing issue that predated the
addition of kernfs lockdep annotations.

Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate.c: also overwrite error when it is bigger than zero

If we get here after successfully adding page to list, err would be 1 to
indicate the page is queued in the list.

Current code has two problems:

  * on success, 0 is not returned
  * on error, if add_page_for_migratioin() return 1, and the following err1
    from do_move_pages_to_node() is set, the err1 is not returned since err
    is 1

And these behaviors break the user interface.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node").
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: reset section's mem_map when fully deactivated

After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
when a mem section is fully deactivated, section_mem_map still records
the section's start pfn, which is not used any more and will be
reassigned during re-addition.

In analogy with alloc/free pattern, it is better to clear all fields of
section_mem_map.

Beside this, it breaks the user space tool "makedumpfile" [1], which
makes assumption that a hot-removed section has mem_map as NULL, instead
of checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile
will be better to change the assumption, and need a patch)

The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" ,
trigger a crash, and save vmcore by makedumpfile

[1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pingfan Liu <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy.c: fix out of bounds write in mpol_parse_str()

What we are trying to do is change the '=' character to a NUL terminator
and then at the end of the function we restore it back to an '='. The
problem is there are two error paths where we jump to the end of the
function before we have replaced the '=' with NUL.

We end up putting the '=' in the wrong place (possibly one element
before the start of the buffer).

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Dmitry Vyukov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: fix a crash in wb_workfn when a device disappears

Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures.  In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.

With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi.  There is a refcount which prevents
the bdi object from being released (and hence, unregistered).  So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).

Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly.  It does
this without informing the file system, or the memcg code, or anything
else.  This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown.  So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk().  As a result, *boom*.

Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.

The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on.  It was triggering
several times a day in a heavily loaded production environment.

Google Bug Id: 145475544

Link: https://lore.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_bitmap: correct test data offsets for 32-bit

On 32-bit platform the size of long is only 32 bits which makes wrong
offset in the array of 64 bit size.

Calculate offset based on BITS_PER_LONG.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 30544ed5de43 ("lib/bitmap: introduce bitmap_replace() helper")
Signed-off-by: Andy Shevchenko <[email protected]>
Reported-by: Guenter Roeck <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Yury Norov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>