lib/scatterlist.c: adjust indentation in __sg_alloc_table
Clang warns:
../lib/scatterlist.c:314:5: warning: misleading indentation; statement
is not part of the previous 'if' [-Wmisleading-indentation]
return -ENOMEM;
^
../lib/scatterlist.c:311:4: note: previous statement is here
if (prv)
^
1 warning generated.
This warning occurs because there is a space before the tab on this
line. Remove it so that the indentation is consistent with the Linux
kernel coding style and clang no longer warns.
Mikhail Zaslonko [Fri, 31 Jan 2020 06:16:33 +0000 (22:16 -0800)]
btrfs: use larger zlib buffer for s390 hardware compression
In order to benefit from s390 zlib hardware compression support,
increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390
zlib hardware support is enabled on the machine).
This brings up to 60% better performance in hardware on s390 compared to
the PAGE_SIZE buffer and much more compared to the software zlib
processing in btrfs. In case of memory pressure, fall back to a single
page buffer during workspace allocation.
The data compressed with larger input buffers will still conform to zlib
standard and thus can be decompressed also on a systems that uses only
PAGE_SIZE buffer for btrfs zlib.
Mikhail Zaslonko [Fri, 31 Jan 2020 06:16:27 +0000 (22:16 -0800)]
s390/boot: add dfltcc= kernel command line parameter
Add the new kernel command line parameter 'dfltcc=' to configure s390
zlib hardware support.
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
level 1 and decompression (default)
off: No s390 zlib hardware support
def_only: s390 zlib hardware support for deflate
only (compression on level 1)
inf_only: s390 zlib hardware support for inflate
only (decompression)
always: Same as 'on' but ignores the selected compression
level always using hardware support (used for debugging)
Mikhail Zaslonko [Fri, 31 Jan 2020 06:16:23 +0000 (22:16 -0800)]
lib/zlib: add s390 hardware support for kernel zlib_inflate
Add decompression functions to zlib_dfltcc library. Update zlib_inflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware inflate
decompression.
Mikhail Zaslonko [Fri, 31 Jan 2020 06:16:17 +0000 (22:16 -0800)]
lib/zlib: add s390 hardware support for kernel zlib_deflate
Patch series "S390 hardware support for kernel zlib", v3.
With IBM z15 mainframe the new DFLTCC instruction is available. It
implements deflate algorithm in hardware (Nest Acceleration Unit - NXU)
with estimated compression and decompression performance orders of
magnitude faster than the current zlib.
This patchset adds s390 hardware compression support to kernel zlib.
The code is based on the userspace zlib implementation:
https://github.com/madler/zlib/pull/410
The coding style is also preserved for future maintainability. There is
only limited set of userspace zlib functions represented in kernel.
Apart from that, all the memory allocation should be performed in
advance. Thus, the workarea structures are extended with the parameter
lists required for the DEFLATE CONVENTION CALL instruction.
Since kernel zlib itself does not support gzip headers, only Adler-32
checksum is processed (also can be produced by DFLTCC facility). Like
it was implemented for userspace, kernel zlib will compress in hardware
on level 1, and in software on all other levels. Decompression will
always happen in hardware (when enabled).
Two DFLTCC compression calls produce the same results only when they
both are made on machines of the same generation, and when the
respective buffers have the same offset relative to the start of the
page. Therefore care should be taken when using hardware compression
when reproducible results are desired. However it does always produce
the standard conform output which can be inflated anyway.
The new kernel command line parameter 'dfltcc' is introduced to
configure s390 zlib hardware support:
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
level 1 and decompression (default)
off: No s390 zlib hardware support
def_only: s390 zlib hardware support for deflate
only (compression on level 1)
inf_only: s390 zlib hardware support for inflate
only (decompression)
always: Same as 'on' but ignores the selected compression
level always using hardware support (used for debugging)
The main purpose of the integration of the NXU support into the kernel
zlib is the use of hardware deflate in btrfs filesystem with on-the-fly
compression enabled. Apart from that, hardware support can also be used
during boot for decompressing the kernel or the ramdisk image
With the patch for btrfs expanding zlib buffer from 1 to 4 pages (patch
6) the following performance results have been achieved using the
ramdisk with btrfs. These are relative numbers based on throughput rate
and compression ratio for zlib level 1:
Input data Deflate rate Inflate rate Compression ratio
NXU/Software NXU/Software NXU/Software
stream of zeroes 1.46 1.02 1.00
random ASCII data 10.44 3.00 0.96
ASCII text (dickens) 6,21 3.33 0.94
binary data (vmlinux) 8,37 3.90 1.02
This means that s390 hardware deflate can provide up to 10 times faster
compression (on level 1) and up to 4 times faster decompression (refers
to all compression levels) for btrfs zlib.
Disclaimer: Performance results are based on IBM internal tests using DD
command-line utility on btrfs on a Fedora 30 based internal driver in
native LPAR on a z15 system. Results may vary based on individual
workload, configuration and software levels.
This patch (of 9):
Create zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL
implementation and related compression functions. Update zlib_deflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware deflate.
Akinobu Mita [Fri, 31 Jan 2020 06:15:57 +0000 (22:15 -0800)]
thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h>
This removes the kelvin to/from Celsius conversion helper macros in
<linux/thermal.h> which were switched to the inline helper functions in
<linux/units.h>.
Akinobu Mita [Fri, 31 Jan 2020 06:15:44 +0000 (22:15 -0800)]
thermal: int340x: switch to use <linux/units.h> helpers
This switches the int340x thermal zone driver to use
deci_kelvin_to_millicelsius() and millicelsius_to_deci_kelvin() in
<linux/units.h> instead of helpers in <linux/thermal.h>.
This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.
Akinobu Mita [Fri, 31 Jan 2020 06:15:40 +0000 (22:15 -0800)]
platform/x86: intel_menlow: switch to use <linux/units.h> helpers
This switches the intel_menlow driver to use deci_kelvin_to_celsius()
and celsius_to_deci_kelvin() in <linux/units.h> instead of helpers in
<linux/thermal.h>.
This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.
This also removes a trailing space, while we're at it.
Akinobu Mita [Fri, 31 Jan 2020 06:15:37 +0000 (22:15 -0800)]
platform/x86: asus-wmi: switch to use <linux/units.h> helpers
The asus-wmi driver doesn't implement the thermal device functionality
directly, so including <linux/thermal.h> just for
DECI_KELVIN_TO_CELSIUS() is a bit odd.
This switches the asus-wmi driver to use deci_kelvin_to_millicelsius()
in <linux/units.h>.
The format string is changed from %d to %ld due to function returned
type.
Akinobu Mita [Fri, 31 Jan 2020 06:15:33 +0000 (22:15 -0800)]
ACPI: thermal: switch to use <linux/units.h> helpers
This switches the ACPI thermal zone driver to use
celsius_to_deci_kelvin(), deci_kelvin_to_celsius(), and
deci_kelvin_to_millicelsius_with_offset() in <linux/units.h> instead of
helpers in <linux/thermal.h>.
This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.
Akinobu Mita [Fri, 31 Jan 2020 06:15:28 +0000 (22:15 -0800)]
include/linux/units.h: add helpers for kelvin to/from Celsius conversion
Patch series "add header file for kelvin to/from Celsius conversion
helpers", v4.
There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers. These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just
for the helpers.
This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems, and switches all the users of
conversion helpers in <linux/thermal.h> to use <linux/units.h> helpers.
This patch (of 12):
There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers. These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just
for the helpers.
This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems. It is intended to replace the
helpers in <linux/thermal.h>.
Colin Ian King [Fri, 31 Jan 2020 06:15:25 +0000 (22:15 -0800)]
drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store
Currently when an error code -EIO or -ENOSPC in the for-loop of
writeback_store the error code is being overwritten by a ret = len
assignment at the end of the function and the error codes are being
lost. Fix this by assigning ret = len at the start of the function and
remove the assignment from the end, hence allowing ret to be preserved
when error codes are assigned to it.
Taejoon Song [Fri, 31 Jan 2020 06:15:22 +0000 (22:15 -0800)]
zram: try to avoid worst-case scenario on same element pages
The worst-case scenario on finding same element pages is that almost all
elements are same at the first glance but only last few elements are
different.
Since the same element tends to be grouped from the beginning of the
pages, if we check the first element with the last element before
looping through all elements, we might have some chances to quickly
detect non-same element pages.
1. Test is done under LG webOS TV (64-bit arch)
2. Dump the swap-out pages (~819200 pages)
3. Analyze the pages with simple test script which counts the iteration
number and measures the speed at off-line
Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes =
512. The speed is based on the time to consume page_same_filled()
function only. The result, on average, is listed as below:
Num of Iter Speed(MB/s)
Looping-Forward (Orig) 38 99265
Looping-Backward 36 102725
Last-element-check (This Patch) 33 125072
The result shows that the average iteration count decreases by 13% and
the speed increases by 25% with this patch. This patch does not
increase the overall time complexity, though.
I also ran simpler version which uses backward loop. Just looping
backward also makes some improvement, but less than this patch.
include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block
memory_block structure elements 'hw' and 'phys_callback' are not getting
used. This was originally added with commit 3947be1969a9 ("[PATCH]
memory hotplug: sysfs and add/remove functions") but never seem to have
been used. Just drop them now.
Dan Carpenter [Fri, 31 Jan 2020 06:15:07 +0000 (22:15 -0800)]
zswap: potential NULL dereference on error in init_zswap()
The "pool" pointer can be NULL at the end of the init_zswap(). (We
would allocate a new pool later in that situation)
So in the error handling then we need to make sure pool is a valid
pointer before calling "zswap_pool_destroy(pool);" because that function
dereferences the argument.
Vitaly Wool [Fri, 31 Jan 2020 06:15:04 +0000 (22:15 -0800)]
mm/zswap.c: add allocation hysteresis if pool limit is hit
zswap will always try to shrink pool when zswap is full. If there is a
high pressure on zswap it will result in flipping pages in and out zswap
pool without any real benefit, and the overall system performance will
drop. The previous discussion on this subject [1] ended up with a
suggestion to implement a sort of hysteresis to refuse taking pages into
zswap pool until it has sufficient space if the limit has been hit.
This is my take on this.
Hysteresis is controlled with a sysfs-configurable parameter (namely,
/sys/kernel/debug/zswap/accept_threhsold_percent). It specifies the
threshold at which zswap would start accepting pages again after it
became full. Setting this parameter to 100 disables the hysteresis and
sets the zswap behavior to pre-hysteresis state.
Qian Cai [Fri, 31 Jan 2020 06:15:01 +0000 (22:15 -0800)]
mm/page_isolation: fix potential warning from user
It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
from start_isolate_page_range(), but should avoid triggering it from
userspace, i.e, from is_mem_section_removable() because it could crash
the system by a non-root user if warn_on_panic is set.
While at it, simplify the code a bit by removing an unnecessary jump
label.
Qian Cai [Fri, 31 Jan 2020 06:14:57 +0000 (22:14 -0800)]
mm/hotplug: silence a lockdep splat with printk()
It is not that hard to trigger lockdep splats by calling printk from
under zone->lock. Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well). There are some console drivers which do
allocate from the printk context as well and those should be fixed. In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.
So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock. Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.
Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug. The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.
While at it, remove a similar but unnecessary debug-only printk() as
well. A sample of one of those lockdep splats is,
WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/8653 is trying to acquire lock: ffffffff865a4460 (console_owner){-.-.}, at:
console_unlock+0x207/0x750
but task is already holding lock: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
Patch series "mm/memory_hotplug: pass in nid to online_pages()".
Simplify onlining code and get rid of find_memory_block(). Pass in the
nid from the memory block we are trying to online directly, instead of
manually looking it up.
This patch (of 2):
No need to lookup the memory block, we can directly pass in the nid.
David Rientjes [Fri, 31 Jan 2020 06:14:48 +0000 (22:14 -0800)]
mm, thp: fix defrag setting if newline is not used
If thp defrag setting "defer" is used and a newline is *not* used when
writing to the sysfs file, this is interpreted as the "defer+madvise"
option.
This is because we do prefix matching and if five characters are written
without a newline, the current code ends up comparing to the first five
bytes of the "defer+madvise" option and using that instead.
Use the more appropriate sysfs_streq() that handles the trailing newline
for us. Since this doubles as a nice cleanup, do it in enabled_store()
as well.
The current implementation relies on prefix matching: the number of
bytes compared is either the number of bytes written or the length of
the option being compared. With a newline, "defer\n" does not match
"defer+"madvise"; without a newline, however, "defer" is considered to
match "defer+madvise" (prefix matching is only comparing the first five
bytes). End result is that writing "defer" is broken unless it has an
additional trailing character.
This means that writing "madv" in the past would match and set
"madvise". With strict checking, that no longer is the case but it is
unlikely anybody is currently doing this.
Ralph Campbell [Fri, 31 Jan 2020 06:14:44 +0000 (22:14 -0800)]
mm/migrate: add stable check in migrate_vma_insert_page()
migrate_vma_insert_page() closely follows the code in:
__handle_mm_fault()
handle_pte_fault()
do_anonymous_page()
Add a call to check_stable_address_space() after locking the page table
entry before inserting a ZONE_DEVICE private zero page mapping similar
to page faulting a new anonymous page.
Wei Yang [Fri, 31 Jan 2020 06:14:32 +0000 (22:14 -0800)]
mm/huge_memory.c: use head to emphasize the purpose of page
During split huge page, it checks the property of the page. Currently
we do the check on page and head without emphasizing the check is on the
compound page. In case the page passed to split_huge_page_to_list is a
tail page, audience would take some time to think about whether the
check is on compound page or tail page itself.
To make it explicit, use head instead of page for those checks. After
this, audience would be more clear about the checks are on compound page
and the page is used to do the split and dump error message if failed.
David Rientjes [Fri, 31 Jan 2020 06:14:26 +0000 (22:14 -0800)]
mm, oom: dump stack of victim when reaping failed
When a process cannot be oom reaped, for whatever reason, currently the
list of locks that are held is currently dumped to the kernel log.
Much more interesting is the stack trace of the victim that cannot be
reaped. If the stack trace is dumped, we have the ability to find
related occurrences in the same kernel code and hopefully solve the
issue that is making it wedged.
Dump the stack trace when a process fails to be oom reaped.
On the s390 platform memblock.physmem array is being built by directly
calling into memblock_add_range() which is a low level function not
intended to be used outside of memblock. Hence lets conditionally add
helper functions for physmem array when HAVE_MEMBLOCK_PHYS_MAP is
enabled. Also use MAX_NUMNODES instead of 0 as node ID similar to
memblock_add() and memblock_reserve(). Make memblock_add_range() a
static function as it is no longer getting used outside of memblock.
Daniel Wagner [Fri, 31 Jan 2020 06:14:17 +0000 (22:14 -0800)]
tools/vm/slabinfo: fix sanity checks enabling
The sysfs file name for enabling sanity checking is called
'sanity_checks' and not 'sanity'.
The name of the file has never changed since the introduction of the
slub allocator. Obviously, most people turn the checks on via the
command line option and not during runtime using slabinfo.
Alex Shi [Fri, 31 Jan 2020 06:14:14 +0000 (22:14 -0800)]
mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE
Commit 1b2ffb7896ad ("[PATCH] Zone reclaim: Allow modification of zone
reclaim behavior")' defined RECLAIM_OFF/RECLAIM_ZONE, but never use
them, so better to remove them.
mm/page_alloc: skip non present sections on zone initialization
memmap_init_zone() can be called on the ranges with holes during the
boot. It will skip any non-valid PFNs one-by-one. It works fine as
long as holes are not too big.
But huge holes in the memory map causes a problem. It takes over 20
seconds to walk 32TiB hole. x86-64 with 5-level paging allows for much
larger holes in the memory map which would practically hang the system.
Deferred struct page init doesn't help here. It only works on the
present ranges.
Skipping non-present sections would fix the issue.
in pfn_in_hpage. For hugetlbfs page, it should be page_pfn == pfn
Now, change pfn_in_hpage to pfn_is_match to highlight that comparison is
not only for THP and explicitly compare for these cases.
No impact upon current behavior, just make the code clear. I think it
is important to make the code clear - comparing hugetlbfs page in range
page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.
Kaitao Cheng [Fri, 31 Jan 2020 06:13:42 +0000 (22:13 -0800)]
mm/memcontrol.c: cleanup some useless code
Compound pages handling in mem_cgroup_migrate is more convoluted than
necessary. The state is duplicated in compound variable and the same
could be achieved by PageTransHuge check which is trivial and
hpage_nr_pages is already PageTransHuge aware.
It is much simpler to just use hpage_nr_pages for nr_pages and replace
the local variable by PageTransHuge check directly
Vasily Averin [Fri, 31 Jan 2020 06:13:39 +0000 (22:13 -0800)]
mm/swapfile.c: swap_next should increase position index
If seq_file .next fuction does not change position index, read after
some lseek can generate unexpected output.
In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
simplify seq_file iteration code and interface") "Some ->next functions
do not increment *pos when they return NULL... Note that such ->next
functions are buggy and should be fixed. A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps"
Described problem is still actual. If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.
$ dd if=/proc/swaps bs=1 # usual output
Filename Type Size Used Priority
/dev/dm-0 partition 4194812 97536 -2
104+0 records in
104+0 records out
104 bytes copied
$ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0 partition 4194812 97536 -2
/dev/dm-0 partition 4194812 97536 -2
3+1 records in
3+1 records out
131 bytes copied
John Hubbard [Fri, 31 Jan 2020 06:13:35 +0000 (22:13 -0800)]
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.
John Hubbard [Fri, 31 Jan 2020 06:13:32 +0000 (22:13 -0800)]
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"
Fix the gup benchmark flags to use the symbolic FOLL_WRITE, instead of a
hard-coded "1" value.
Also, clean up the filtering of gup flags a little, by just doing it
once before issuing any of the get_user_pages*() calls. This makes it
harder to overlook, instead of having little "gup_flags & 1" phrases in
the function calls.
John Hubbard [Fri, 31 Jan 2020 06:13:24 +0000 (22:13 -0800)]
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().
2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().
Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is probably
more accurate.
As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]
John Hubbard [Fri, 31 Jan 2020 06:13:17 +0000 (22:13 -0800)]
net/xdp: set FOLL_PIN via pin_user_pages()
Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.
In partial anticipation of this work, the net/xdp code was already calling
put_user_page() instead of put_page(). Therefore, in order to convert
from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().
John Hubbard [Fri, 31 Jan 2020 06:13:13 +0000 (22:13 -0800)]
fs/io_uring: set FOLL_PIN via pin_user_pages()
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().
In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().
John Hubbard [Fri, 31 Jan 2020 06:13:09 +0000 (22:13 -0800)]
drm/via: set FOLL_PIN via pin_user_pages_fast()
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().
In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required is to
change get_user_pages() to pin_user_pages().
John Hubbard [Fri, 31 Jan 2020 06:13:05 +0000 (22:13 -0800)]
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
Convert process_vm_access to use the new pin_user_pages_remote() call,
which sets FOLL_PIN. Setting FOLL_PIN is now required for code that
requires tracking of pinned pages.
Also, release the pages via put_user_page*().
Also, rename "pages" to "pinned_pages", as this makes for easier reading
of process_vm_rw_single_vec().
John Hubbard [Fri, 31 Jan 2020 06:13:02 +0000 (22:13 -0800)]
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
Convert infiniband to use the new pin_user_pages*() calls.
Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.
The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.
John Hubbard [Fri, 31 Jan 2020 06:12:58 +0000 (22:12 -0800)]
goldish_pipe: convert to pin_user_pages() and put_user_page()
1. Call the new global pin_user_pages_fast(), from
pin_goldfish_pages().
2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().
That has the side effect of calling set_page_dirty_lock(), instead of
set_page_dirty(). This is probably more accurate.
As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]
Another side effect is that the release code is simplified because the
page[] loop is now in gup.c instead of here, so just delete the local
release_user_pages() entirely, and call put_user_pages_dirty_lock()
directly, instead.
All pages that are pinned via the above calls, must be unpinned via
put_user_page().
The underlying rules are:
* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.
* Call sites that want to indicate that they are going to do DirectIO
("DIO") or something with similar characteristics, should call a
get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
will:
* Start with "pin_user_pages" instead of "get_user_pages". That
makes it easy to find and audit the call sites.
* Set FOLL_PIN
* For pages that are received via FOLL_PIN, those pages must be returned
via put_user_page().
Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
this documentation. (I've reworded it and expanded upon it.)
John Hubbard [Fri, 31 Jan 2020 06:12:50 +0000 (22:12 -0800)]
media/v4l2-core: set pages dirty upon releasing DMA buffers
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a bare
put_page() call, without any set_page_dirty*() call.
Fix the problem, by calling set_page_dirty_lock() if the CPU pages were
potentially receiving data from the device.
John Hubbard [Fri, 31 Jan 2020 06:12:47 +0000 (22:12 -0800)]
IB/umem: use get_user_pages_fast() to pin DMA pages
And get rid of the mmap_sem calls, as part of that. Note that
get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.
John Hubbard [Fri, 31 Jan 2020 06:12:43 +0000 (22:12 -0800)]
mm/gup: allow FOLL_FORCE for get_user_pages_fast()
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().
There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.
Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.
Also, remove an unnessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().
Also, now that the the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site. So
remove it.
John Hubbard [Fri, 31 Jan 2020 06:12:36 +0000 (22:12 -0800)]
mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
DAX check requirement on vmas.
However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.
Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.
Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals of
get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.
Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
to Dan Williams for helping clarify the DAX refactoring.
John Hubbard [Fri, 31 Jan 2020 06:12:28 +0000 (22:12 -0800)]
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
An upcoming patch changes and complicates the refcounting and especially
the "put page" aspects of it. In order to keep everything clean,
refactor the devmap page release routines:
* Rename put_devmap_managed_page() to page_is_devmap_managed(), and
limit the functionality to "read only": return a bool, with no side
effects.
* Add a new routine, put_devmap_managed_page(), to handle decrementing
the refcount for ZONE_DEVICE pages.
* Change callers (just release_pages() and put_page()) to check
page_is_devmap_managed() before calling the new
put_devmap_managed_page() routine. This is a performance point:
put_page() is a hot path, so we need to avoid non- inline function calls
where possible.
* Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
limit the functionality to unconditionally freeing a devmap page.
Dan Williams [Fri, 31 Jan 2020 06:12:24 +0000 (22:12 -0800)]
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
After the removal of the device-public infrastructure there are only 2
->page_free() call backs in the kernel. One of those is a
device-private callback in the nouveau driver, the other is a generic
wakeup needed in the DAX case. In the hopes that all ->page_free()
callbacks can be migrated to common core kernel functionality, move the
device-private specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.
Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.
John Hubbard [Fri, 31 Jan 2020 06:12:17 +0000 (22:12 -0800)]
mm/gup: factor out duplicate code from four routines
Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
Overview:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've
been able to runtime test the core get_user_pages() and
pin_user_pages() and related routines, but not so much on several of
the call sites--but those are generally just a couple of lines
changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Runtime testing for the call sites so far is pretty light:
* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
passes.
* infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
This patch (of 22):
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four times,
and wondering if there are subtle differences.
Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.
Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded by
get_page()/put_page().
Also, further simplify (slightly), by waiting until the the successful
end of each routine, to increment *nr.
...and now we have a swap entry that indicates that the page entry
refers to a bad (and poisoned) page of memory, but gup_fast() at this
level of the page table was ignoring swap entries, and incorrectly
assuming that "!pxd_none() == valid and present".
And this was not just a poisoned page problem, but a generaly swap entry
problem. So, any swap entry type (device memory migration, numa
migration, or just regular swapping) could lead to the same problem.
Fix this by checking for pxd_present(), instead of pxd_none().
Ira Weiny [Fri, 31 Jan 2020 06:12:07 +0000 (22:12 -0800)]
mm/filemap.c: clean up filemap_write_and_wait()
At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range()
Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range()
Vlastimil Babka [Fri, 31 Jan 2020 06:12:03 +0000 (22:12 -0800)]
mm/debug.c: always print flags in dump_page()
Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
inadvertently removed printing of page flags for pages that are neither
anon nor ksm nor have a mapping. Fix that.
Using pr_cont() again would be a solution, but the commit explicitly
removed its use. Avoiding the danger of mixing up split lines from
multiple CPUs might be beneficial for near-panic dumps like this, so fix
this without reintroducing pr_cont().
He Zhe [Fri, 31 Jan 2020 06:12:00 +0000 (22:12 -0800)]
mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t
kmemleak_lock as a rwlock on RT can possibly be acquired in atomic
context which does work.
Since the kmemleak operation is performed in atomic context make it a
raw_spinlock_t so it can also be acquired on RT. This is used for
debugging and is not enabled by default in a production like environment
(where performance/latency matters) so it makes sense to make it a
raw_spinlock_t instead trying to get rid of the atomic context. Turn
also the kmemleak_object->lock into raw_spinlock_t which is acquired
(nested) while the kmemleak_lock is held.
The time spent in "echo scan > kmemleak" slightly improved on 64core box
with this patch applied after boot.
Yu Zhao [Fri, 31 Jan 2020 06:11:57 +0000 (22:11 -0800)]
mm/slub.c: avoid slub allocation while holding list_lock
If we are already under list_lock, don't call kmalloc(). Otherwise we
will run into a deadlock because kmalloc() also tries to grab the same
lock.
Fix the problem by using a static bitmap instead.
WARNING: possible recursive locking detected
--------------------------------------------
mount-encrypted/4921 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
other info that might help us debug this:
Possible unsafe locking scenario:
Andy Shevchenko [Fri, 31 Jan 2020 06:11:47 +0000 (22:11 -0800)]
ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use
There are users already and will be more of BITS_TO_BYTES() macro. Move
it to bitops.h for wider use.
In the case of ocfs2 the replacement is identical.
As for bnx2x, there are two places where floor version is used. In the
first case to calculate the amount of structures that can fit one memory
page. In this case obviously the ceiling variant is correct and
original code might have a potential bug, if amount of bits % 8 is not
0. In the second case the macro is used to calculate bytes transmitted
in one microsecond. This will work for all speeds which is multiply of
1Gbps without any change, for the rest new code will give ceiling value,
for instance 100Mbps will give 13 bytes, while old code gives 12 bytes
and the arithmetically correct one is 12.5 bytes. Further the value is
used to setup timer threshold which in any case has its own margins due
to certain resolution. I don't see here an issue with slightly shifting
thresholds for low speed connections, the card is supposed to utilize
highest available rate, which is usually 10Gbps.
Colin Ian King [Fri, 31 Jan 2020 06:11:43 +0000 (22:11 -0800)]
ocfs2/dlm: remove redundant assignment to ret
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Masahiro Yamada [Fri, 31 Jan 2020 06:11:40 +0000 (22:11 -0800)]
ocfs2: make local header paths relative to C files
Gang He reports the failure of building fs/ocfs2/ as an external module
of the kernel installed on the system:
$ cd fs/ocfs2
$ make -C /lib/modules/`uname -r`/build M=`pwd` modules
If you want to make it work reliably, I'd recommend to remove ccflags-y
from the Makefiles, and to make header paths relative to the C files. I
think this is the correct usage of the #include "..." directive.
Aditya Pakki [Fri, 31 Jan 2020 06:11:33 +0000 (22:11 -0800)]
fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres
In the only caller of dlm_migrate_lockres() - dlm_empty_lockres(),
target is checked for O2NM_MAX_NODES. Thus, the assertion in
dlm_migrate_lockres() is unnecessary and can be removed. The patch
eliminates such a check.
Xiong [Fri, 31 Jan 2020 06:11:27 +0000 (22:11 -0800)]
scripts/spelling.txt: add more spellings to spelling.txt
Here are some of the common spelling mistakes and typos that I've found
while fixing up spelling mistakes in the kernel. Most of them still
exist in more than two source files.
Yang Shi [Fri, 31 Jan 2020 06:11:24 +0000 (22:11 -0800)]
mm: move_pages: report the number of non-attempted pages
Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
semantic of move_pages() has changed to return the number of
non-migrated pages if they were result of a non-fatal reasons (usually a
busy page).
This was an unintentional change that hasn't been noticed except for LTP
tests which checked for the documented behavior.
There are two ways to go around this change. We can even get back to
the original behavior and return -EAGAIN whenever migrate_pages is not
able to migrate pages due to non-fatal reasons. Another option would be
to simply continue with the changed semantic and extend move_pages
documentation to clarify that -errno is returned on an invalid input or
when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the
number of pages that couldn't have been migrated due to ephemeral
reasons (e.g. page is pinned or locked for other reasons).
This patch implements the second option because this behavior is in
place for some time without anybody complaining and possibly new users
depending on it. Also it allows to have a slightly easier error
handling as the caller knows that it is worth to retry when err > 0.
But since the new semantic would be aborted immediately if migration is
failed due to ephemeral reasons, need include the number of
non-attempted pages in the return value too.
Wei Yang [Fri, 31 Jan 2020 06:11:20 +0000 (22:11 -0800)]
mm: thp: don't need care deferred split queue in memcg charge move path
If compound is true, this means it is a PMD mapped THP. Which implies
the page is not linked to any defer list. So the first code chunk will
not be executed.
Also with this reason, it would not be proper to add this page to a
defer list. So the second code chunk is not correct.
Based on this, we should remove the defer list related code.
The daxctl unit test for the dax_kmem driver currently triggers the
(false positive) lockdep splat below. It results from the fact that
remove_memory_block_devices() is invoked under the mem_hotplug_lock()
causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
active state tracking). It is a false positive because the sysfs
attribute path triggering the memory remove is not the same attribute
path associated with memory-block device.
sysfs_break_active_protection() is not applicable since there is no real
deadlock conflict, instead move memory-block device removal outside the
lock. The mem_hotplug_lock() is not needed to synchronize the
memory-block device removal vs the page online state, that is already
handled by lock_device_hotplug(). Specifically, lock_device_hotplug()
is sufficient to allow try_remove_memory() to check the offline state of
the memblocks and be assured that any in progress online attempts are
flushed / blocked by kernfs_drain() / attribute removal.
The add_memory() path safely creates memblock devices under the
mem_hotplug_lock(). There is no kernfs active state synchronization in
the memblock device_register() path, so nothing to fix there.
This change is only possible thanks to the recent change that refactored
memory block device removal out of arch_remove_memory() (commit 4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before
arch_remove_memory()"), and David's due diligence tracking down the
guarantees afforded by kernfs_drain(). Not flagged for -stable since
this only impacts ongoing development and lockdep validation, not a
runtime issue.
======================================================
WARNING: possible circular locking dependency detected
5.5.0-rc3+ #230 Tainted: G OE
------------------------------------------------------
lt-daxctl/6459 is trying to acquire lock: ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
but task is already holding lock: ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
Wei Yang [Fri, 31 Jan 2020 06:11:14 +0000 (22:11 -0800)]
mm/migrate.c: also overwrite error when it is bigger than zero
If we get here after successfully adding page to list, err would be 1 to
indicate the page is queued in the list.
Current code has two problems:
* on success, 0 is not returned
* on error, if add_page_for_migratioin() return 1, and the following err1
from do_move_pages_to_node() is set, the err1 is not returned since err
is 1
Pingfan Liu [Fri, 31 Jan 2020 06:11:10 +0000 (22:11 -0800)]
mm/sparse.c: reset section's mem_map when fully deactivated
After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
when a mem section is fully deactivated, section_mem_map still records
the section's start pfn, which is not used any more and will be
reassigned during re-addition.
In analogy with alloc/free pattern, it is better to clear all fields of
section_mem_map.
Beside this, it breaks the user space tool "makedumpfile" [1], which
makes assumption that a hot-removed section has mem_map as NULL, instead
of checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile
will be better to change the assumption, and need a patch)
The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" ,
trigger a crash, and save vmcore by makedumpfile
Dan Carpenter [Fri, 31 Jan 2020 06:11:07 +0000 (22:11 -0800)]
mm/mempolicy.c: fix out of bounds write in mpol_parse_str()
What we are trying to do is change the '=' character to a NUL terminator
and then at the end of the function we restore it back to an '='. The
problem is there are two error paths where we jump to the end of the
function before we have replaced the '=' with NUL.
We end up putting the '=' in the wrong place (possibly one element
before the start of the buffer).
Theodore Ts'o [Fri, 31 Jan 2020 06:11:04 +0000 (22:11 -0800)]
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown. So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.