ZhangPeng [Tue, 6 Jun 2023 06:20:12 +0000 (14:20 +0800)]
mm/hugetlb: use a folio in hugetlb_wp()
We can replace nine implicit calls to compound_head() with one by using
old_folio. The page we get back is always a head page, so we just convert
old_page to old_folio.
ZhangPeng [Tue, 6 Jun 2023 06:20:11 +0000 (14:20 +0800)]
mm/hugetlb: use a folio in copy_hugetlb_page_range()
Patch series "Convert several functions in hugetlb.c to use a folio", v2.
This patch series converts three functions in hugetlb.c to use a folio,
which can remove several implicit calls to compound_head().
This patch (of 3):
We can replace five implicit calls to compound_head() with one by using
pte_folio. The page we get back is always a head page, so we just convert
ptepage to pte_folio.
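As an illustration of the conversion pattern used throughout this series (a sketch, not code from the patches): look the folio up once with page_folio() and then use folio helpers, instead of repeatedly passing the struct page through helpers that each hide a compound_head() call.

#include <linux/mm.h>
#include <linux/swap.h>

/* Hypothetical helper, for illustration only. */
static void folio_conversion_pattern(struct page *ptepage)
{
        struct folio *pte_folio = page_folio(ptepage);

        folio_get(pte_folio);                   /* was get_page(ptepage) */
        if (folio_test_anon(pte_folio))         /* was PageAnon(ptepage) */
                folio_mark_accessed(pte_folio); /* was mark_page_accessed() */
        folio_put(pte_folio);                   /* was put_page(ptepage) */
}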
John Hubbard [Tue, 6 Jun 2023 07:16:35 +0000 (00:16 -0700)]
selftests/mm: move certain uffd*() routines from vm_util.c to uffd-common.c
There are only three uffd*() routines that are used outside of the uffd
selftests. Leave these in vm_util.c, where they are available to any mm
selftest program:
A few other uffd*() routines, however, are only used by the uffd-focused
tests found in uffd-stress.c and uffd-unit-tests.c. Move those routines
into uffd-common.c.
John Hubbard [Tue, 6 Jun 2023 07:16:34 +0000 (00:16 -0700)]
selftests/mm: fix build failures due to missing MADV_COLLAPSE
MADV_PAGEOUT, MADV_POPULATE_READ, MADV_COLLAPSE are conditionally
defined as necessary. However, that was being done in .c files, and a
new build failure came up that would have been automatically avoided had
these been in a common header file.
So consolidate and move them all to vm_util.h, which fixes the build
failure.
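For reference, a sketch of what the consolidated header ends up looking like; the numeric values are the asm-generic ones and should be read as an assumption here, not a quote of the patch:

/* In vm_util.h: provide fallbacks when the system headers are too old. */
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT            21
#endif
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ      22
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE           25
#endif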
An alternative approach from Muhammad Usama Anjum was: rely on "make
headers" being required, and include asm-generic/mman-common.h. This
works in the sense that it builds, but it still generates warnings about
duplicate MADV_* symbols, and the goal here is to get a fully clean (no
warnings) build.
John Hubbard [Tue, 6 Jun 2023 07:16:33 +0000 (00:16 -0700)]
selftests/mm: fix a "possibly uninitialized" warning in pkey-x86.h
This fixes a real bug, too, because xstate_size() was assuming that
the stack variable xstate_size was initialized to zero. That's neither
guaranteed nor especially likely.
John Hubbard [Tue, 6 Jun 2023 07:16:32 +0000 (00:16 -0700)]
selftests/mm: fix two -Wformat-security warnings in uffd builds
The uffd tests generate two compile time warnings from clang's
-Wformat-security setting. These trigger at the call sites for
uffd_test_start() and uffd_test_skip().
1) Fix the uffd_test_start() issue by removing the intermediate
test_name variable (thanks to David Hildenbrand for showing how to do
this).
2) Fix the uffd_test_skip() issue by observing that there is no need for
a macro and a variable args approach, because all callers of
uffd_test_skip() pass in a simple char* string, without any format
specifiers. So just change uffd_test_skip() into a regular C function.
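For reference, a minimal example of the warning class involved (not code from the selftests): -Wformat-security fires when a non-literal string is used as the format argument.

#include <stdio.h>

void print_msg(const char *msg)
{
        printf(msg);            /* warns: format string is not a literal */
        printf("%s", msg);      /* clean: the string is treated as data  */
}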
John Hubbard [Tue, 6 Jun 2023 07:16:30 +0000 (00:16 -0700)]
selftests/mm: fix invocation of tests that are run via shell scripts
We cannot depend upon git to reliably retain the executable bit on shell
scripts, or so I was told several years ago while working on this same
run_vmtests.sh script. And sure enough, things such as test_hmm.sh are
lately failing to run, due to lacking execute permissions.
Fix this by explicitly adding "bash" to each of the shell script
invocations. Leave fixing the overall approach to another day.
John Hubbard [Tue, 6 Jun 2023 07:16:29 +0000 (00:16 -0700)]
selftests/mm: fix "warning: expression which evaluates to zero..." in mlock2-tests.c
The stop variable is a char*, and the code was assigning a char value to
it. This was generating a warning when compiling with clang.
However, as both David and Peter pointed out, stop is not even used
after the problematic assignment to a char type. So just delete that
line entirely.
John Hubbard [Tue, 6 Jun 2023 07:16:28 +0000 (00:16 -0700)]
selftests/mm: fix unused variable warnings in hugetlb-madvise.c, migration.c
Dummy variables are required in order to make these two (similar)
routines work, so in both cases, declare the variables as volatile in
order to avoid the clang compiler warning.
Furthermore, in order to ensure that each test actually does what is
intended, add an asm volatile invocation (thanks to David Hildenbrand
for the suggestion), with a clarifying comment so that it survives
future maintenance.
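A sketch of the resulting pattern, with a hypothetical name: the volatile dummy plus an empty asm taking the value as an input keeps the read alive without tripping unused-variable warnings.

static void touch_memory(char *addr)
{
        volatile char dummy;

        dummy = *addr;                  /* really read the page */
        /* Make sure the compiler cannot elide the read above. */
        asm volatile("" : : "r"(dummy));
}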
John Hubbard [Tue, 6 Jun 2023 07:16:27 +0000 (00:16 -0700)]
selftests/mm: fix uffd-stress unused function warning
Patch series "A minor flurry of selftest/mm fixes", v3.
A series that fixes up build errors and warnings for at least the 64-bit
builds on x86 with clang.
The series also includes an optional "improvement" of moving some uffd
code into uffd-common.[ch], which is proving to be somewhat controversial,
and so if that doesn't get resolved, then patches 9 and 10 may just get
dropped. They are not required in order to get a clean build, now that
"make headers" is happening.
Zhaoyang Huang [Wed, 31 May 2023 02:51:01 +0000 (10:51 +0800)]
mm: skip CMA pages when they are not available
This patch fixes unproductive reclaiming of CMA pages by skipping them
when they are not available to the current context. It arises from the
OOM issue below, which was caused by a large proportion of MIGRATE_CMA
pages among the free pages.
Lorenzo Stoakes [Thu, 4 May 2023 21:27:53 +0000 (22:27 +0100)]
mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.
This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as we
have already done in the slow path.
We have access to less information in the fast path as we cannot examine
the VMA containing the mapping; however, we can determine whether the folio
is anonymous or belongs to a whitelisted filesystem - specifically
hugetlb and shmem mappings.
We take special care to ensure that both the folio and mapping are safe to
access when performing these checks and document folio_fast_pin_allowed()
accordingly.
It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast call, let alone with FOLL_LONGTERM, so we can
always rely on the fact that if we fail to pin on the fast path, the code
will fall back to the slow path which can perform the more thorough check.
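A condensed sketch of the kind of check being added (simplified; the real folio_fast_pin_allowed() additionally takes care that the folio and its mapping remain safe to dereference under GUP-fast, which this sketch glosses over):

#include <linux/mm.h>
#include <linux/shmem_fs.h>

static bool longterm_fast_pin_allowed(struct folio *folio)
{
        struct address_space *mapping;

        /* Anonymous and hugetlb folios are always fine. */
        if (folio_test_anon(folio) || folio_test_hugetlb(folio))
                return true;

        /* Otherwise only whitelisted filesystems (here: shmem). */
        mapping = folio_mapping(folio);
        return mapping && shmem_mapping(mapping);
}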
Lorenzo Stoakes [Thu, 4 May 2023 21:27:52 +0000 (22:27 +0100)]
mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done unexpectedly.
For example, consider the following scenario:
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This results in both data being written to a folio without writenotify,
and the folio being dirtied unexpectedly (if the caller decides to do so).
This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.
This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not,
such as shmem or hugetlb, are not at risk and therefore can be written to
without issue.
Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.
However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.
These mappings are considerably more likely to be written to after folios
are cleaned and thus simply must not be permitted to do so.
This patch changes only the slow-path GUP functions, a following patch
adapts the GUP-fast path along similar lines.
Lorenzo Stoakes [Thu, 4 May 2023 21:27:51 +0000 (22:27 +0100)]
mm/mmap: separate writenotify and dirty tracking logic
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done unexpectedly.
For example, consider the following scenario:
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks explicitly for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
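A rough sketch of the separation; the helper name and the exact set of checks are assumptions for illustration, not the literal patch:

#include <linux/mm.h>

/* Does this VMA's filesystem track dirtying of its folios? */
static bool vma_is_dirty_tracked(struct vm_area_struct *vma)
{
        /* Only shared, writable file mappings are ever dirty tracked. */
        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) != (VM_WRITE | VM_SHARED))
                return false;
        if (!vma->vm_file)
                return false;

        /* Dirty tracking implies a write-notify hook. */
        return vma->vm_ops && (vma->vm_ops->page_mkwrite ||
                               vma->vm_ops->pfn_mkwrite);
}

The idea is that vma_wants_writenotify() keeps only the PTE-flag specific conditions (soft-dirty, VM_PFNMAP handling and so on) on top of a dirty-tracking predicate of roughly this shape, which GUP can then reuse on its own.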
Ryan Roberts [Fri, 2 Jun 2023 09:29:49 +0000 (10:29 +0100)]
mm: fix failure to unmap pte on highmem systems
The loser of a race to service a pte for a device private entry in the
swap path previously unlocked the ptl, but failed to unmap the pte. This
only affects highmem systems since unmapping a pte is a noop on
non-highmem systems.
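A sketch of the shape of the fix (illustrative, simplified from the device-private handling in the swap fault path, not the literal hunk): after losing the race, the kmap of the PTE page must be dropped together with the lock, which is exactly what pte_unmap_unlock() does; pte_unmap() itself is a no-op without highmem, hence the highmem-only breakage.

#include <linux/mm.h>

static bool pte_still_matches(struct vm_fault *vmf)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &ptl);
        if (!pte_same(ptep_get(pte), vmf->orig_pte)) {
                /* Lost the race: was spin_unlock(ptl), which leaked the kmap. */
                pte_unmap_unlock(pte, ptl);
                return false;
        }
        pte_unmap_unlock(pte, ptl);
        return true;
}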
Ryan Roberts [Fri, 2 Jun 2023 09:29:48 +0000 (10:29 +0100)]
mm/damon/ops-common: refactor to use {pte|pmd}p_clear_young_notify()
With the fix in place to atomically test and clear young on ptes and pmds,
simplify the code to handle the clearing for both the primary mmu and the
mmu notifier with a single API call.
Ryan Roberts [Fri, 2 Jun 2023 09:29:47 +0000 (10:29 +0100)]
mm/damon/ops-common: atomically test and clear young on ptes and pmds
It is racy to non-atomically read a pte, then clear the young bit, then
write it back as this could discard dirty information. Further, it is bad
practice to directly set a pte entry within a table. Instead, clearing
young must go through the arch-provided helper,
ptep_test_and_clear_young(), to ensure it is modified atomically and to
give the arch code visibility and allow it to check (and potentially
modify) the operation.
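A before/after sketch (simplified) of the pattern being replaced:

#include <linux/mm.h>
#include <linux/pgtable.h>

static bool clear_young_sketch(struct vm_area_struct *vma,
                               unsigned long addr, pte_t *pte)
{
        /*
         * Racy version being removed:
         *      pte_t entry = *pte;
         *      *pte = pte_mkold(entry);  -- can lose a concurrent hw update
         *
         * Atomic replacement via the arch helper:
         */
        return ptep_test_and_clear_young(vma, addr, pte);
}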
Ryan Roberts [Fri, 2 Jun 2023 09:29:46 +0000 (10:29 +0100)]
mm: vmalloc must set pte via arch code
Patch series "Fixes for pte encapsulation bypasses", v3.
A series to improve the encapsulation of pte entries by disallowing
non-arch code from directly dereferencing pte_t pointers.
This patch (of 4):
It is bad practice to directly set pte entries within a pte table.
Instead all modifications must go through arch-provided helpers such as
set_pte_at() to give the arch code visibility and allow it to check (and
potentially modify) the operation.
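The rule in code form, as a sketch (the names are illustrative, not the vmalloc hunk itself):

#include <linux/mm.h>
#include <linux/pgtable.h>

static void install_pte_sketch(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, struct page *page, pgprot_t prot)
{
        pte_t entry = mk_pte(page, prot);

        /* *ptep = entry;  -- bypasses the architecture, do not do this */
        set_pte_at(mm, addr, ptep, entry);      /* arch sees and can check it */
}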
Marcelo Tosatti [Tue, 30 May 2023 14:52:35 +0000 (11:52 -0300)]
vmstat: allow_direct_reclaim should use zone_page_state_snapshot
A customer provided evidence indicating that a process
was stalled in direct reclaim:
- The process was trapped in throttle_direct_reclaim().
The function wait_event_killable() was called to wait for the condition
allow_direct_reclaim(pgdat) to become true for the current node.
allow_direct_reclaim(pgdat) examined the number of free pages
on the node via zone_page_state(), which just returns the value in
zone->vm_stat[NR_FREE_PAGES].
- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
However, the freelist on this node was not empty.
- This inconsistency in the vmstat value was caused by the percpu vmstat on
nohz_full cpus. Every increment/decrement of vmstat is performed
on the percpu vmstat counter first, and the pending diffs are then folded
into the zone's vmstat counter periodically. However, on nohz_full
cpus (48 of the 52 cpus on this customer's system) these pending
diffs were no longer folded in once the cpu had no events and
started sleeping indefinitely.
I checked the percpu vmstat and found a total of 69 counts that had not
yet been folded into the zone's vmstat counter.
- In this situation, kswapd did not help the trapped process.
In pgdat_balanced(), zone_watermark_ok_safe() examined the number
of free pages on the node by zone_page_state_snapshot(), which
includes the pending counts in the percpu vmstat.
Therefore kswapd correctly saw the 69 free pages.
Since zone->_watermark = {8, 20, 32}, kswapd did not act because
69 was above the high watermark of 32.
Change allow_direct_reclaim to use zone_page_state_snapshot, which
allows a more precise version of the vmstat counters to be used.
allow_direct_reclaim will only be called from try_to_free_pages,
which is not a hot path.
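A sketch of the difference (heavily simplified; the real allow_direct_reclaim() sums watermarks and free pages across the node's zones): the _snapshot variant folds in the per-CPU deltas, so unflushed diffs on nohz_full CPUs can no longer make the node look out of memory.

#include <linux/mmzone.h>
#include <linux/vmstat.h>

static bool zone_has_free_pages_sketch(struct zone *zone)
{
        /* was: zone_page_state(zone, NR_FREE_PAGES) */
        unsigned long free = zone_page_state_snapshot(zone, NR_FREE_PAGES);

        return free > min_wmark_pages(zone);
}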
Testing: Due to difficulties accessing the system, it has not been
possible for the reproducer to test the patch (however it's
clear from the available data and analysis that it should fix the issue).
iomap: use kiocb_write_and_wait and kiocb_invalidate_pages
Use the common helpers for direct I/O page invalidation instead of open
coding the logic. This leads to a slight reordering of checks in
__iomap_dio_rw to keep the logic straight.
Factor out a helper that calls filemap_write_and_wait_range and
invalidate_inode_pages2_range for the range covered by a write kiocb or
returns -EAGAIN if the kiocb is marked as nowait and there would be pages
to write or invalidate.
Factor out a helper that does filemap_write_and_wait_range for the range
covered by a read kiocb, or returns -EAGAIN if the kiocb is marked as
nowait and there would be pages to write.
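A rough sketch of the write-side helper's shape; the name and, in particular, the exact nowait policy are assumptions drawn from the description above rather than the final code. The read-side variant simply omits the invalidation step.

#include <linux/fs.h>
#include <linux/pagemap.h>

static int kiocb_invalidate_pages_sketch(struct kiocb *iocb, size_t count)
{
        struct address_space *mapping = iocb->ki_filp->f_mapping;
        loff_t pos = iocb->ki_pos;
        loff_t end = pos + count - 1;
        int ret;

        if (iocb->ki_flags & IOCB_NOWAIT) {
                /* Would have to block: report -EAGAIN instead. */
                if (filemap_range_has_page(mapping, pos, end))
                        return -EAGAIN;
        } else {
                ret = filemap_write_and_wait_range(mapping, pos, end);
                if (ret)
                        return ret;
        }
        return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                                             end >> PAGE_SHIFT);
}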
Patch series "cleanup the filemap / direct I/O interaction", v4.
This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O. This is a
spinoff from the no-bufferhead kernel project, for which we'll want to
use an iomap-based buffered write path in the block layer.
This patch (of 12):
The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related
functions"). Remove the field and all assignments to it.
This update addresses an issue with the zswap reclaim mechanism that
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by a basic benchmark test. For the test, a kernel build was used
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results presented below are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits its limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclamation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain in memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
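A sketch of the reworked shrink worker for change (a); the zswap-internal names used here (zswap_reclaim_entry(), zswap_can_accept(), the shrink_work member) are assumptions for illustration:

static void shrink_worker_sketch(struct work_struct *w)
{
        struct zswap_pool *pool = container_of(w, struct zswap_pool,
                                               shrink_work);

        /* Keep reclaiming until we are back under the acceptance
         * threshold, instead of writing back a single page per run. */
        do {
                if (zswap_reclaim_entry(pool))
                        break;          /* nothing more to write back */
                cond_resched();
        } while (!zswap_can_accept());
}

Change (b) sits on the store side: the worker is queued not only when the pool is completely full, but whenever zswap_can_accept() fails.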
Testing this suggested update showed much better numbers:
Xin Hao [Wed, 31 May 2023 09:58:17 +0000 (17:58 +0800)]
mm: khugepaged: avoid pointless allocation for "struct mm_slot"
In __khugepaged_enter(), if the MMF_VM_HUGEPAGE bit in "mm->flags" is
already set, the freshly allocated "mm_slot" is released and the function
returns, so we can call mm_slot_alloc() after test_and_set_bit() instead.
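A sketch of the reordering (simplified; the slot and cache names follow the khugepaged internals and should be read as assumptions, with error handling trimmed):

static int khugepaged_enter_sketch(struct mm_struct *mm)
{
        struct mm_slot *slot;

        /* If the mm is already tracked there is nothing to allocate. */
        if (test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))
                return 0;

        /* Previously this allocation happened before the bit test and
         * was simply thrown away in the already-tracked case. */
        slot = mm_slot_alloc(mm_slot_cache);
        if (!slot)
                return -ENOMEM;

        /* ... insert the slot into khugepaged's scan list ... */
        return 0;
}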
Tetsuo Handa [Sun, 14 May 2023 00:28:56 +0000 (09:28 +0900)]
mm/page_alloc: don't wake kswapd from rmqueue() unless __GFP_KSWAPD_RECLAIM is specified
Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
using ZONE_BOOSTED_WATERMARK flag.
Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
__GFP_KSWAPD_RECLAIM) should wake kswapd, because callers are supposed to
drop __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait carries a
risk of deadlock. But since zone->flags is a shared variable, a thread
making a !__GFP_KSWAPD_RECLAIM allocation request might observe this flag
being set immediately after another thread making a __GFP_KSWAPD_RECLAIM
allocation request set it, opening up the possibility of deadlock.
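A condensed sketch of the resulting check in rmqueue() (paraphrased, not the literal hunk):

static void maybe_wake_kswapd(struct zone *zone, unsigned int alloc_flags)
{
        /* Only contexts that carry __GFP_KSWAPD_RECLAIM (ALLOC_KSWAPD)
         * are allowed to act on the shared boost flag. */
        if (!(alloc_flags & ALLOC_KSWAPD))
                return;

        if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
                clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
                wakeup_kswapd(zone, 0, 0, zone_idx(zone));
        }
}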
THP: avoid lock when checking whether THP is in deferred list
free_transhuge_page() acquires the split queue lock and then checks whether
the THP was added to the deferred list or not. This causes high deferred
queue lock contention.
It's safe to check whether the THP is in the deferred list without
holding the deferred queue lock in free_transhuge_page(), because by the
time the code hits free_transhuge_page(), no one is trying to add the folio
to _deferred_list.
Running page_fault1 of will-it-scale + order 2 folio for anonymous
mapping with 96 processes on an Ice Lake 48C/96T test box, we could
see 61% split_queue_lock contention:
- 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
- 63.01% free_transhuge_page
+ 62.91% _raw_spin_lock_irqsave
With this patch applied, the split_queue_lock contention is less
than 1%.
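A sketch of the lockless check described above (simplified; the helper and field names follow the huge_memory.c internals and should be read as assumptions):

static void free_transhuge_page_sketch(struct page *page)
{
        struct folio *folio = page_folio(page);
        struct deferred_split *ds_queue = get_deferred_split_queue(folio);
        unsigned long flags;

        /* Lockless peek: only take the lock if the folio is queued. */
        if (!list_empty(&folio->_deferred_list)) {
                spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                if (!list_empty(&folio->_deferred_list)) {
                        ds_queue->split_queue_len--;
                        list_del(&folio->_deferred_list);
                }
                spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
        }

        free_compound_page(page);
}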
Huang Ying [Mon, 29 May 2023 06:13:55 +0000 (14:13 +0800)]
swap: comment get_swap_device() with usage rule
The general rule for using a swap entry is as follows.
When we get a swap entry, if there isn't some other way to prevent
swapoff, such as the folio in the swap cache being locked, the page table
lock being held, etc., the swap entry may become invalid because of swapoff.
In that case, we need to enclose all swap-related function calls with
get_swap_device() and put_swap_device(), unless the swap functions
call get/put_swap_device() themselves.
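A usage sketch of the rule (the raw swap_map peek is illustrative only):

#include <linux/swap.h>
#include <linux/swapops.h>

static int swap_entry_peek_sketch(swp_entry_t entry)
{
        struct swap_info_struct *si;
        int count;

        si = get_swap_device(entry);
        if (!si)
                return -ENOENT;         /* raced with swapoff, entry is stale */

        count = READ_ONCE(si->swap_map[swp_offset(entry)]);

        put_swap_device(si);
        return count;
}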
Huang Ying [Mon, 29 May 2023 06:13:53 +0000 (14:13 +0800)]
swap: remove __swp_swapcount()
__swp_swapcount() just wraps the call to swap_swapcount() with
get/put_swap_device(). It is only called from __read_swap_cache_async(),
which already wraps the call with get/put_swap_device(). So,
__read_swap_cache_async() can call swap_swapcount() directly.
Huang Ying [Mon, 29 May 2023 06:13:52 +0000 (14:13 +0800)]
swap, __read_swap_cache_async(): enlarge get/put_swap_device protection range
This makes the function a little easier to understand because we don't
need to consider swapoff. It also makes it possible to remove the
get/put_swap_device() calls in some functions called by
__read_swap_cache_async().
Huang Ying [Mon, 29 May 2023 06:13:51 +0000 (14:13 +0800)]
swap: remove get/put_swap_device() in __swap_count()
Patch series "swap: cleanup get/put_swap_device() usage", v3.
The general rule for using a swap entry is as follows.
When we get a swap entry, if there isn't some other way to prevent
swapoff, such as the folio in the swap cache being locked, the page table
lock being held, etc., the swap entry may become invalid because of swapoff.
In that case, we need to enclose all swap-related function calls with
get_swap_device() and put_swap_device(), unless the swap functions call
get/put_swap_device() themselves.
Based on the above rule, all get/put_swap_device() usage are checked and
cleaned up if necessary.
This patch (of 5):
get/put_swap_device() were added to __swap_count() in commit eb085574a752 ("mm, swap: fix race between swapoff and some swap
operations"). Later, in commit 2799e77529c2 ("swap: fix
do_swap_page() race with swapoff"), get/put_swap_device() were added to
do_swap_page(), and they enclose the only call site of
__swap_count(). So, it's safe to remove get/put_swap_device() from
__swap_count() now.
Haifeng Xu [Fri, 26 May 2023 08:52:51 +0000 (08:52 +0000)]
mm/mm_init.c: do not calculate zone_start_pfn/zone_end_pfn in zone_absent_pages_in_node()
In calculate_node_totalpages(), zone_start_pfn/zone_end_pfn are already
calculated in zone_spanned_pages_in_node(), so pass them as parameters
instead of node_start_pfn/node_end_pfn, and the duplicated calculation
process can be dropped.
Currently, no matter whether a node actually has memory or not,
calculate_node_totalpages() is used to account for the number of pages in
each zone/node. However, for a node without memory, these unnecessary
calculations can be skipped. All the zone/node page counts can be set to
0 directly. So introduce reset_memoryless_node_totalpages() to perform
this action.
Furthermore, calculate_node_totalpages() then only gets called for nodes
with memory.
SeongJae Park [Thu, 25 May 2023 21:43:08 +0000 (21:43 +0000)]
Docs/mm/damon/design: update the layout based on the layers
The DAMON design document describes only the operations set layer and the
monitoring part of the core logic. Update the layout based on DAMON's
layers, so that more parts of DAMON, including the DAMOS core logic and
DAMON modules, can easily be added.
SeongJae Park [Thu, 25 May 2023 21:43:05 +0000 (21:43 +0000)]
Docs/mm/damon/faq: remove old questions
Patch series "Docs/mm/damon: Minor fixes and design doc update".
Some of the DAMON documents are outdated or have minor typos or grammar
errors. In particular, the design doc has not been updated for DAMOS, which
is an important part of DAMON. Fix the minor issues and update the documents.
This patch (of 10):
The first two questions of the DAMON FAQ were raised when the DAMON patches
were first submitted. More than a year has passed since the DAMON patches
were merged into the mainline, and those kinds of questions are not asked
nowadays. Remove the questions.
Kalesh Singh [Tue, 23 May 2023 20:59:21 +0000 (13:59 -0700)]
Multi-gen LRU: fix workingset accounting
On Android app cycle workloads, MGLRU showed a significant reduction in
workingset refaults although pgpgin/pswpin remained relatively unchanged.
This indicated MGLRU may be undercounting workingset refaults.
This has an impact on userspace programs, like Android's LMKD, that monitor
workingset refault statistics to detect thrashing.
It was found that refaults were only accounted if the MGLRU shadow entry
was for a recently evicted folio. However, recently evicted folios should
be accounted as workingset activation, and refaults should be accounted
regardless of recency.
Fix MGLRU's workingset refault and activation accounting to more closely
match that of the conventional active/inactive LRU.
Peng Zhang [Wed, 24 May 2023 03:12:43 +0000 (11:12 +0800)]
maple_tree: add mas_wr_new_end() to calculate new_end accurately
The previous new_end calculation is inaccurate, because it assumes that
two new pivots must always be added (which is not the case), so sometimes it
misses the fast path and enters the slow path. Add mas_wr_new_end() to
calculate new_end accurately and make the conditions for entering the fast
path more precise.
Lars R. Damerow [Wed, 24 May 2023 18:17:33 +0000 (11:17 -0700)]
mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg
This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export
memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's
watermark.
We allocate jobs to our compute farm using heuristics determined by memory
and swap usage from previous jobs. Tracking the peak swap usage for new
jobs is important for determining when jobs are exceeding their expected
bounds, or when our baseline metrics are getting outdated.
Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file
in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current"
would give less accurate results as well as add complication to the code.
Having this watermark exposed in sysfs is much preferred.
Tu Jinjiang [Thu, 25 May 2023 03:16:40 +0000 (11:16 +0800)]
mm: shmem: fix UAF bug in shmem_show_options()
shmem_show_options() uses sbinfo->mpol without taking a reference on it.
This may lead to a race with replacement of the mpol by remount. The
execution sequence is as follows.
The buggy address belongs to the object at ffff888124324000
which belongs to the cache numa_policy of size 32
The buggy address is located 4 bytes inside of
freed 32-byte region [ffff888124324000, ffff888124324020)
==================================================================
To fix the bug, shmem_get_sbmpol() / mpol_put() need to be called
before / after the shmem_show_mpol() call.
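A sketch of the fixed shape (shmem_get_sbmpol() and shmem_show_mpol() are static helpers in mm/shmem.c; the wrapper here is purely illustrative):

static void shmem_show_mpol_fixed(struct seq_file *seq,
                                  struct shmem_sb_info *sbinfo)
{
        struct mempolicy *mpol;

        mpol = shmem_get_sbmpol(sbinfo);        /* takes a reference (may be NULL) */
        shmem_show_mpol(seq, mpol);
        mpol_put(mpol);                         /* drop it after printing */
}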
Baolin Wang [Thu, 25 May 2023 12:54:01 +0000 (20:54 +0800)]
mm: compaction: skip fast freepages isolation if enough freepages are isolated
I've observed that fast isolation often isolates more pages than
cc->migratepages, and the excess freepages will be released back to the
buddy system. So skip fast freepages isolation if enough freepages are
isolated to save some CPU cycles.
Baolin Wang [Thu, 25 May 2023 12:54:00 +0000 (20:54 +0800)]
mm: compaction: add trace event for fast freepages isolation
fast_isolate_freepages() can also isolate freepages, but we cannot
currently see how efficient that fast isolation is, or the pressure it is
under. So add a trace event that reports some numbers to help understand
the efficiency of fast freepages isolation.
Baolin Wang [Thu, 25 May 2023 12:53:58 +0000 (20:53 +0800)]
mm: compaction: skip more fully scanned pageblock
fast_isolate_around() assumes the pageblock is fully scanned if
cc->nr_freepages < cc->nr_migratepages after trying to isolate some free
pages, and sets the skip flag to avoid scanning it again. However, this
can miss setting the skip flag for a fully scanned pageblock (the returned
'start_pfn' is equal to 'end_pfn') in the case where cc->nr_freepages is
larger than cc->nr_migratepages.
So using the returned 'start_pfn' from isolate_freepages_block() and
'end_pfn' to decide whether a pageblock is fully scanned makes more sense.
It also covers the case where cc->nr_freepages < cc->nr_migratepages, in
which the 'start_pfn' is usually equal to 'end_pfn' unless some
uncommon fatal error occurs after non-strict mode isolation.
Thomas Gleixner [Thu, 25 May 2023 12:57:09 +0000 (14:57 +0200)]
mm/vmalloc: dont purge usable blocks unnecessarily
Purging fragmented blocks is done unconditionally in several contexts:
1) From drain_vmap_area_work(), when the number of lazy-to-be-freed
vmap_areas reaches the threshold
2) Reclaiming vmalloc address space from pcpu_get_vm_areas()
3) _vm_unmap_aliases()
#1 There is no reason to zap fragmented vmap blocks unconditionally, simply
because reclaiming all lazy areas drains at least
32MB * fls(num_online_cpus())
per invocation which is plenty.
#2 Reclaiming when running out of space or due to memory pressure makes a
lot of sense
#3 _vm_unmap_aliases() has to touch everything because the caller has no
clue which vmap_area used a particular page last and the vmap_area lost
that information too.
Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
vmap area first and then cares about the flush. That in turn requires
a full walk of _all_ vmap areas including the one which was just
added to the purge list.
But as this has to be flushed anyway this is an opportunity to combine
outstanding TLB flushes and do the housekeeping of purging freed areas,
but like #1 there is no real good reason to zap usable vmap blocks
unconditionally.
Add a @force_purge argument to the newly split out block purge function
and, if it is not true, only purge fragmented blocks which have less than
1/4 of their capacity left.
Rename purge_vmap_area_lazy() to reclaim_and_purge_vmap_areas() to make it
clear what the function does.
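A sketch of the resulting policy, with the capacity test paraphrased from the description above (VMAP_BBMAP_BITS is the block's total capacity in allocation units; field names follow the vmap_block layout and are assumptions here):

static bool should_purge_block(struct vmap_block *vb, bool force_purge)
{
        /* Only blocks with no live allocations left are candidates. */
        if (vb->free + vb->dirty != VMAP_BBMAP_BITS ||
            vb->dirty == VMAP_BBMAP_BITS)
                return false;

        /* Unless forced, keep blocks with 1/4 or more of their
         * capacity still usable. */
        return force_purge || vb->free < VMAP_BBMAP_BITS / 4;
}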
Thomas Gleixner [Thu, 25 May 2023 12:57:05 +0000 (14:57 +0200)]
mm/vmalloc: prevent flushing dirty space over and over
vmap blocks which have active mappings cannot be purged. Allocations
which have been freed are accounted for in vmap_block::dirty_min/max, so
that they can be detected in _vm_unmap_aliases() as potentially stale
TLBs.
If there are several invocations of _vm_unmap_aliases() then each of them
will flush the dirty range. That's pointless and just increases the
probability of full TLB flushes.
Avoid that by resetting the flush range after accounting for it. That's
safe versus other invocations of _vm_unmap_aliases() because this is all
serialized with vmap_purge_lock.
Thomas Gleixner [Thu, 25 May 2023 12:57:03 +0000 (14:57 +0200)]
mm/vmalloc: prevent stale TLBs in fully utilized blocks
Patch series "mm/vmalloc: Assorted fixes and improvements", v2.
this series addresses the following issues:
1) Prevent the stale TLB problem related to fully utilized vmap blocks
2) Avoid the double per CPU list walk in _vm_unmap_aliases()
3) Avoid flushing dirty space over and over
4) Add a lockless quickcheck in vb_alloc() and add missing
READ/WRITE_ONCE() annotations
5) Prevent overeager purging of usable vmap_blocks if
not under memory/address space pressure.
This patch (of 6):
_vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
page are left in the system. This is required due to the lazy TLB flush
mechanism in vmalloc.
It tries to achieve this by walking the per CPU free lists, but those do
not contain fully utilized vmap blocks because they are removed from the
free list once a block's free space becomes zero.
If the block is not fully unmapped, then it is not on the purge list
either.
So neither the per CPU list iteration nor the purge list walk find the
block and if the page was mapped via such a block and the TLB has not yet
been flushed, the guarantee of _vm_unmap_aliases() that there are no stale
TLBs after returning is broken:
x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
vb_free(x) // Unmaps page and marks in dirty_min/max range
// Block has still mappings and is not put on purge list
// Page is reused
vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
So instead of walking the per CPU free lists, walk the per CPU xarrays
which hold pointers to _all_ active blocks in the system including those
removed from the free lists.
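A sketch of the new walk (condensed; the per-CPU xarray field names follow the mm/vmalloc.c internals and are assumptions here):

static void walk_all_vmap_blocks_sketch(void)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
                struct vmap_block *vb;
                unsigned long idx;

                /* The xarray holds every active block, including the fully
                 * utilized ones that already left the free list. */
                xa_for_each(&vbq->vmap_blocks, idx, vb) {
                        spin_lock(&vb->lock);
                        /* ... accumulate vb->dirty_min/dirty_max into the
                         * range to be flushed ... */
                        spin_unlock(&vb->lock);
                }
        }
}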
Jim Cromie [Thu, 25 May 2023 17:43:56 +0000 (11:43 -0600)]
kmemleak-test: drop __init to get better backtrace
Drop the __init from kmemleak_test_init(). With __init, the function's
storage is reclaimed after boot, but then the symbol isn't available for
"%pS" rendering, and the backtrace gets a bare pointer where the actual
leak happened.
T.J. Alumbaugh [Mon, 22 May 2023 11:20:57 +0000 (11:20 +0000)]
mm: multi-gen LRU: add helpers in page table walks
Add helpers to the page table walking code:
- Clarify intent via the names "should_walk_mmu" and "should_clear_pmd_young"
- Avoid repeating the same logic in two places
Haifeng Xu [Mon, 22 May 2023 09:52:33 +0000 (09:52 +0000)]
selftests: cgroup: fix unexpected failure on test_memcg_low
Since commit f079a020ba95 ("selftests: memcg: factor out common parts of
memory.{low,min} tests"), the value used in second alloc_anon has changed
from 148M to 170M. Because memory.low allows reclaiming page cache in
child cgroups, so the memory.current is close to 30M instead of 50M.
Therefore, adjust the expected value of parent cgroup.
Andrew Morton [Mon, 22 May 2023 20:52:10 +0000 (13:52 -0700)]
mm/mlock: rename mlock_future_check() to mlock_future_ok()
It is felt that the name mlock_future_check() is vague - it doesn't
particularly convey the function's operation. mlock_future_ok() is a
clearer name for a predicate function.
Lorenzo Stoakes [Mon, 22 May 2023 08:24:12 +0000 (09:24 +0100)]
mm/mmap: refactor mlock_future_check()
In all but one instance, mlock_future_check() is treated as a boolean
function despite returning an error code. In one instance, this error
code is ignored and replaced with -ENOMEM.
This is confusing, and the inversion of true -> failure, false -> success
is not warranted. Convert the function to a bool, lightly refactor and
return true if the check passes, false if not.
selftests/mm: gup_longterm: new functional test for FOLL_LONGTERM
Let's add a new test for checking whether GUP long-term page pinning works
as expected (R/O vs. R/W, MAP_PRIVATE vs. MAP_SHARED, GUP vs.
GUP-fast). Note that COW handling with long-term R/O pinning in private
mappings, and pinning of anonymous memory in general, is tested by the COW
selftest. This test, therefore, focuses on page pinning in file mappings.
The most interesting case is probably the "local tmpfile" case, as that
will likely end up on a "real" filesystem such as ext4 or xfs, not on a
virtual one like tmpfs or hugetlb where any long-term page pinning is
always expected to succeed.
For now, only add tests that use the "/sys/kernel/debug/gup_test"
interface. We'll add tests based on liburing separately next.