Peter Xu [Tue, 5 Mar 2024 04:37:46 +0000 (12:37 +0800)]
mm/kasan: use pXd_leaf() in shadow_mapped()
There is an old trick in shadow_mapped() to use pXd_bad() to detect huge
pages. After commit 93fab1b22ef7 ("mm: add generic p?d_leaf() macros") we
have a global API for huge mappings. Use that to replace the trick.
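A minimal sketch of the idea (illustrative only, not the exact shadow_mapped()
diff): test the huge/leaf property with the generic helper instead of relying
on the pXd_bad() trick:

    #include <linux/pgtable.h>

    /* Illustrative only: check whether a PMD entry maps a huge (leaf) page. */
    static bool pmd_is_huge_mapping(pmd_t pmd)
    {
            return pmd_leaf(pmd);   /* generic leaf check, was the pmd_bad() trick */
    }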
Peter Xu [Tue, 5 Mar 2024 04:37:42 +0000 (12:37 +0800)]
mm/powerpc: replace pXd_is_leaf() with pXd_leaf()
They're the same macros underneath. Drop pXd_is_leaf() and instead always
use pXd_leaf().
In the meantime, instead of renaming them, drop the pXd_is_leaf() fallback
definitions directly in arch/powerpc/include/asm/pgtable.h, because
similar fallback macros for pXd_leaf() are already defined in
include/linux/pgtable.h.
Peter Xu [Tue, 5 Mar 2024 04:37:41 +0000 (12:37 +0800)]
mm/powerpc: define pXd_large() with pXd_leaf()
Patch series "mm/treewide: Replace pXd_large() with pXd_leaf()", v3.
These two APIs are almost always the same. It's confusing to have both of
them. Merge them into one. Here I used pXd_leaf() only because
pXd_leaf() is a global API which is always defined, while pXd_large() is
not.
We have yet one more similar API, pXd_huge(), but that's even trickier, so
let's do it step by step.
Special care is needed for ppc and x86; those are handled as separate
cleanups first.
This patch (of 10):
The two definitions are the same. The only difference is that pXd_large()
is only defined with THP selected, and only on book3s 64-bit.
Instead of implementing it twice, make pXd_large() an alias of pXd_leaf().
Define it unconditionally just like pXd_leaf(). This helps prepare for
merging the two APIs.
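A sketch of the shape of the change (illustrative, not the exact powerpc
header diff): the old name simply becomes an alias of the generic helper,
defined unconditionally:

    /* Illustrative only: alias the old names to the generic helpers. */
    #define pmd_large(pmd)      pmd_leaf(pmd)
    #define pud_large(pud)      pud_leaf(pud)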
Chengming Zhou [Tue, 5 Mar 2024 07:53:45 +0000 (07:53 +0000)]
mm/zswap: global lru and shrinker shared by all zswap_pools fix
Commit bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by all
zswap_pools") introduced a new lock to protect zswap_next_shrink, instead
of reusing zswap_pools_lock.
But the problem is that it's initialized only when zswap is enabled, which
causes a bug if zswap_memcg_offline_cleanup() is called without zswap
enabled.
Fix it by using DEFINE_SPINLOCK() to statically initialize them, and define
them as multiple static variables to keep them consistent with the existing
global variables in zswap.
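A sketch of the fix idea (the lock name below is assumed for illustration;
zswap_next_shrink comes from the description above): statically initialize
the lock so it is valid even if zswap is never enabled:

    /* Illustrative only: static initialization instead of runtime init. */
    static struct mem_cgroup *zswap_next_shrink;
    static DEFINE_SPINLOCK(zswap_shrink_lock);      /* assumed name */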
Qi Zheng [Mon, 4 Mar 2024 11:07:20 +0000 (19:07 +0800)]
s390: supplement for ptdesc conversion
After commit 6326c26c1514 ("s390: convert various pgalloc functions to use
ptdescs"), there are still some positions that use page->{lru, index}
instead of ptdesc->{pt_list, pt_index}. In order to make the use of
ptdesc->{pt_list, pt_index} clearer, it would be better to convert them as
well.
Qi Zheng [Mon, 4 Mar 2024 11:07:19 +0000 (19:07 +0800)]
mm: pgtable: add missing pt_index to struct ptdesc
In s390, the page->index field is used for gmap (see gmap_shadow_pgt()),
so add the corresponding pt_index to struct ptdesc and add a comment to
clarify this.
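A simplified sketch of the idea (not the exact upstream struct ptdesc
layout): pt_index overlays page->index the same way pt_list overlays
page->lru, and carries a comment about its s390 gmap user:

    #include <linux/types.h>
    #include <linux/list.h>

    /* Simplified illustration; the real struct ptdesc has more fields. */
    struct ptdesc_sketch {
            unsigned long __page_flags;             /* overlays page->flags */
            union {
                    struct list_head pt_list;       /* overlays page->lru */
            };
            union {
                    pgoff_t pt_index;       /* overlays page->index; s390 gmap */
            };
    };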
Qi Zheng [Mon, 4 Mar 2024 11:07:18 +0000 (19:07 +0800)]
mm: pgtable: correct the wrong comment about ptdesc->__page_flags
Patch series "minor fixes and supplement for ptdesc".
In this series, the [PATCH 1/3] and [PATCH 2/3] are fixes for some issues
discovered during code inspection.
[PATCH 3/3] is a supplement to the ptdesc conversion in s390. I don't know
why this was not done in commit 6326c26c1514 ("s390: convert various
pgalloc functions to use ptdescs"); maybe I missed something. And since I
don't have an s390 environment, I hope the kernel test robot can help
compile and test it, which is why I did not fold [PATCH 2/3] and
[PATCH 3/3] into one patch.
This patch (of 3):
Commit 32cc0b7c9d50 ("powerpc: add pte_free_defer() for pgtables
sharing page") introduced the use of the PageActive flag for page table
fragment tracking, so ptdesc->__page_flags is no longer unused; correct
the wrong comment.
We actually add folios to the pagelist already, but then work with them as
pages. Removes a call to compound_head() in PageKsm() and removes a
reference to page->index.
Barry Song [Tue, 27 Feb 2024 10:42:01 +0000 (23:42 +1300)]
mm: make folio_pte_batch available outside of mm/memory.c
madvise, mprotect and some others might need folio_pte_batch to check if a
range of PTEs are completely mapped to a large folio with contiguous
physical addresses. Let's make it available in mm/internal.h.
While at it, add proper kernel doc and sanity-check more input parameters
using two additional VM_WARN_ON_FOLIO().
Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the
page & folio into a stack variable so we don't hit BUG_ON() if an
allocation is freed under us and what was a folio pointer becomes a
pointer to a tail page.
We have now successfully removed all of the uses of some of the PageFlags
from the kernel, but there's nothing to stop somebody reintroducing them.
By splitting out FOLIO_FLAGS from PAGEFLAGS, we can stop defining the old
flags; and we do that in some of the later patches.
After doing this, I realised that dump_page() was living dangerously; we
could end up calling folio_test_foo() on a pointer which no longer pointed
to a folio (as dump_page() is not necessarily called when the caller has a
reference to the page). So I fixed that up.
And then I realised that this was the key to making dump_page() take a
const argument, which means we can constify the page flags testing, which
means we can remove more cast-away-the-const bad code.
And here's where I ended up.
This patch (of 8):
We've progressed far enough with the folio transition that some flags are
now no longer checked on pages, but only on folios. To prevent new users
appearing, prepare to only define the folio versions of the flag
test/set/clear.
Gang Li [Thu, 22 Feb 2024 14:04:21 +0000 (22:04 +0800)]
hugetlb: parallelize 1G hugetlb initialization
Optimizing the initialization speed of 1G huge pages through
parallelization.
1G hugetlbs are allocated from bootmem, a process that is already very
fast and does not currently require optimization. Therefore, we focus on
parallelizing only the initialization phase in `gather_bootmem_prealloc`.
Here are some test results:
test case             no patch(ms)   patched(ms)   saved
-------------------   ------------   -----------   ------
256c2T(4 node) 1G     4745           2024          57.34%
128c1T(2 node) 1G     3358           1712          49.02%
12T 1G                77000          18300         76.23%
Gang Li [Thu, 22 Feb 2024 14:04:20 +0000 (22:04 +0800)]
hugetlb: parallelize 2M hugetlb allocation and initialization
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster, thereby
improving the boot speed.
Here are some test results:
test case             no patch(ms)   patched(ms)   saved
-------------------   ------------   -----------   ------
256c2T(4 node) 2M     3336           1051          68.52%
128c1T(2 node) 2M     1943           716           63.15%
Gang Li [Thu, 22 Feb 2024 14:04:18 +0000 (22:04 +0800)]
padata: downgrade padata_do_multithreaded to serial execution for non-SMP
hugetlb parallelization depends on PADATA, and PADATA depends on SMP.
PADATA consists of two distinct functionalities: one part is
padata_do_multithreaded, which disregards order and simply divides tasks
into several groups for parallel execution. Hugetlb init parallelization
depends on padata_do_multithreaded.
The other part is composed of a set of APIs that, while handling data in
an out-of-order parallel manner, can eventually return the data in an
ordered sequence. Currently only crypto/pcrypt.c uses them.
All non-SMP users of PADATA currently only use padata_do_multithreaded.
It is easy to implement a serial fallback for it in include/linux/padata.h,
and it is not necessary to implement the other functionality unless
crypto/pcrypt.c, its only user, stops depending on SMP in the future.
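A hedged sketch of that serial fallback (the real helper lives in
include/linux/padata.h; the struct fields are the existing padata_mt_job
members, but the guard and body here are illustrative):

    #ifndef CONFIG_SMP
    /* No parallel workers available: just run the whole range serially. */
    static inline void padata_do_multithreaded(struct padata_mt_job *job)
    {
            job->thread_fn(job->start, job->start + job->size, job->fn_arg);
    }
    #endif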
Gang Li [Thu, 22 Feb 2024 14:04:17 +0000 (22:04 +0800)]
padata: dispatch works on different nodes
When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.
Thus, a numa_aware flag is introduced here, allowing tasks to be distributed
across different nodes to fully utilize the advantage of multi-node
systems.
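An illustrative caller-side sketch of the new knob (only .numa_aware comes
from this patch; the worker function, context and sizing values are made up
for the example):

    #include <linux/padata.h>
    #include <linux/nodemask.h>

    /* Hypothetical per-chunk worker; initializes items in [start, end). */
    static void __init init_chunk_fn(unsigned long start, unsigned long end,
                                     void *arg)
    {
            /* ... work on items [start, end) using arg ... */
    }

    static void __init parallel_init_example(void *ctx, unsigned long nr_items)
    {
            struct padata_mt_job job = {
                    .thread_fn   = init_chunk_fn,
                    .fn_arg      = ctx,
                    .start       = 0,
                    .size        = nr_items,
                    .align       = 1,
                    .min_chunk   = nr_items / num_node_state(N_MEMORY),
                    .max_threads = 2 * num_node_state(N_MEMORY),
                    .numa_aware  = true,    /* spread work across nodes */
            };

            padata_do_multithreaded(&job);
    }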
Gang Li [Thu, 22 Feb 2024 14:04:16 +0000 (22:04 +0800)]
hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
With parallelization of hugetlb allocation across different threads, each
thread works on a different node to allocate pages from, instead of all
allocating from a common node h->next_nid_to_alloc. To address this, it's
necessary to assign a separate next_nid_to_alloc for each thread.
Consequently, the hstate_next_node_to_alloc and
for_each_node_mask_to_alloc have been modified to directly accept a
*next_nid_to_alloc parameter, ensuring thread-specific allocation and
avoiding concurrent access issues.
Gang Li [Thu, 22 Feb 2024 14:04:15 +0000 (22:04 +0800)]
hugetlb: split hugetlb_hstate_alloc_pages
1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic.
Gang Li [Thu, 22 Feb 2024 14:04:14 +0000 (22:04 +0800)]
hugetlb: code clean for hugetlb_hstate_alloc_pages
Patch series "hugetlb: parallelize hugetlb page init on boot", v6.
Introduction
------------
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes
1-2 seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB
Intel host takes more than 1 minute[1]. This is a noteworthy figure.
Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure like
padata_do_multithreaded; this patch uses it to achieve effective results
with minimal modifications.
To fully utilize the CPU, the number of parallel threads needs to be
carefully considered. `max_threads = num_node_state(N_MEMORY)` does not
fully utilize the CPU, so we need to multiply it by a multiplier.
Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements, the
gains are marginal.
Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient
resource management.
Test result
-----------
test case             no patch(ms)   patched(ms)   saved
-------------------   ------------   -----------   ------
256c2T(4 node) 1G     4745           2024          57.34%
128c1T(2 node) 1G     3358           1712          49.02%
12T 1G                77000          18300         76.23%
The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
code, its readability can be improved, facilitating future modifications.
This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.
- hugetlb_hstate_alloc_pages_node_specific(), which iterates through
each online node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report(), which reports errors during
allocation and updates the value of h->max_huge_pages accordingly.
Chengming Zhou [Wed, 28 Feb 2024 02:38:54 +0000 (02:38 +0000)]
mm/zsmalloc: don't need to reserve LSB in handle
We save the allocated tag in the object header to indicate that it's
allocated.
handle |= OBJ_ALLOCATED_TAG;
So the object header needs to reserve LSB for this tag bit.
But the handle itself doesn't need to reserve the LSB to save the tag,
since it's only used to find the position of the object, by (pfn +
obj_idx). So remove the LSB reservation from the handle; one more bit can
be used for obj_idx.
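A simplified sketch of the encoding described above (the constant and
helper name are illustrative, not the exact zsmalloc code): only the object
header keeps the LSB for OBJ_ALLOCATED_TAG, while the handle packs pfn and
obj_idx with no reserved bit:

    #define OBJ_INDEX_BITS_SKETCH   12      /* assumed width for the example */

    /* Illustrative only: pack pfn and obj_idx into a handle value. */
    static unsigned long location_to_handle(unsigned long pfn,
                                            unsigned int obj_idx)
    {
            /* No LSB reserved in the handle; the tag lives in the header. */
            return (pfn << OBJ_INDEX_BITS_SKETCH) | obj_idx;
    }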
John Hubbard [Wed, 28 Feb 2024 03:41:51 +0000 (19:41 -0800)]
mm/memory.c: do_numa_page(): remove a redundant page table read
do_numa_page() is reading from the same page table entry, twice, while
holding the page table lock: once while checking that the pte hasn't
changed, and again in order to modify the pte.
Instead, just read the pte once, and save it in the same old_pte variable
that already exists. This has no effect on behavior, other than to
provide a tiny potential improvement to performance, by avoiding the
redundant memory read (which the compiler cannot elide, due to
READ_ONCE()).
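A hedged sketch of the read-once pattern, shown as a fragment from inside a
fault handler where vmf, vma, old_pte and pte are in scope (not the exact
do_numa_page() hunk):

    spin_lock(vmf->ptl);
    old_pte = ptep_get(vmf->pte);                     /* single read */
    if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
            pte_unmap_unlock(vmf->pte, vmf->ptl);
            return 0;
    }
    pte = pte_modify(old_pte, vma->vm_page_prot);     /* reuse, no re-read */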
alloc_contig_migrate_range has all the information needed to understand
big contiguous allocation latency: for example, how many pages were
migrated, and how many times they needed to be unmapped from page tables.
This patch adds a trace event to collect the allocation statistics. In
the field, it was quite useful for understanding CMA allocation latency.
We already have a folio; use it instead of the head page where reasonable.
Saves a couple of calls to compound_head() and eliminates a few
references to page->mapping.
Huang Shijie [Tue, 27 Feb 2024 01:49:52 +0000 (09:49 +0800)]
crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabled
In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configured, the kernel
will use vmemmap to do __pfn_to_page/page_to_pfn, and will not use the
"classic sparse" implementation for them.
So export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configured. This
lets user applications (crash, etc.) get faster
pfn_to_page/page_to_pfn operations too.
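A hedged sketch of the export (the exact vmcoreinfo macro used upstream may
differ; VMCOREINFO_SYMBOL_ARRAY is how similar symbols such as mem_section
are exported):

    #ifdef CONFIG_SPARSEMEM_VMEMMAP
            VMCOREINFO_SYMBOL_ARRAY(vmemmap);
    #endif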
Changbin Du [Tue, 27 Feb 2024 02:35:46 +0000 (10:35 +0800)]
modules: wait do_free_init correctly
The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.
Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allow us to call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the guarantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this, callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).
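A minimal sketch of that change (placement is illustrative; only the
flush_work(&init_free_wq) call replacing rcu_barrier() comes from the
description above):

    /* Wait for all queued do_free_init() work before the W+X check runs. */
    flush_work(&init_free_wq);      /* was: rcu_barrier() */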
Without this fix, we could still encounter false positive reports in W+X
checking, since the rcu_barrier() here cannot ensure the ordering now.
Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier() introduces ~0.1s delay on a
PREEMPT_RT kernel.
[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process
mm: use a folio in __collapse_huge_page_copy_succeeded()
These pages are all chained together through the lru list, so we know
they're folios. Use the folio APIs to save three hidden calls to
compound_head().
The few folios which can't be moved to the LRU list (because their
refcount dropped to zero) used to be returned to the caller to dispose of.
Make this simpler to call by freeing the folios directly through
free_unref_folios().
Use free_unref_page_batch() to free the folios. This may increase the
number of IPIs from calling try_to_unmap_flush() more often, but that's
going to be very workload-dependent. It may even reduce the number of
IPIs as we now batch-free large folios instead of freeing them one at a
time.
mm: allow non-hugetlb large folios to be batch processed
Hugetlb folios still get special treatment, but normal large folios can
now be freed by free_unref_folios(). This should have a reasonable
performance impact, TBD.
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare()
destroys the ability to call folio_order(), so stash the order in
folio->private for the benefit of the second loop.
Pass a pointer to the lruvec so we can take advantage of the
folio_lruvec_relock_irqsave(). Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.
Iterate over a folio_batch rather than a linked list. This is easier for
the CPU to prefetch and has a batch count naturally built in so we don't
need to track it. Again, this lowers the maximum lock hold time from
32 folios to 15, but I do not expect this to have a significant effect.
Most of its callees are not yet ready to accept a folio, but we know all
of the pages passed in are actually folios because they're linked through
->lru.
mm: make folios_put() the basis of release_pages()
Patch series "Rearrange batched folio freeing", v3.
Other than the obvious "remove calls to compound_head" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing). There's also an
associated belief that since we iterate the batch of folios three times,
we do better when the array is small (ie 15 entries) than we do with a
batch that is hundreds of entries long, which only gives us the
opportunity for the first pages to fall out of cache by the time we get to
the end.
It is possible we should increase the size of folio_batch. Hopefully the
bots let us know if this introduces any performance regressions.
This patch (of 3):
By making release_pages() call folios_put(), we can get rid of the calls
to compound_head() for the callers that already know they have folios. We
can also get rid of the lock_batch tracking as we know the size of the
batch is limited by folio_batch. This does reduce the maximum number of
pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX (32) to
PAGEVEC_SIZE (15). I do not expect this to make a significant difference,
but if it does, we can increase PAGEVEC_SIZE to 31.
Lance Yang [Tue, 27 Feb 2024 03:51:35 +0000 (11:51 +0800)]
mm/khugepaged: keep mm in mm_slot without MMF_DISABLE_THP check
Previously, we removed the mm from mm_slot and dropped mm_count
if the MMF_DISABLE_THP flag was set. However, we didn't re-add
the mm back after clearing the MMF_DISABLE_THP flag. Additionally,
we add a check for the MMF_DISABLE_THP flag in hugepage_vma_revalidate().
mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins()
Patch series "mm: remove total_mapcount()", v2.
Let's remove the remaining user from mm/memfd.c so we can get rid of
total_mapcount().
This patch (of 2):
Both functions are the remaining users of total_mapcount(). Let's get rid
of the calls by converting the code to folios.
As it turns out, the code is unnecessarily complicated, especially:
1) We can query the number of pagecache references for a folio simply via
folio_nr_pages(). This will handle other folio sizes in the future
correctly.
2) The xas_set(xas, page->index + cache_count) call to increment the
iterator for large folios is not required. Remove it.
Further, simplify the XA_CHECK_SCHED check, counting each entry exactly
once.
Memfd pages can be swapped out when using shmem; leave xa_is_value()
checks in place.
Zi Yan [Mon, 26 Feb 2024 20:55:33 +0000 (15:55 -0500)]
mm: thp: split huge page to any lower order pages
To split a THP to any lower order pages, we need to reform THPs on
subpages at a given order and add the page refcount based on the new page
order.
Also we need to reinitialize page_deferred_list after removing the page
from the split_queue, otherwise a subsequent split will see list
corruption when checking the page_deferred_list again.
Note: Anonymous order-1 folio is not supported because _deferred_list,
which is used by partially mapped folios, is stored in subpage 2 and an
order-1 folio only has subpage 0 and 1. File-backed order-1 folios are
fine, since they do not use _deferred_list.
Zi Yan [Mon, 26 Feb 2024 20:55:31 +0000 (15:55 -0500)]
mm: memcg: make memcg huge page split support any order split
It sets memcg information for the pages after the split. A new parameter
new_order is added to tell the order of subpages in the new page, always 0
for now. It prepares for upcoming changes to support split huge page to
any lower order.
Folios of order 1 have no space to store the deferred list. This is not a
problem for the page cache as file-backed folios are never placed on the
deferred list. All we need to do is prevent the core MM from touching the
deferred list for order 1 folios and remove the code which prevented us
from allocating order 1 folios.
Zi Yan [Mon, 26 Feb 2024 20:55:27 +0000 (15:55 -0500)]
mm/huge_memory: only split PMD mapping when necessary in unmap_folio()
Patch series "Split a folio to any lower order folios", v5.
File folios support any order and multi-size THP is upstreamed[1], so both
file and anonymous folios can be >0 order. Currently, split_huge_page()
only splits a huge page to order-0 pages, but splitting to orders higher
than 0 might better utilize large folios, if done properly. In addition,
Large Block Size support in XFS would benefit from it during truncate[2].
This patchset adds support for splitting a large folio to any lower order
folios.
In addition to this implementation of split_huge_page_to_list_to_order(),
a possible optimization could be splitting a large folio to arbitrary
smaller folios instead of a single order. As both Hugh and Ryan pointed
out [3,5], splitting to a single order might not be optimal; an order-9
folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and
2 order-0 folios, depending on subsequent folio operations. Leave this as
future work.
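A hedged usage sketch of the interface named above (the argument meanings
are assumed from the description: the page, an optional list for the
resulting folios, and the target order; the wrapper name is made up):

    /* Illustrative only: split a locked, referenced folio to new_order. */
    static int try_split_to_order(struct folio *folio, unsigned int new_order)
    {
            return split_huge_page_to_list_to_order(&folio->page, NULL,
                                                    new_order);
    }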
As multi-size THP support is added, not all THPs are PMD-mapped, thus
during a huge page split, there is no need to always split PMD mapping in
unmap_folio(). Make it conditional.
Introduce a GFP bits enumeration to let the compiler track the number of
used bits (which depends on the config options) instead of hardcoding
them. That simplifies the __GFP_BITS_SHIFT calculation.
Barry Song [Mon, 26 Feb 2024 00:57:39 +0000 (13:57 +1300)]
mm: madvise: pageout: ignore references rather than clearing young
While doing MADV_PAGEOUT, the current code will clear PTE young so that
vmscan won't read young flags to allow the reclamation of madvised folios
to go ahead. It seems we can do it by directly ignoring references, thus
we can remove tlb flush in madvise and rmap overhead in vmscan.
Regarding the side effect: in the original code, if a parallel thread
accesses the madvised memory side by side with the thread doing madvise,
folios get a chance to be re-activated by vmscan (though the time gap is
actually quite small, since checking PTEs is done immediately after
clearing their young bits). With this patch, they will still be reclaimed.
But this behaviour, doing PAGEOUT and accessing the memory at the same
time, is quite silly, like a DoS, so we probably don't need to care; or
rather, ignoring the new accesses during that quite small time gap is even
better.
For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep
its behaviour as is since a physical address might be mapped by multiple
processes. MADV_PAGEOUT based on virtual address is actually much more
aggressive on reclamation. To leave paddr's DAMOS_PAGEOUT untouched, we
simply pass ignore_references as false to reclaim_pages().
A microbench as below has shown a 6% decrease in the latency of
MADV_PAGEOUT (only the core of the benchmark is shown; p points to the
madvised buffer):

    for (i = 0; i < SIZE / sizeof(long); i += PGSIZE / sizeof(long))
            p[i] = 0x11;
    madvise(p, SIZE, MADV_PAGEOUT);
    w/o patch                       w/ patch
    root@10:~# time ./a.out         root@10:~# time ./a.out
    real    0m49.634s               real    0m46.334s
    user    0m0.637s                user    0m0.648s
    sys     0m47.434s               sys     0m44.265s
The contpte symbols must be exported since some of the public inline
ptep_* APIs are called from modules and these inlines now call the contpte
functions. Originally they were exported as EXPORT_SYMBOL() for fear of
breaking out-of-tree modules. But we subsequently concluded that
EXPORT_SYMBOL_GPL() should be safe since these functions are deeply core
mm routines, and any module operating at this level is not going to be
able to survive on EXPORT_SYMBOL alone.
Barry Song [Sat, 24 Feb 2024 22:47:51 +0000 (11:47 +1300)]
Docs/mm/damon/design: remove the details for pageout as paddr doesn't use MADV_PAGEOUT
The doc needs a fix. Only in the case of virtual addresses do we call
madvise() with MADV_PAGEOUT; in the case of physical addresses, we call
reclaim_pages() directly. MADV_PAGEOUT on virtual addresses is much more
aggressive at reclaiming memory compared to reclaim_pages() on a paddr
region. This patch removes the details so that the description can apply
to both cases, and so that we don't couple the documentation with the
implementation details.
kasan: fix a2 allocation and remove explicit cast in atomic tests
Address the additional feedback since commit 4e76c8cc3378 ("kasan: add
atomic tests") by removing an explicit cast and fixing the size as well as
the check of the allocation of `a2`.
Dan Carpenter [Fri, 23 Feb 2024 14:20:13 +0000 (17:20 +0300)]
lib/stackdepot: off by one in depot_fetch_stack()
The stack_pools[] array has DEPOT_MAX_POOLS. The "pools_num" tracks the
number of pools which are initialized. See depot_init_pool() for more
details.
If pool_index == pools_num_cached, this will read one element beyond what
we want. If not all the pools are initialized, then the pool will be
NULL, triggering a WARN(), and if they are all initialized it will read
one element beyond the end of the array.
Carlos Galo [Fri, 23 Feb 2024 17:32:49 +0000 (17:32 +0000)]
mm: update mark_victim tracepoints fields
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it’s not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
In Android each installed application has a unique UID. Including
the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
Enables identification of the affected process.
- OOM Score
Will allow userspace to get additional insight into the relative kill
priority of the OOM victim. In Android, the oom_score_adj is used to
categorize app state (foreground, background, etc.), which aids in
analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
Amount of memory used by the victim that will, potentially, be freed up
by killing it.
hugetlb: allow faults to be handled under the VMA lock
Hugetlb can now safely handle faults under the VMA lock, so allow it to do
so.
This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb
counters, and expects the counters to remain unchanged on failure to
handle a fault.
In hugetlb_no_page(), vmf_anon_prepare() may bailout with no anon_vma
under the VMA lock after allocating a folio for the hugepage. In
free_huge_folio(), this folio is completely freed on bailout iff there is
a surplus of hugetlb pages. This will remove a folio off the freelist and
decrement the number of hugepages while ltp expects these counters to
remain unchanged on failure.
Originally this could only happen due to OOM failures, but now it may also
occur after we allocate a hugetlb folio without a suitable anon_vma under
the VMA lock. This should only happen for the first freshly allocated
hugepage in this vma.
hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()
hugetlb_no_page() and hugetlb_wp() call anon_vma_prepare(). In
preparation for hugetlb to safely handle faults under the VMA lock, use
vmf_anon_prepare() here instead.
Additionally, passing hugetlb_wp() the vm_fault struct from
hugetlb_fault() works toward cleaning up the hugetlb code and function
stack.
hugetlb: move vm_fault declaration to the top of hugetlb_fault()
hugetlb_fault() currently defines a vm_fault to pass to the generic
handle_userfault() function. We can move this definition to the top of
hugetlb_fault() so that it can be used throughout the rest of the hugetlb
fault path.
This will help cleanup a number of excess variables and function arguments
throughout the stack. Also, since vm_fault already has space to store the
page offset, use that instead and get rid of idx.
mm/memory: change vmf_anon_prepare() to be non-static
Patch series "Handle hugetlb faults under the VMA lock", v2.
It is generally safe to handle hugetlb faults under the VMA lock. The
only time this is unsafe is when no anon_vma has been allocated to this
vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare() to
bailout if necessary. This should only happen for the first hugetlb page
in the vma.
Additionally, this patchset begins to use struct vm_fault within
hugetlb_fault(). This works towards cleaning up hugetlb code, and should
significantly reduce the number of arguments passed to functions.
The last patch in this series may cause ltp hugemmap10 to "fail". This is
because vmf_anon_prepare() may bailout with no anon_vma under the VMA lock
after allocating a folio for the hugepage. In free_huge_folio(), this
folio is completely freed on bailout iff there is a surplus of hugetlb
pages. This will remove a folio off the freelist and decrement the number
of hugepages while ltp expects these counters to remain unchanged on
failure. The rest of the ltp testcases pass.
This patch (of 2):
In order to handle hugetlb faults under the VMA lock, hugetlb can use
vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change it
to be a non-static function so it can be used within hugetlb as well.
Yosry Ahmed [Thu, 22 Feb 2024 19:09:11 +0000 (19:09 +0000)]
x86/mm: always pass NULL as the first argument of switch_mm_irqs_off()
The first argument of switch_mm_irqs_off() is unused by the x86
implementation. Make sure that x86 code never passes a non-NULL value to
make this clear. Update the only non-violating caller, switch_mm().
Yosry Ahmed [Thu, 22 Feb 2024 19:09:10 +0000 (19:09 +0000)]
x86/mm: further clarify switch_mm_irqs_off() documentation
Commit accf6b23d1e5a ("x86/mm: clarify "prev" usage in
switch_mm_irqs_off()") attempted to clarify x86's usage of the arguments
passed by generic code, specifically the "prev" argument that is unused by
x86. However, it could have done a better job with the comment above
switch_mm_irqs_off(). Rewrite this comment according to Dave Hansen's
suggestion.
Matthew Cassell [Thu, 22 Feb 2024 19:46:17 +0000 (19:46 +0000)]
mm/util.c: add byte count to __vm_enough_memory failure warning
Commit 44b414c8715c5dcf53288 ("mm/util.c: add warning if
__vm_enough_memory fails") adds debug information which gives the process
id and executable name should __vm_enough_memory() fail. Adding the
number of pages to the failure message would benefit application
developers and system administrators in debugging overambitious memory
requests by providing a point of reference to the amount of memory causing
__vm_enough_memory() to fail.
1. Set appropriate kernel tunable to reach code path for failure
message:
# echo 2 > /proc/sys/vm/overcommit_memory
2. Test program to generate failure - requests 1 gibibyte per
iteration:
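A hypothetical reproducer along those lines (this is not the commit's
original test program): with vm.overcommit_memory set to 2, each 1 GiB
mapping is charged against the commit limit at mmap() time, so the loop
eventually fails and triggers the __vm_enough_memory warning:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(void)
    {
            const size_t one_gib = 1UL << 30;

            for (;;) {
                    /* With vm.overcommit_memory=2 the charge happens here. */
                    void *p = mmap(NULL, one_gib, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                    if (p == MAP_FAILED) {
                            printf("mmap: %s\n", strerror(errno));
                            return 0;
                    }
            }
    }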
Chengming Zhou [Fri, 16 Feb 2024 08:55:05 +0000 (08:55 +0000)]
mm/zswap: change zswap_pool kref to percpu_ref
All zswap entries take a reference on the zswap_pool in zswap_store(),
and drop it when freed. Changing it to use a percpu_ref is better for
scalability.
Although percpu_ref uses a bit more memory, that should be OK for our use
case, since we almost always have only one zswap_pool in use. The
performance gain is for the zswap_store/load hotpath.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
            mm-unstable     zswap-global-lru
    real    63.20           63.12
    user    1061.75         1062.95
    sys     268.74          264.44
Chengming Zhou [Fri, 16 Feb 2024 08:55:04 +0000 (08:55 +0000)]
mm/zswap: global lru and shrinker shared by all zswap_pools
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3.
Dynamic pool creation has been supported for a long time, though it is
maybe not used so much in practice. But with the per-memcg lru merged, the
current structure of zswap_pool's lru and shrinker becomes less optimal.
In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the latest zswap_pool will be the one currently used.
1. When there is memory pressure, all the zswap_pools' shrinkers will try
to shrink their lru lists; there is no order between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its own lru, which is inefficient.
A more natural way is to have a global zswap lru shared between all
zswap_pools, and so is the shrinker. The code becomes much simpler too.
Another optimization is changing the zswap_pool kref to a percpu_ref,
which is taken as a reference by every zswap entry, so scalability is
better.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
            mm-unstable     zswap-global-lru
    real    63.20           63.12
    user    1061.75         1062.95
    sys     268.74          264.44
This patch (of 3):
Dynamic zswap_pool creation may create/reuse multiple zswap_pools in a
list, and only the first one is currently used.
Each zswap_pool has its own lru and shrinker, which is not necessary and
has its problems:
1. When there is memory pressure, all the zswap_pools' shrinkers will
try to shrink their own lrus; there is no order between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its lru list. The rationale here was to
try and empty the old pool first so that we can completely
drop it. However, since we only support exclusive loads now,
the LRU ordering should be entirely decided by the order of
stores, so the oldest entries on the LRU will naturally be
from the oldest pool.
Anyway, having a global lru and shrinker shared by all zswap_pools is
better and more efficient.
The twist here is that the error value is passed by reference, so that the
iterator can restore it when breaking out of the loop.
Handling of the magic AOP_WRITEPAGE_ACTIVATE value stays outside the
iterator and is just kept in the write_cache_pages legacy wrapper, in
preparation for eventually killing it off.
Heavily based on a for_each* based iterator from Matthew Wilcox.
Add a loop counter inside the folio_batch to let us iterate from 0-nr
instead of decrementing nr and treating the batch as a stack. It would
generate some very weird and suboptimal I/O patterns for page writeback to
iterate over the batch as a stack.
writeback: simplify the loops in write_cache_pages()
Collapse the two nested loops into one. This is needed as a step towards
turning this into an iterator.
Note that this drops the "index <= end" check in the previous outer loop
and just relies on filemap_get_folios_tag() to return 0 entries when index
> end. This actually has a subtle implication when end == -1 because then
the returned index will be -1 as well and thus if there is page present on
index -1, we could be looping indefinitely. But as the comment in
filemap_get_folios_tag documents this as already broken anyway we should
not worry about it here either. The fix for that would probably be a change
to the filemap_get_folios_tag() calling convention.
writeback: factor writeback_get_batch() out of write_cache_pages()
This simple helper will be the basis of the writeback iterator. To make
this work, we need to remember the current index and end positions in
writeback_control.