Git Repo - linux.git/log

mm/ksm: use folio in remove_stable_node

Pages in the stable tree are all single normal pages, so use
ksm_get_folio() and folio_set_stable_node(). This also saves 3 calls to
compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi (tencent) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Izik Eidus <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Chris Wright <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/ksm: add folio_set_stable_node

Add folio_set_stable_node() as a wrapper around set_page_stable_node(), and
use it to replace the latter. The two will be merged once all callers have
been converted to folios.
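
A minimal userspace sketch of this transitional-wrapper pattern follows.
All toy_* names and structures are hypothetical stand-ins, not kernel code,
and the real setter also encodes a flag in the mapping pointer, which is
omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real kernel structures are far larger. */
struct toy_stable_node { int id; };
struct toy_page { void *mapping; };
struct toy_folio { struct toy_page page; /* head page embedded first */ };

/* Legacy page-based setter, loosely modeled on set_page_stable_node(). */
static inline void toy_set_page_stable_node(struct toy_page *page,
                                            struct toy_stable_node *node)
{
    page->mapping = node;
}

/*
 * Transitional folio-based wrapper: callers that already hold a folio
 * pass it directly, so no head-page lookup is needed on their side.
 */
static inline void toy_folio_set_stable_node(struct toy_folio *folio,
                                             struct toy_stable_node *node)
{
    toy_set_page_stable_node(&folio->page, node);
}
```

Once every caller uses the folio variant, the page-based function can be
folded into it, which is the merge step described above.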

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi (tencent) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Izik Eidus <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Chris Wright <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/ksm: use folio in remove_rmap_item_from_tree

To save 2 compound_head calls.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi (tencent) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Izik Eidus <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Chris Wright <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/ksm: add ksm_get_folio

Patch series "transfer page to folio in KSM".

This is the first part of the page-to-folio conversion in KSM. Since only
single pages can be stored in KSM, we can safely convert stable tree
pages to folios.

This patchset reduces ksm.o by 57 kbytes, from 2541776 bytes, on the latest
akpm/mm-stable branch with CONFIG_DEBUG_VM enabled. It passes the KSM
tests in LTP and the kernel selftests.

Thanks to Matthew Wilcox and David Hildenbrand for their suggestions and
comments!

This patch (of 10):

KSM only contains single pages, so add a new function, ksm_get_folio(),
for get_ksm_page() to use. Working on folios instead of pages saves a
couple of compound_head() calls.

get_ksm_page() will be removed once all callers have been converted.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi (tencent) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Izik Eidus <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Chris Wright <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm: mm: drop VM_FAULT_BADMAP/VM_FAULT_BADACCESS

On a bad mapping or bad access, directly set code to SEGV_MAPERR or
SEGV_ACCERR, set fault to 0 and go to the error handling. This lets us
drop the arch's private VM fault reasons.

[[email protected]: coding-style cleanups]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Aishwarya TCV <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Cristian Marussi <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64: mm: drop VM_FAULT_BADMAP/VM_FAULT_BADACCESS

Patch series "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS", v2.

Directly set SEGV_MAPERR or SEGV_ACCERR for arm/arm64 to remove the last
two arch-private vm_fault reasons.

This patch (of 2):

On a bad mapping or bad access, directly set si_code to SEGV_MAPERR or
SEGV_ACCERR, set fault to 0 and go to the error handling. This lets us
drop the arch's private VM fault reasons.
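
As a rough userspace illustration of the idea (not the arch fault
handler itself), the signal code can be chosen directly at the point
where the check fails, instead of returning a private fault reason that
is translated later. The function name and its two boolean parameters are
hypothetical; SEGV_MAPERR and SEGV_ACCERR are the standard <signal.h>
constants.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/*
 * Returns the si_code to report with SIGSEGV, or 0 if neither check
 * failed and the fault may be handled normally.
 *
 * vma_found:      a mapping covers the faulting address
 * access_allowed: that mapping permits this kind of access
 */
static inline int toy_segv_code(bool vma_found, bool access_allowed)
{
    if (!vma_found)
        return SEGV_MAPERR;   /* address not mapped to an object */
    if (!access_allowed)
        return SEGV_ACCERR;   /* invalid permissions for the mapping */
    return 0;
}
```

With the code decided up front, no intermediate "bad map" or "bad access"
value has to travel through the fault return path.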

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Aishwarya TCV <[email protected]>
Cc: Cristian Marussi <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount()

Let's stop talking about page_mapcount().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()

Let's simplify and only print the page mapcount: we already print the
large folio mapcount and the entire folio mapcount for large folios
separately; that should be sufficient to figure out what's happening.

While at it, print the page mapcount also if it had an underflow,
filtering out only typed pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

xtensa/mm: convert check_tlb_entry() to sanity check folios

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. So let's convert check_tlb_entry() to perform
sanity checks on folios instead of pages.

This essentially already happened: page_count() is mapped to
folio_ref_count(), and page_mapped() to folio_mapped() internally.
However, we would have printed the page_mapcount(), which does not really
match what page_mapped() would have checked.

Let's simply print the folio mapcount to avoid using page_mapcount(). For
small folios there is no change.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

trace/events/page_ref: trace the raw page mapcount value

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.  We already trace raw page->refcount, raw
page->flags and raw page->mapping, and don't involve any folios.  Let's
also trace the raw mapcount value that does not consider the entire
mapcount of large folios, and we don't add "1" to it.

When dealing with typed folios, this makes a lot more sense.  ...  and
it's for debugging purposes only either way.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. Let's convert migrate_vma_check_page() to work on a
folio internally so we can remove the page_mapcount() usage.

Note that we reject any large folios.

There is a lot more folio conversion to be had, but that has to wait for
another day. No functional change intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/filemap: use folio_mapcount() in filemap_unaccount_folio()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

Let's use folio_mapcount() instead of page_mapcount() in
filemap_unaccount_folio().

No functional change intended, because we're only dealing with small
folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

sh/mm/cache: use folio_mapped() in copy_from_user_page()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

We're already using folio_mapped() in copy_user_highpage() and
copy_to_user_page() for a similar purpose so ... let's also simply use it
for copy_from_user_page().

There is no change for small folios. Likely we won't stumble over many
large folios on sh in that code either way.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. In add_page_for_migration(), we actually want to
check if the folio is mapped shared, to reject such folios. So let's use
folio_likely_mapped_shared() instead.

For small folios, fully mapped THP, and hugetlb folios, there is no change.
For partially mapped, shared THP, we should now do a better job at
rejecting such folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

For tracing purposes, we use page_mapcount() in
__alloc_contig_migrate_range(). Adding that mapcount to total_mapped
sounds strange: total_migrated and total_reclaimed would count each page
only once, not multiple times.

But then, isolate_migratepages_range() adds each folio only once to the
list. So for large folios, we would query the mapcount of the first page
of the folio, which doesn't make too much sense for large folios.

Let's simply use folio_mapped() * folio_nr_pages(), which makes more sense
as nr_migratepages is also incremented by the number of pages in the folio
in case of successful migration.
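
The accounting described above can be sketched in userspace as follows.
The toy_folio structure and toy_total_mapped() are hypothetical stand-ins
for illustration; its two fields play the role of what folio_mapped() and
folio_nr_pages() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a folio on the isolation list. */
struct toy_folio {
    bool mapped;            /* what folio_mapped() would report */
    unsigned long nr_pages; /* what folio_nr_pages() would report */
};

/* Each folio contributes all of its pages if mapped, otherwise nothing. */
static inline unsigned long toy_total_mapped(const struct toy_folio *folios,
                                             size_t n)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n; i++)
        total += (unsigned long)folios[i].mapped * folios[i].nr_pages;
    return total;
}
```

For a list holding a mapped small folio, an unmapped 16-page folio and a
mapped 512-page folio, this counts 1 + 0 + 512 pages, in the same unit
(pages) as the migrated and reclaimed totals.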

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. We can only unmap full folios; page_mapped(), which
we check here, is translated to folio_mapped() -- based on
folio_mapcount(). So let's print the folio mapcount instead.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. Let's similarly check for folio_mapcount()
underflows instead of page_mapcount() underflows like we do in
zap_present_folio_ptes() now.

Instead of the VM_BUG_ON(), we should actually be doing something like
print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(),
performing that check independently of DEBUG_VM.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/memory: use folio_mapcount() in zap_present_folio_ptes()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.  In zap_present_folio_ptes(), let's simply check
folio_mapcount().  If there is some issue, it will underflow at some point
either way when unmapping.

As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize
unmap/zap with PTE-mapped THP"), we already documented "If we ever have a
cheap folio_mapcount(), we might just want to check for underflows
there.".

There is no change for small folios.  For large folios, we'll now catch
more underflows when batch-unmapping, because instead of only testing the
mapcount of the first subpage, we'll test if the folio mapcount
underflows.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: make folio_mapcount() return 0 for small typed folios

We already handle it properly for large folios. Let's also return "0" for
small typed folios, like page_mapcount() currently would.

Consequently, folio_mapcount() will never return negative values for typed
folios, but may return negative values for underflows.

[[email protected]: make folio_mapcount() slightly more efficient]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: improve folio_likely_mapped_shared() using the mapcount of large folios

We can now read the mapcount of large folios very efficiently. Use it to
improve our handling of partially-mappable folios, falling back to making
a guess only in case the folio is not "obviously mapped shared".

We can now better detect partially-mappable folios where the first page is
not mapped as "mapped shared", reducing "false negatives"; but false
negatives are still possible.
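
The decision order described above can be modeled in userspace roughly
like this. It is a sketch based on a reading of this description, not
the kernel implementation; the toy_folio structure and all field names
are assumptions, and counts are plain zero-based integers.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_folio {
    int nr_pages;            /* 1 for a small folio */
    bool hugetlb;            /* hugetlb folios are only mapped as a whole */
    int mapcount;            /* total number of mappings of the folio */
    int entire_mapcount;     /* mappings of the folio as a whole */
    int first_page_mapcount; /* mappings of the first page only */
};

static inline bool toy_likely_mapped_shared(const struct toy_folio *f)
{
    /* Not partially mappable: the total mapcount is conclusive. */
    if (f->nr_pages == 1 || f->hugetlb)
        return f->mapcount > 1;

    /* A single mapping implies "mapped exclusively". */
    if (f->mapcount <= 1)
        return false;

    /* Some page must be mapped more than once: "obviously mapped shared". */
    if (f->entire_mapcount > 0 || f->mapcount > f->nr_pages)
        return true;

    /* Otherwise guess from the first page; sharing can still be missed. */
    return f->first_page_mapcount > 1;
}
```

A 16-page folio whose second page is mapped twice while the first page is
unmapped has a total mapcount of 2, so it falls through to the guess and
is reported as not shared: one of the remaining false negatives.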

While at it, fixup a wrong comment (false positive vs. false negative)
for KSM folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: track mapcount of large folios in single value

Let's track the mapcount of large folios in a single value.  The mapcount
of a large folio currently corresponds to the sum of the entire mapcount
and all page mapcounts.

This sum is what we actually want to know in folio_mapcount() and it is
also sufficient for implementing folio_mapped().

With PTE-mapped THP becoming more important and more widely used, we want
to avoid looping over all pages of a folio just to obtain the mapcount of
large folios.  The comment "In the common case, avoid the loop when no
pages mapped by PTE" in folio_total_mapcount() no longer holds for
mTHP that are always mapped by PTE.

Further, we are planning on using folio_mapcount() more frequently, and
might even want to remove page mapcounts for large folios in some kernel
configs.  Therefore, allow for reading the mapcount of large folios
efficiently and atomically without looping over any pages.
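
To illustrate the trade-off, here is a simplified userspace model. The
toy_* names, the fixed folio size and the zero-based counters are
assumptions made for the sketch; the kernel's fields and encodings
differ. The model keeps per-page and entire mapcounts, plus one running
total that is adjusted once per batch.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_PAGES 4

struct toy_large_folio {
    atomic_int large_mapcount;              /* single running total */
    atomic_int entire_mapcount;             /* whole-folio mappings */
    atomic_int page_mapcount[TOY_NR_PAGES]; /* per-page mappings */
};

/* Map nr consecutive pages by PTE, starting at page index first. */
static inline void toy_map_pages(struct toy_large_folio *f, int first, int nr)
{
    assert(first >= 0 && nr >= 0 && first + nr <= TOY_NR_PAGES);
    for (int i = first; i < first + nr; i++)
        atomic_fetch_add(&f->page_mapcount[i], 1);
    /* One additional atomic per batch, regardless of nr. */
    atomic_fetch_add(&f->large_mapcount, nr);
}

/* Map the folio as a whole. */
static inline void toy_map_entire(struct toy_large_folio *f)
{
    atomic_fetch_add(&f->entire_mapcount, 1);
    atomic_fetch_add(&f->large_mapcount, 1);
}

/* Old-style read: entire mapcount plus a loop over all pages. */
static inline int toy_mapcount_loop(struct toy_large_folio *f)
{
    int sum = atomic_load(&f->entire_mapcount);

    for (int i = 0; i < TOY_NR_PAGES; i++)
        sum += atomic_load(&f->page_mapcount[i]);
    return sum;
}

/* New-style read: a single atomic load. */
static inline int toy_mapcount_single(struct toy_large_folio *f)
{
    return atomic_load(&f->large_mapcount);
}
```

Both reads agree when nothing changes concurrently, but only the
single-value read is one atomic operation; the loop can observe a mix of
old and new per-page values while mappings change.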

Maintain the mapcount also for hugetlb pages for simplicity.  Use the new
mapcount to implement folio_mapcount() and folio_mapped().  Make
page_mapped() simply call folio_mapped().  We can now get rid of
folio_large_is_mapped().

_nr_pages_mapped is now only used in rmap code and for debugging purposes.
Keep folio_nr_pages_mapped() around, but document that its use should be
limited to rmap internals and debugging purposes.

This change implies one additional atomic add/sub whenever
mapping/unmapping (parts of) a large folio.

As we now batch RMAP operations for PTE-mapped THP during fork(), during
unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the
large mapcount for a PTE batch only once, the added overhead in the common
case is small.  Only when unmapping individual pages of a large folio
(e.g., during COW), the overhead might be bigger in comparison, but it's
essentially one additional atomic operation.

Note that before the new mapcount would overflow, already our refcount
would overflow: each mapping requires a folio reference.  Extend the
documentation of folio_mapcount().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/rmap: add fast-path for small folios when adding/removing/duplicating

Let's add a fast-path for small folios to all relevant rmap functions.
Note that only RMAP_LEVEL_PTE applies.

This is a preparation for tracking the mapcount of large folios in a
single value.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/rmap: always inline anon/file rmap duplication of a single PTE

As we grow the code, the compiler might make stupid decisions and
unnecessarily degrade fork() performance. Let's make sure to always
inline functions that operate on a single PTE so the compiler will always
optimize out the loop and avoid a function call.
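
The effect can be sketched in userspace with the GCC/Clang always_inline
attribute. The toy_* helpers are hypothetical; the point is only that a
single-entry wrapper passes a compile-time constant of 1 into a batched,
forcibly inlined helper, so the compiler can typically drop the loop.

```c
#include <assert.h>

/* GCC/Clang extension; the kernel has its own __always_inline macro. */
#define toy_always_inline inline __attribute__((__always_inline__))

/* Batched helper: bump nr consecutive counters. */
static toy_always_inline void toy_dup_ptes(int *mapcounts, int nr)
{
    for (int i = 0; i < nr; i++)
        mapcounts[i]++;
}

/*
 * Single-entry wrapper: nr is the constant 1, so after forced inlining
 * the loop above can collapse to a single increment without a call.
 */
static toy_always_inline void toy_dup_pte(int *mapcount)
{
    toy_dup_ptes(mapcount, 1);
}
```

Without the attribute, inlining is left to the compiler's heuristics,
which may change as the helper grows.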

This is a preparation for maintaining a total mapcount for large folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: allow for detecting underflows with page_mapcount() again

Patch series "mm: mapcount for large folios + page_mapcount() cleanups".

This series tracks the mapcount of large folios in a single value, so it
can be read efficiently and atomically, just like the mapcount of small
folios.

folio_mapcount() is then used in a couple more places, most notably to
reduce false negatives in folio_likely_mapped_shared(), and many users of
page_mapcount() are cleaned up (that's maybe why you got CCed on the full
series, sorry sh+xtensa folks!  :) ).

The remaining s390x user and one KSM user of page_mapcount() are getting
removed separately on the list right now.  I have patches to handle the
other KSM one, the khugepaged one and the kpagecount one; as they are not
as "obvious", I will send them out separately in the future.  Once that is
all in place, I'm planning on moving page_mapcount() into
fs/proc/task_mmu.c, the remaining user for the time being (and we can
discuss at LSF/MM details on that :) ).

I proposed the mapcount for large folios (previously called total
mapcount) originally in part of [1] and I later included it in [2] where
it is a requirement.  In the meantime, I changed the patch a bit so I
dropped all RB's.  During the discussion of [1], Peter Xu correctly raised
that this additional tracking might affect the performance when PMD->PTE
remapping THPs.  In the meantime, I addressed that by batching RMAP
operations during fork(), unmap/zap and when PMD->PTE remapping THPs.

Running some of my micro-benchmarks [3] (fork,munmap,cow-byte,remap) on 1
GiB of memory backed by folios with the same order, I observe the
following on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz tuned for
reproducible results as much as possible:

Standard deviation is mostly < 1%, except for order-9, where it's < 2% for
fork() and munmap().

(1) Small folios are not affected (< 1%) in all 4 microbenchmarks.
(2) Order-4 folios are not affected (< 1%) in all 4 microbenchmarks. A bit
    weird compared to the other orders ...
(3) PMD->PTE remapping of order-9 THPs is not affected (< 1%)
(4) COW-byte (COWing a single page by writing a single byte) is not
    affected for any order (< 1 %). The page copy_fault overhead dominates
    everything.
(5) fork() is mostly not affected (< 1%), except order-2, where we have
    a slowdown of ~4%. Already for order-3 folios, we're down to a slowdown
    of < 1%.
(6) munmap() sees a slowdown by < 3% for some orders (order-5,
    order-6, order-9), but less for others (< 1% for order-4 and order-8,
    < 2% for order-2, order-3, order-7).

Especially the fork() and munmap() benchmarks are sensitive to each added
instruction and other system noise, so I suspect some of the change and
observed weirdness (order-4) is due to code layout changes and other
factors, but not really due to the added atomics.

So in the common case where we can batch, the added atomics don't really
make a big difference, especially in light of the recent improvements for
large folios that we recently gained due to batching.  Surprisingly, for
some cases where we cannot batch (e.g., COW), the added atomics don't seem
to matter, because other overhead dominates.

My fork and munmap micro-benchmarks don't cover cases where we cannot
batch-process bigger parts of large folios.  As this is not the common
case, I'm not worrying about that right now.

Future work is batching RMAP operations during swapout and folio
migration.

[1] https://lore.kernel.org/all/20230809083256 [email protected]/
[2] https://lore.kernel.org/all/20231124132626 [email protected]/
[3] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads

This patch (of 18):

Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type()
pages") made it impossible to detect mapcount underflows by treating any
negative raw mapcount value as a mapcount of 0.

We perform such underflow checks in zap_present_folio_ptes() and
zap_huge_pmd(), which would currently no longer trigger.

Let's check against PAGE_MAPCOUNT_RESERVE instead by using
page_type_has_type(), like page_has_type() would, so we can still catch
some underflows.
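
A small userspace model of the difference follows. The reserve value of
-128 and the function names are assumptions made for the sketch; the raw
value is stored as "mapcount - 1", so an unmapped page has a raw value
of -1.

```c
#include <assert.h>

/* Assumed threshold below which a raw value encodes a page type. */
#define TOY_MAPCOUNT_RESERVE (-128)

/* Before: every negative result is clamped, so underflows vanish. */
static inline int toy_mapcount_before(int raw)
{
    int mapcount = raw + 1;

    return mapcount < 0 ? 0 : mapcount;
}

/* After: only values in the reserved "typed" range are reported as 0. */
static inline int toy_mapcount_after(int raw)
{
    if (raw < TOY_MAPCOUNT_RESERVE)
        return 0;
    return raw + 1;
}
```

A raw value of -2 (one unmap too many) reads as 0 in the old model and
as -1 in the new one, so a sanity check on negative mapcounts can fire
again. Underflows that run into the reserved range would still be
hidden, hence "some" underflows.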

[[email protected]: make page_mapcount() slightly more efficient]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 53277bcf126d ("mm: support page_mapcount() on page_has_type() pages")
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: follow_pte() improvements

follow_pte() is now our main function to lookup PTEs in VM_PFNMAP/VM_IO
VMAs.  Let's perform some more sanity checks to make this exported
function harder to abuse.

Further, extend the doc a bit, it still focuses on the KVM use case with
MMU notifiers.  Drop the KVM+follow_pfn() comment, follow_pfn() is no
more, and we have other users nowadays.

Also extend the doc regarding refcounted pages and the interaction with
MMU notifiers.

KVM is one example that uses MMU notifiers and can deal with refcounted
pages properly.  VFIO is one example that doesn't use MMU notifiers, and
to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) &&
!PageReserved(pfn_to_page(pfn)).  Protection changes are less of a concern
for users like VFIO: the behavior is similar to longterm-pinning a page,
and getting the PTE protection changed afterwards.

The primary concern with refcounted pages is use-after-free, which callers
should be aware of.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Fei Li <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Yonghua Huang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: pass VMA instead of MM to follow_pte()

... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll
now also perform these sanity checks for direct follow_pte()
invocations.

For generic_access_phys(), we might now check multiple times: nothing to
worry about, really.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Sean Christopherson <[email protected]> [KVM]
Cc: Alex Williamson <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Fei Li <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Yonghua Huang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

drivers/virt/acrn: fix PFNMAP PTE checks in acrn_vm_ram_map()

Patch series "mm: follow_pte() improvements and acrn follow_pte() fixes".

Patch #1 fixes a bunch of issues I spotted in the acrn driver.  It
compiles, that's all I know.  I'll appreciate some review and testing from
acrn folks.

Patch #2+#3 improve follow_pte(), passing a VMA instead of the MM, adding
more sanity checks, and improving the documentation.  Gave it a quick test
on x86-64 using VM_PAT that ends up using follow_pte().

This patch (of 3):

We currently miss handling various cases, resulting in a dangerous
follow_pte() (previously follow_pfn()) usage.

(1) We're not checking PTE write permissions.

Maybe we should simply always require pte_write() like we do for
pin_user_pages_fast(FOLL_WRITE)? Hard to tell, so let's check for
ACRN_MEM_ACCESS_WRITE for now.

(2) We're not rejecting refcounted pages.

As we are not using MMU notifiers, messing with refcounted pages is
dangerous and can result in use-after-free. Let's make sure to reject them.

(3) We are only looking at the first PTE of a bigger range.

We only lookup a single PTE, but memmap->len may span a larger area.
Let's loop over all involved PTEs and make sure the PFN range is
actually contiguous. Reject everything else: it couldn't have worked
either way, and rather made us access PFNs we shouldn't be accessing.
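
The three checks can be sketched together in a simplified userspace model.
This is not the driver code: struct model_pte and model_check_range() are
hypothetical, and a real implementation would look up each PTE under the
appropriate locks instead of receiving an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one looked-up PTE. */
struct model_pte {
	unsigned long pfn;
	bool writable;		/* PTE grants write permission */
	bool refcounted;	/* PFN is backed by a refcounted page */
};

/*
 * Validate all PTEs covering a range.  Returns 0 and stores the first PFN
 * in *start_pfn if every entry passes; returns -1 otherwise.
 */
static int model_check_range(const struct model_pte *ptes, size_t n,
			     bool want_write, unsigned long *start_pfn)
{
	size_t i;

	if (n == 0)
		return -1;

	for (i = 0; i < n; i++) {
		/* (1) Write access requires a writable PTE. */
		if (want_write && !ptes[i].writable)
			return -1;
		/* (2) Without MMU notifiers, refcounted pages are rejected. */
		if (ptes[i].refcounted)
			return -1;
		/* (3) The PFN range must be contiguous. */
		if (ptes[i].pfn != ptes[0].pfn + i)
			return -1;
	}

	*start_pfn = ptes[0].pfn;
	return 0;
}
```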

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8a6e85f75a83 ("virt: acrn: obtain pa from VMA with PFNMAP flag")
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Fei Li <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Yonghua Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,swap: add document about RCU read lock and swapoff interaction

While reviewing a patch to fix the race condition between
free_swap_and_cache() and swapoff() [1], it was found that the
documentation about how to prevent racing with swapoff isn't clear enough.
In particular, it isn't clear that the RCU read lock can prevent swapoff
from freeing data structures.  So, the documentation is added as comments.

[1] https://lore.kernel.org/linux-mm/c8fe62d0-78b8-527a-5bef-ee663ccdc37a@huawei.com/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmap: make accountable_mapping return bool

accountable_mapping() can return bool, so change it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Ge <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmap: make vma_wants_writenotify return bool

vma_wants_writenotify() should return bool, so change it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Ge <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

memory tier: create CPUless memory tiers after obtaining HMAT info

The current implementation treats emulated memory devices, such as CXL1.1
type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM).  However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them.  Thus, we modify the tiered memory initialization process to
introduce a delay specifically for CPUless NUMA nodes.  This delay ensures
that the memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process.  Finally, demotion tables
are recalculated at the end.

* late_initcall(memory_tier_late_init);
  Some device drivers may have initialized memory tiers between
  `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
  online memory nodes and configuring memory tiers.  They should be
  excluded in the late init.

* Handle cases where there is no HMAT when creating memory tiers
  There is a scenario where a CPUless node does not provide HMAT
  information.  If no HMAT is specified, it falls back to using the
  default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist
  calculation
  In the current implementation, iterating through CPUless
  nodes requires holding the `memory_tier_lock`.  However,
  `mt_calc_adistance()` will end up trying to acquire the same lock,
  leading to a potential deadlock.  Therefore, we propose introducing a
  standalone `default_dram_perf_lock` to protect `default_dram_perf_*`.
  This approach not only avoids deadlock but also prevents holding a large
  lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.  To cover
  hot-plugged memory nodes, `mt_calc_adistance()` and
  `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
  handle cases where memtype is not initialized and where HMAT information
  is available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.  Because late initialized memory and
  default DRAM memory need to be managed, a default memory type is created
  for storing all memory types that are not initialized by device drivers
  and as a fallback.
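
The locking problem described for `default_dram_perf_lock` can be
illustrated with a toy model.  This is a sketch only: the model_* names are
hypothetical, and the toy lock reports failure where a real non-recursive
mutex would deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: a second acquire fails instead of hanging. */
struct model_lock {
	bool held;
};

static bool model_trylock(struct model_lock *lock)
{
	if (lock->held)
		return false;
	lock->held = true;
	return true;
}

static void model_unlock(struct model_lock *lock)
{
	lock->held = false;
}

/* Stand-in for an adistance calculation that takes perf_lock internally. */
static int model_calc_adistance(struct model_lock *perf_lock)
{
	if (!model_trylock(perf_lock))
		return -1;	/* would self-deadlock with a real mutex */
	model_unlock(perf_lock);
	return 0;
}

/* Walk nodes under tier_lock, calling the calculation for each one. */
static int model_init_nodes(struct model_lock *tier_lock,
			    struct model_lock *perf_lock, int nr_nodes)
{
	int node, ret = 0;

	if (!model_trylock(tier_lock))
		return -1;
	for (node = 0; node < nr_nodes && ret == 0; node++)
		ret = model_calc_adistance(perf_lock);
	model_unlock(tier_lock);
	return ret;
}
```

Passing the same lock for both roles makes model_init_nodes() fail, while
a standalone second lock lets the walk complete.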

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Signed-off-by: Hao Xiang <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ravi Jonnalagadda <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

Patch series "Improved Memory Tier Creation for CPUless NUMA Nodes", v11.

When a memory device, such as CXL1.1 type3 memory, is emulated as normal
memory (E820_TYPE_RAM), the memory device is indistinguishable from normal
DRAM in terms of memory tiering with the current implementation.  The
current memory tiering assigns all detected normal memory nodes to the
same DRAM tier.  This results in normal memory devices with different
attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different types of
memory.
https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues.  It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention.  If no HMAT is specified, it
falls back to using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory.  We inject CXL memory performance
attributes through QEMU, and the guest now sees memory nodes with
performance attributes in HMAT.  With this change, we enable the guest
kernel to construct the correct memory tiering for the memory nodes.

This patch (of 2):

Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch, enhancing
the scalability and conciseness of the code.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Hao Xiang <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ravi Jonnalagadda <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: set pageblock_order to HPAGE_PMD_ORDER in case with !CONFIG_HUGETLB_PAGE but THP enabled

As Vlastimil suggested in previous discussion[1], it doesn't make sense to
set pageblock_order as MAX_PAGE_ORDER when hugetlbfs is not enabled and
THP is enabled. Instead, it should be set to HPAGE_PMD_ORDER.
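
The resulting selection can be sketched as a plain function.  This is a
simplified model, not the kernel definition: the config options become
boolean parameters, the clamp against the maximum order is an assumption,
and configurations with a variable hugetlb page size are ignored.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int model_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Pick a pageblock order from the enabled features:
 *  - hugetlb enabled:       follow the hugetlb page order
 *  - only THP enabled:      follow the PMD-sized THP order (the new case)
 *  - neither enabled:       fall back to the maximum page order
 */
static unsigned int model_pageblock_order(bool hugetlb, bool thp,
					  unsigned int hugetlb_order,
					  unsigned int pmd_order,
					  unsigned int max_order)
{
	if (hugetlb)
		return model_min(hugetlb_order, max_order);
	if (thp)
		return model_min(pmd_order, max_order);
	return max_order;
}
```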

[1] https://lore.kernel.org/all/76457ec5-d789-449b-b8ca-dcb6ceb12445@suse.cz/
Link: https://lkml.kernel.org/r/3d57d253070035bdc0f6d6e5681ce1ed0e1934f7.1712286863.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: convert free_zone_device_page to free_zone_device_folio

Both callers already have a folio; pass it in and save a few calls to
compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: combine __folio_put_small, __folio_put_large and __folio_put

It's now obvious that __folio_put_small() and __folio_put_large() do
almost exactly the same thing. Inline them both into __folio_put().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: inline destroy_large_folio() into __folio_put_large()

destroy_large_folio() has only one caller, move its contents there.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: combine free_the_page() and free_unref_page()

The pcp_allowed_order() check in free_the_page() was only being skipped by
__folio_put_small() which is about to be rearranged.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: free non-hugetlb large folios in a batch

Patch series "Clean up __folio_put()".

With all the changes over the last few years, __folio_put_small and
__folio_put_large have become almost identical to each other ...  except
you can't tell because they're spread over two files.  Rearrange it all so
that you can tell, and then inline them both into __folio_put().

This patch (of 5):

free_unref_folios() can now handle non-hugetlb large folios, so keep
normal large folios in the batch.  hugetlb folios still need to be handled
specially.

[[email protected]: fix panic]
Link: https://lkml.kernel.org/r/ZikjPB0Dt5HA8-uL@x1n
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: convert pagecache_isize_extended to use a folio

Remove four hidden calls to compound_head(). Also exit early if the
filesystem block size is >= PAGE_SIZE instead of just equal to PAGE_SIZE.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Pankaj Raghav <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid

The hugetlb_cma code passes 0 in the order_per_bit argument to
cma_declare_contiguous_nid (the alignment, computed using the page order,
is correctly passed in).

This causes a bit in the cma allocation bitmap to always represent a 4k
page, making the bitmaps potentially very large, and slower.

It would create bitmaps that would be pretty big.  E.g.  for a 4k page
size on x86, hugetlb_cma=64G would mean a bitmap size of (64G / 4k) / 8
== 2M.  With HUGETLB_PAGE_ORDER as order_per_bit, as intended, this
would be (64G / 2M) / 8 == 4k.  So, that's quite a difference.

Also, this restricted the hugetlb_cma area to ((PAGE_SIZE <<
MAX_PAGE_ORDER) * 8) * PAGE_SIZE (e.g.  128G on x86) , since
bitmap_alloc uses normal page allocation, and is thus restricted by
MAX_PAGE_ORDER.  Specifying anything above that would fail the CMA
initialization.
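
The arithmetic above can be reproduced with two small helpers.  They are a
sketch only: the model_* names are hypothetical, and the x86 values used in
the usage note (4k pages, order 9 for 2M huge pages, a maximum page order of
10) are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of bitmap needed for an area when each bit covers
 * (page_size << order_per_bit) bytes.
 */
static uint64_t model_cma_bitmap_bytes(uint64_t area_bytes, uint64_t page_size,
				       unsigned int order_per_bit)
{
	uint64_t bytes_per_bit = page_size << order_per_bit;
	uint64_t bits = area_bytes / bytes_per_bit;

	return bits / 8;
}

/*
 * Largest area coverable with one bit per page when the bitmap itself
 * cannot exceed (page_size << max_order) bytes.
 */
static uint64_t model_cma_max_area(uint64_t page_size, unsigned int max_order)
{
	uint64_t max_bitmap_bytes = page_size << max_order;

	return max_bitmap_bytes * 8 * page_size;
}
```

With those assumed values, model_cma_bitmap_bytes() gives 2M for a 64G area
at order 0 and 4k at order 9, and model_cma_max_area() gives 128G.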

So, correctly pass in the order instead.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
Signed-off-by: Frank van der Linden <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/cma: drop incorrect alignment check in cma_init_reserved_mem

cma_init_reserved_mem uses IS_ALIGNED to check if the size represented by
one bit in the cma allocation bitmask is aligned with
CMA_MIN_ALIGNMENT_BYTES (pageblock size).

However, this is too strict, as this will fail if order_per_bit >
pageblock_order, which is a valid configuration.

We could check IS_ALIGNED both ways, but since both numbers are powers of
two, no check is needed at all.
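
The power-of-two argument can be checked mechanically.  The helpers below
are a hypothetical userspace sketch; model_is_aligned() mirrors the usual
mask-based alignment test and is only valid for power-of-two alignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask-based alignment test; `a` must be a power of two. */
static bool model_is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/* True if either value is aligned to the other. */
static bool model_aligned_either_way(uint64_t a, uint64_t b)
{
	return model_is_aligned(a, b) || model_is_aligned(b, a);
}

/*
 * Check every pair of powers of two up to 2^max_shift (max_shift <= 63):
 * the larger one is always a multiple of the smaller one.
 */
static bool model_check_all_pow2_pairs(unsigned int max_shift)
{
	unsigned int i, j;

	for (i = 0; i <= max_shift; i++)
		for (j = 0; j <= max_shift; j++)
			if (!model_aligned_either_way(UINT64_C(1) << i,
						      UINT64_C(1) << j))
				return false;
	return true;
}
```

A one-directional check fails as soon as the alignment exceeds the value
being tested, for example 512 pages checked against 1 << 18, even though the
reverse direction holds.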

Link: https://lkml.kernel.org/r/[email protected]
Fixes: de9e14eebf33 ("drivers: dma-contiguous: add initialization from device tree")
Signed-off-by: Frank van der Linden <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: fix additional build errors for selftests

These build errors only occur if one fails to first run "make headers".
However, that is a non-obvious and intrusive requirement, and so there
was a discussion on how to get rid of it [1]. This uses that solution.

These two files were created by taking a snapshot of the generated header
files that are created via "make headers". These two files were copied
from ./usr/include/linux/ to ./tools/include/uapi/linux/ .

That fixes the selftests/mm build on today's Arch Linux (which required
the userfaultfd.h) and Ubuntu 23.04 (which additionally required memfd.h).

[1] https://lore.kernel.org/all/783a4178-1dec-4e30-989a-5174b8176b09@redhat.com/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests: break the dependency upon local header files

Patch series "Fix selftests/mm build without requiring "make headers"".

As mentioned in each patch, this implements the solution that we discussed
in December 2023, in [1].  This turned out to be very clean and easy.  It
should also be quite easy to maintain.

This should also make Peter Zijlstra happy, because it directly addresses
the root cause of his "NAK NAK NAK" reply [2].  :)

[1] https://lore.kernel.org/all/783a4178-1dec-4e30-989a-5174b8176b09@redhat.com/
[2] https://lore.kernel.org/lkml/20231103121652 [email protected]/

This patch (of 2):

Use tools/include/uapi/ files instead.  These are obtained by taking a
snapshot: run "make headers" at the top level, then copy the desired
header file into the appropriate subdir in tools/include/uapi/.

This was discussed and solved in [1].

However, even before copying any additional files there, there are already
quite a few in tools/include/uapi.  And these will immediately fix
a number of selftests/mm build failures.

So this patch:

a) Adds TOOLS_INCLUDES to selftests/lib.mk, so that all selftests can
   immediately and easily include the snapshotted header files.

b) Uses $(TOOLS_INCLUDES) in the selftests/mm build.  On today's Arch
   Linux, this already fixes all build errors except for a few
   userfaultfd.h (those will be addressed in a subsequent patch).

[1] https://lore.kernel.org/all/783a4178-1dec-4e30-989a-5174b8176b09@redhat.com/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hugetlb: convert hugetlb_wp() to use struct vm_fault

hugetlb_wp() can use the struct vm_fault passed in from hugetlb_fault().
This alleviates the stack by consolidating 5 variables into a single
struct.

[[email protected]: simplify hugetlb_wp() arguments]
Link: https://lkml.kernel.org/r/ZhQtoFNZBNwBCeXn@fedora
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hugetlb: convert hugetlb_no_page() to use struct vm_fault

hugetlb_no_page() can use the struct vm_fault passed in from
hugetlb_fault(). This alleviates the stack by consolidating 7
variables into a single struct.

[[email protected]: simplify hugetlb_no_page() arguments]
Link: https://lkml.kernel.org/r/ZhQtN8y5zud8iI1u@fedora
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hugetlb: convert hugetlb_fault() to use struct vm_fault

Patch series "Hugetlb fault path to use struct vm_fault", v2.

This patchset converts the hugetlb fault path to use struct vm_fault.
This helps make the code more readable, and alleviates the stack by
allowing us to consolidate many fault-related variables into an individual
pointer.

This patch (of 3):

Now that hugetlb_fault() has a vm_fault available for fault tracking, use
it throughout. This cleans up the code by removing 2 variables, and
prepares hugetlb_fault() to take in a struct vm_fault argument.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/ksm: remove redundant code in ksm_fork

Since commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl"), when a
child process is forked, the MMF_VM_MERGE_ANY flag will be inherited in
mm_init(). So, it's unnecessary to set the flag in ksm_fork().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: use "GUP-fast" instead of "fast GUP" in remaining comments

Let's fix up the remaining comments so that we consistently call that
thing "GUP-fast".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST

Nowadays, we call it "GUP-fast", the external interface includes functions
like "get_user_pages_fast()", and we renamed all internal functions to
reflect that as well.

Let's make the config option reflect that.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/gup: consistently name GUP-fast functions

Patch series "mm/gup: consistently call it GUP-fast".

Some cleanups around function names, comments and the config option of
"GUP-fast" -- GUP without "lock" safety belts on.

With this cleanup it's easy to judge which functions are GUP-fast
specific.  We now consistently call it "GUP-fast", avoiding mixing it with
"fast GUP", "lockless", or simply "gup" (which I always considered
confusing in the code).

So the magic now happens in functions that contain "gup_fast", whereby
gup_fast() is the entry point into that magic.  Comments consistently
reference either "GUP-fast" or "gup_fast()".

This patch (of 3):

Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP.  The current mixture of "lockless",
"gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for example,
we nowadays check pmd_leaf() because pmd_huge() is gone.  For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

The high-level internal functions for GUP-fast (+slow fallback) are now:
* internal_get_user_pages_fast() -> gup_fast_fallback()
* lockless_pages_from_mm() -> gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd()  -> gup_fast_pgd_leaf()
* gup_huge_pud()  -> gup_fast_pud_leaf()
* gup_huge_pmd()  -> gup_fast_pmd_leaf()

The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()

The weird devmap stuff:
* __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf()
* __gup_device_huge_pmd() -> gup_fast_devmap_pmd_leaf()
* __gup_device_huge()     -> gup_fast_devmap_leaf()
* undo_dev_pagemap()      -> gup_fast_undo_dev_pagemap()

Helper functions:
* unpin_user_pages_lockless() -> gup_fast_unpin_user_pages()
* gup_fast_folio_allowed() is already properly named
* gup_fast_permitted() is already properly named

With "gup_fast()", we now even have a function that is referred to in a
comment in mm/mmu_gather.c.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hugetlb: convert alloc_buddy_hugetlb_folio to use a folio

While this function returned a folio, it was still using __alloc_pages()
and __free_pages(). Use __folio_alloc() and folio_put() instead. This
actually removes a call to compound_head(), but more importantly, it
prepares us for the move to memdescs.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: remove struct page from get_shadow_from_swap_cache

We don't actually use any parts of struct page; all we do is check the
value of the pointer. So give the pointer the appropriate name & type.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

x86: mm: accelerate pagefault when badaccess

The access_error() of vma is already checked under per-VMA lock; if it is
a bad access, directly handle the error, no need to retry with mmap_lock
again.  In order to release the correct lock, pass the mm_struct into
bad_area_access_error().  If mm is NULL, release the vma lock; otherwise
release mmap_lock.  Since the page fault is handled under per-VMA lock,
count it as a vma lock event with VMA_LOCK_SUCCESS.
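
The lock selection can be sketched with a small userspace model.  The
structures and counters below are hypothetical stand-ins, not the kernel's
types; the sketch only shows which lock gets dropped depending on whether mm
is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins that only count how many readers hold each lock. */
struct model_mm {
	int mmap_lock_readers;
};

struct model_vma {
	int vma_lock_readers;
};

/*
 * Error path for a bad access.  A NULL mm means the fault was handled
 * under the per-VMA lock, so that lock is released; otherwise the
 * mmap_lock is released.
 */
static void model_bad_area_access_error(struct model_mm *mm,
					struct model_vma *vma)
{
	if (mm)
		mm->mmap_lock_readers--;
	else
		vma->vma_lock_readers--;
	/* A real handler would now report the error to the faulting task. */
}
```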

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

s390: mm: accelerate pagefault when badaccess

The vm_flags of vma are already checked under per-VMA lock; if it is a bad
access, directly handle the error, no need to retry with mmap_lock again.
Since the page fault is handled under per-VMA lock, count it as a vma lock
event with VMA_LOCK_SUCCESS.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

riscv: mm: accelerate pagefault when badaccess

The access_error() of vma is already checked under per-VMA lock; if it is a
bad access, directly handle the error, no need to retry with mmap_lock again.
Since the page fault is handled under per-VMA lock, count it as a vma lock
event with VMA_LOCK_SUCCESS.

[[email protected]: use `cause' rather than SIGSEGV, per Alexandre]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Tested-by: Alexandre Ghiti <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

powerpc: mm: accelerate pagefault when badaccess

access_[pkey]_error() has already checked the vma under the per-VMA lock,
so if the access is bad, handle the error directly; there is no need to
retry with mmap_lock.  In order to release the correct lock, pass the
mm_struct into bad_access_pkey()/bad_access(): if mm is NULL, release the
vma lock, otherwise release mmap_lock.  Since the page fault is handled
under the per-VMA lock, count it as a vma lock event with
VMA_LOCK_SUCCESS.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm: mm: accelerate pagefault when VM_FAULT_BADACCESS

The vm_flags of the vma are already checked under the per-VMA lock, so if
the access is bad, set fault to VM_FAULT_BADACCESS directly and handle the
error; there is no need to retry with mmap_lock.  Since the page fault is
handled under the per-VMA lock, count it as a vma lock event with
VMA_LOCK_SUCCESS.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS

The vm_flags of the vma are already checked under the per-VMA lock, so if
the access is bad, set fault to VM_FAULT_BADACCESS directly and handle the
error; there is no need to retry with mmap_lock.  This reduces latency by
34% in the 'lat_sig -P 1 prot lat_sig' testcase from lmbench.

Since the page fault is handled under per-VMA lock, count it as a vma lock
event with VMA_LOCK_SUCCESS.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64: mm: cleanup __do_page_fault()

Patch series "arch/mm/fault: accelerate pagefault when badaccess", v2.

Since VMA lock-based page fault handling was enabled, a bad access
encountered under the per-vma lock falls back to mmap_lock-based handling,
which leads to an unnecessary mmap_lock acquisition and a second vma
lookup.  A test from lmbench shows a 34% improvement with these changes on
arm64:

lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198
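
The control-flow change made across the series can be modelled in user
space roughly as below.  This is a toy sketch rather than kernel code:
struct toy_vma, struct toy_stats, toy_fault() and the fast_badaccess
switch are hypothetical names invented for illustration, and the model
only counts which lock path a fault ends up on.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_vma {
	bool writable;
};

struct toy_stats {
	int vma_lock_success;	/* faults completed under the per-VMA lock */
	int mmap_lock_taken;	/* times we fell back to the mmap_lock path */
};

enum toy_result { TOY_OK, TOY_BADACCESS };

/*
 * Handle one fault.  With fast_badaccess == false we mimic the old shape:
 * a bad access seen under the per-VMA lock falls back to mmap_lock and
 * repeats the lookup.  With fast_badaccess == true the error is reported
 * directly and counted as a per-VMA lock success.
 */
static enum toy_result toy_fault(const struct toy_vma *vma, bool is_write,
				 bool fast_badaccess, struct toy_stats *st)
{
	bool bad = is_write && !vma->writable;

	if (!bad) {
		st->vma_lock_success++;
		return TOY_OK;
	}
	if (fast_badaccess) {
		/* Permissions were already checked: no retry needed. */
		st->vma_lock_success++;
		return TOY_BADACCESS;
	}

	/* Old path: take mmap_lock, find the vma again, re-check. */
	st->mmap_lock_taken++;
	return TOY_BADACCESS;
}
```

In this model a write to a read-only vma costs one mmap_lock acquisition
in the old shape and none in the new one.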

This patch (of 7):

__do_page_fault() only calls handle_mm_fault() after the vm_flags are
checked, and it is only called by do_page_fault(), so squash it into
do_page_fault() to clean up the code.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD

Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
folio that is fully and contiguously mapped in the pageout/cold vm range.
This change means that large folios will be maintained all the way to swap
storage.  This both improves performance during swap-out, by eliding the
cost of splitting the folio, and sets us up nicely for maintaining the
large folio when it is swapped back in (to be covered in a separate
series).
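
As a rough illustration of what "fully and contiguously mapped" means, the
check can be sketched in user space as below.  struct toy_pte and
toy_fully_mapped() are hypothetical names, not the helpers used by
madvise_cold_or_pageout_pte_range(), and the model assumes the caller
passes only the ptes that lie inside the madvise range.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE: just a present bit and a page frame number. */
struct toy_pte {
	bool present;
	unsigned long pfn;
};

/*
 * Return true if the first folio_nr_pages entries of ptes map the folio's
 * pages in order, starting at first_pfn.  nr_in_range is how many ptes of
 * the range are available; fewer than folio_nr_pages means the folio is
 * only partially covered by the range.
 */
static bool toy_fully_mapped(const struct toy_pte *ptes, size_t nr_in_range,
			     unsigned long first_pfn, size_t folio_nr_pages)
{
	if (nr_in_range < folio_nr_pages)
		return false;

	for (size_t i = 0; i < folio_nr_pages; i++) {
		if (!ptes[i].present || ptes[i].pfn != first_pfn + i)
			return false;
	}
	return true;
}
```

A folio that passes this kind of check would be kept whole; one that fails
it would still go down the split path.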

Folios that are not fully mapped in the target range are still split, but
note that behavior is changed so that if the split fails for any reason
(folio locked, shared, etc) we now leave it as is and move to the next pte
in the range and continue work on the following folios.  Previously any
failure of this sort would cause the entire operation to give up and no
folios mapped at higher addresses were paged out or made cold.  Given
large folios are becoming more common, this old behavior would have likely
led to wasted opportunities.

While we are at it, change the code that clears young from the ptes to use
ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
function.  This is more efficient than get_and_clear/modify/set, especially
for contpte mappings on arm64, where the old approach would require
unfolding/refolding and the new approach can be done in place.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmscan: avoid split during shrink_folio_list()

Now that swap supports storing all mTHP sizes, avoid splitting large
folios before swap-out.  This benefits performance of the swap-out path by
eliding split_folio_to_list(), which is expensive, and also sets us up for
swapping in large folios in a future series.

If the folio is partially mapped, we continue to split it since we want to
avoid the extra IO overhead and storage of writing out pages
unnecessarily.

THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count
events only for PMD-mappable folios to avoid user confusion.  THP_SWPOUT
already has the appropriate guard.  Add a guard for THP_SWPOUT_FALLBACK.
It may be appropriate to add per-size counters in future.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: swap: allow storage of all mTHP orders

Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory.  However I've observed that on an
arm64 system running a parallel workload (e.g.  kernel compilation) across
many cores, under high memory pressure, the speed regresses.  This is due
to bottlenecking on the increased number of TLBIs added due to all the
extra folio splitting when the large folios are swapped out.

Therefore, solve this regression by adding support for swapping out mTHP
without needing to split the folio, just like is already done for
PMD-sized THP.  This change only applies when CONFIG_THP_SWAP is enabled,
and when the swap backing store is a non-rotating block device.  These are
the same constraints as for the existing PMD-sized THP swap-out support.

Note that no attempt is made to swap-in (m)THP here - this is still done
page-by-page, like for PMD-sized THP.  But swapping-out mTHP is a
prerequisite for swapping-in mTHP.

The main change here is to improve the swap entry allocator so that it can
allocate any power-of-2 number of contiguous entries between [1, (1 <<
PMD_ORDER)].  This is done by allocating a cluster for each distinct order
and allocating sequentially from it until the cluster is full.  This
ensures that we don't need to search the map and we get no fragmentation
due to alignment padding for different orders in the cluster.  If there is
no current cluster for a given order, we attempt to allocate a free
cluster from the list.  If there are no free clusters, we fail the
allocation and the caller can fall back to splitting the folio and
allocating individual entries (as per existing PMD-sized THP fallback).
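
The allocation scheme described above can be sketched as a small
user-space model.  Everything here is an assumption made for illustration:
struct toy_swap, toy_swap_init(), toy_alloc() and the TOY_* constants are
hypothetical, the free list is reduced to a counter, per-cpu state is
ignored, and cluster 0 is simply never handed out.

```c
#include <stdbool.h>

#define TOY_CLUSTER_SIZE 512u	/* entries per cluster; assumed value */
#define TOY_NR_ORDERS    10u	/* orders 0..9; assumed value */
#define TOY_NEXT_INVALID 0u	/* "no current cluster" for this order */

struct toy_swap {
	unsigned int next[TOY_NR_ORDERS];	/* next offset, per order */
	unsigned int free_head;			/* next unused cluster */
	unsigned int nr_clusters;
};

static void toy_swap_init(struct toy_swap *s, unsigned int nr_clusters)
{
	for (unsigned int i = 0; i < TOY_NR_ORDERS; i++)
		s->next[i] = TOY_NEXT_INVALID;
	s->free_head = 1;	/* cluster 0 is never handed out here */
	s->nr_clusters = nr_clusters;
}

/* Allocate 1 << order contiguous entries; false means "caller must split". */
static bool toy_alloc(struct toy_swap *s, unsigned int order,
		      unsigned int *offset)
{
	if (order >= TOY_NR_ORDERS)
		return false;

	if (s->next[order] == TOY_NEXT_INVALID) {
		if (s->free_head >= s->nr_clusters)
			return false;	/* no free cluster left */
		s->next[order] = s->free_head++ * TOY_CLUSTER_SIZE;
	}

	*offset = s->next[order];
	s->next[order] += 1u << order;

	/* Cluster exhausted: forget it so the next call takes a fresh one. */
	if (s->next[order] % TOY_CLUSTER_SIZE == 0)
		s->next[order] = TOY_NEXT_INVALID;
	return true;
}
```

Because each order allocates sequentially from its own cluster, every
returned offset in this model is naturally aligned to 1 << order and no
padding is needed between entries of different sizes.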

The per-order current clusters are maintained per-cpu using the existing
infrastructure.  This is done to avoid interleaving pages from different
tasks, which would prevent IO being batched.  This is already done for the
order-0 allocations so we follow the same pattern.

As is done for order-0 per-cpu clusters, the scanner now can steal order-0
entries from any per-cpu-per-order reserved cluster.  This ensures that
when the swap file is getting full, space doesn't get tied up in the
per-cpu reserves.

This change only modifies swap to be able to accept any order mTHP.  It
doesn't change the callers to elide doing the actual split.  That will be
done in separate changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: swap: update get_swap_pages() to take folio order

We are about to allow swap storage of any mTHP size.  To prepare for that,
let's change get_swap_pages() to take a folio order parameter instead of
nr_pages.  This makes the interface self-documenting; a power-of-2 number
of pages must be provided.  We will also need the order internally so this
simplifies accessing it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: swap: simplify struct percpu_cluster

struct percpu_cluster stores the index of cpu's current cluster and the
offset of the next entry that will be allocated for the cpu.  These two
pieces of information are redundant because the cluster index is just
(offset / SWAPFILE_CLUSTER).  The only reason for explicitly keeping the
cluster index is because the structure used for it also has a flag to
indicate "no cluster".  However, this data structure also contains a spin
lock, which is never used in this context; as a side effect the code
copies the spinlock_t structure, which is questionable coding practice in
my view.

So let's clean this up and store only the next offset, and use a sentinel
value (SWAP_NEXT_INVALID) to indicate "no cluster".  SWAP_NEXT_INVALID is
chosen to be 0, because 0 will never be seen legitimately: the first page
in the swap file is the swap header, which is always marked bad to prevent
it from being allocated as an entry.  This also prevents the cluster to
which it belongs being marked free, so it will never appear on the free
list.
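
A minimal sketch of the idea, assuming a SWAPFILE_CLUSTER of 512 (the real
value depends on the kernel configuration).  struct toy_percpu_cluster and
the two helpers are hypothetical names that only illustrate deriving the
cluster index from the stored offset and using 0 as the sentinel.

```c
#include <stdbool.h>

#define SWAPFILE_CLUSTER  512u	/* assumed; depends on kernel config */
#define SWAP_NEXT_INVALID 0u	/* offset 0 is the swap header */

/* Simplified stand-in for the slimmed-down per-cpu state. */
struct toy_percpu_cluster {
	unsigned int next;	/* offset of the next entry to allocate */
};

/* The sentinel replaces the old "no cluster" flag. */
static bool toy_has_cluster(const struct toy_percpu_cluster *c)
{
	return c->next != SWAP_NEXT_INVALID;
}

/* The cluster index is no longer stored: derive it from the offset. */
static unsigned int toy_cluster_index(const struct toy_percpu_cluster *c)
{
	return c->next / SWAPFILE_CLUSTER;
}
```

For example, under that assumption a next offset of 1300 falls in cluster
2, since 1300 / 512 rounds down to 2.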

This change saves 16 bytes per cpu.  And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up saving
16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
system.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and lock
a large folio much more often, which could lead to contention and (e.g.)
failure to split large folios, etc.

Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together.  This allows us to first drop a reference to each swap
slot before we try to release the cache folio.  This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.
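
The difference in the number of folio lock attempts can be illustrated
with a toy model.  The swap map is reduced to an array of counts, and
struct toy_counters, toy_free_each() and toy_free_nr() are hypothetical
names; the real functions do considerably more work per entry.

```c
#include <stddef.h>

struct toy_counters {
	int folio_lock_attempts;	/* ref + lock attempts on the folio */
};

/* Per-entry shape: one cache-release attempt for every swap entry. */
static void toy_free_each(unsigned char *swap_map, size_t start, size_t nr,
			  struct toy_counters *c)
{
	for (size_t i = 0; i < nr; i++) {
		if (swap_map[start + i])
			swap_map[start + i]--;
		c->folio_lock_attempts++;
	}
}

/* Batched shape: drop every slot reference first, then try the folio once. */
static void toy_free_nr(unsigned char *swap_map, size_t start, size_t nr,
			struct toy_counters *c)
{
	for (size_t i = 0; i < nr; i++) {
		if (swap_map[start + i])
			swap_map[start + i]--;
	}
	c->folio_lock_attempts++;
}
```

For a range of 512 entries the per-entry version records 512 attempts in
this model, while the batched version records one.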

Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are already
gathered in zap_pte_range().

While we are at it, let's simplify by converting the return type of both
functions to void.  The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra reporting
wasn't exactly guaranteed.  We will still get the warning with most of the
information from get_swap_device().  With the batch version, we wouldn't
know which pte was bad anyway so could print the wrong one.

[[email protected]: fix a build warning on parisc]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags

Patch series "Swap-out mTHP without splitting", v7.

This series adds support for swapping out multi-size THP (mTHP) without
needing to first split the large folio via
split_huge_page_to_list_to_order().  It closely follows the approach
already used to swap-out PMD-sized THP.

There are a couple of reasons for swapping out mTHP without splitting:

  - Performance: It is expensive to split a large folio and under
    extreme memory pressure some workloads regressed performance when
    using 64K mTHP vs 4K small folios because of this extra cost in the
    swap-out path.  This series not only eliminates the regression but
    makes it faster to swap out 64K mTHP vs 4K small folios.

  - Memory fragmentation avoidance: If we can avoid splitting a large
    folio, memory is less likely to become fragmented, making it easier to
    re-allocate a large folio in future.

  - Performance: Enables a separate series [7] to swap-in whole mTHPs,
    which means we won't lose the TLB-efficiency benefits of mTHP once the
    memory has been through a swap cycle.

I've done what I thought was the smallest change possible, and as a
result, this approach is only employed when the swap is backed by a
non-rotating block device (just as PMD-sized THP is supported today).
Discussion against the RFC concluded that this is sufficient.

Performance Testing
===================

I've run some swap performance tests on Ampere Altra VM (arm64) with 8
CPUs.  The VM is set up with a 35G block ram device as the swap device and
the test is run from inside a memcg limited to 40G memory.  I've then run
`usemem` from vm-scalability with 70 processes, each allocating and
writing 1G of memory.  I've repeated everything 6 times and taken the mean
performance improvement relative to 4K page baseline:

| alloc size |                baseline |           + this series |
|            | mm-unstable (~v6.9-rc1) |                         |
|:-----------|------------------------:|------------------------:|
| 4K Page    |                    0.0% |                    1.3% |
| 64K THP    |                  -13.6% |                   46.3% |
| 2M THP     |                   91.4% |                   89.6% |

So with this change, the 64K swap performance goes from a 14% regression to a
46% improvement. While 2M shows a small regression I'm confident that this is
just noise.

[1] https://lore.kernel.org/linux-mm/20231010142111.3997780 [email protected]/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826 [email protected]/
[3] https://lore.kernel.org/linux-mm/20231025144546 [email protected]/
[4] https://lore.kernel.org/linux-mm/20240311150058.1122862 [email protected]/
[5] https://lore.kernel.org/linux-mm/20240327144537.4165578 [email protected]/
[6] https://lore.kernel.org/linux-mm/20240403114032.1162100 [email protected]/
[7] https://lore.kernel.org/linux-mm/20240304081348 [email protected]/
[8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/
[9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/

This patch (of 7):

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, remove CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same as
the cluster size.

The only use of the flag was to determine whether a swap entry refers to a
single page or a PMD-sized THP in swap_page_trans_huge_swapped().  Instead
of relying on the flag, we now pass in order, which originates from the
folio's order.  This allows the logic to work for folios of any order.
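
A simplified user-space sketch of an order-based check is shown below.
toy_range_swapped() is a hypothetical name, and the swap map is modelled
as a plain array of counts; the sketch only illustrates scanning the
naturally aligned block of 1 << order entries that contains a given
offset, not the actual body of swap_page_trans_huge_swapped().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if any entry in the naturally aligned block of 1 << order
 * entries containing offset still has a non-zero count.
 */
static bool toy_range_swapped(const unsigned char *swap_map, size_t offset,
			      unsigned int order)
{
	size_t nr = (size_t)1 << order;
	size_t start = offset & ~(nr - 1);	/* round down to block start */

	for (size_t i = 0; i < nr; i++) {
		if (swap_map[start + i])
			return true;
	}
	return false;
}
```

With order 0 this degenerates to checking the single entry, which is why
one code path can serve folios of any order.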

The one snag is that one of the swap_page_trans_huge_swapped() call sites
does not have the folio.  But it was only being called there to shortcut a
call to __try_to_reclaim_swap() in some cases.  __try_to_reclaim_swap() gets
the folio and (via some other functions) calls
swap_page_trans_huge_swapped().  So I've removed the problematic call site
and believe the new logic should be functionally equivalent.

That said, removing the fast path means that we will take a reference and
trylock a large folio much more often, which we would like to avoid.  The
next patch will solve this.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Chris Li <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: page_alloc: use the correct THP order for THP PCP

Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists") extends the PCP allocator to store THP pages, and
it determines whether to cache THP pages in PCP by comparing with
pageblock_order. But the pageblock_order is not always equal to THP
order. It might also be MAX_PAGE_ORDER, which could prevent PCP from
caching THP pages.

Therefore, use HPAGE_PMD_ORDER instead to decide whether to cache THP
pages in the PCP, which fixes this issue.
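
The effect of the comparison can be sketched with two toy predicates.  The
function names and the TOY_* constants are assumptions for illustration
(order 9 corresponds to a 2M THP with 4K pages); they are not the
allocator's real helper.

```c
#include <stdbool.h>

#define TOY_COSTLY_ORDER    3u	/* stands in for PAGE_ALLOC_COSTLY_ORDER */
#define TOY_HPAGE_PMD_ORDER 9u	/* 2M THP with 4K pages; assumed */

/* Before: THP-sized requests were recognised via pageblock_order. */
static bool toy_pcp_allowed_old(unsigned int order,
				unsigned int pageblock_order)
{
	return order <= TOY_COSTLY_ORDER || order == pageblock_order;
}

/* After: compare against the THP order itself. */
static bool toy_pcp_allowed_new(unsigned int order)
{
	return order <= TOY_COSTLY_ORDER || order == TOY_HPAGE_PMD_ORDER;
}
```

When pageblock_order is larger than the THP order (10 versus 9 in this
model), the old predicate rejects order-9 requests while the new one
accepts them.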

Link: https://lkml.kernel.org/r/a25c9e14cd03907d5978b60546a69e6aa3fc2a7d.1712151833.git.baolin.wang@linux.alibaba.com
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc: convert smaps_pmd_entry to use a folio

Replace two calls to compound_head() with one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc: pass a folio to smaps_page_accumulate()

Both callers already have a folio; pass it in instead of doing the
conversion each time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc: convert smaps_page_accumulate to use a folio

Replaces three calls to compound_head() with one. Shrinks the function
from 2614 bytes to 1112 bytes in an allmodconfig build.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc: convert gather_stats to use a folio

Patch series "Use folio APIs in procfs".

We're down to very few users of the PageFoo macros, with proc being a
major user.

After this patchset and another patchset I have for khugepaged, we can get
rid of PageActive, PageReadahead and PageSwapBacked.  This patchset has
the usual advantages in its own right of removing hidden calls to
compound_head().  We have the page table lock, so the mapcount & refcount
are stable and there can't be any races with folios suddenly becoming tail
pages.

This patch (of 4):

Replaces six calls to compound_head() with one.  Shrinks the function from
5054 bytes to 1756 bytes in an allmodconfig build.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: generate PAGE_IDLE_FLAG definitions

If CONFIG_PAGE_IDLE_FLAG is not set, we can use FOLIO_FLAG_FALSE() to
generate these definitions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: remove page_idle and page_young wrappers

All users have now been converted to the folio equivalents, so remove the
page wrappers.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc: convert smaps_account() to use a folio

Replace seven calls to compound_head() with one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc: convert clear_refs_pte_range to use a folio

Patch series "Remove page_idle and page_young wrappers".

There are only a couple of places left using the page wrappers for idle &
young tracking. Convert the two users in proc and then we can remove the
wrappers. That enables the further simplification of autogenerating the
definitions when CONFIG_PAGE_IDLE_FLAG is disabled.

This patch (of 4):

Replaces four calls to compound_head() with two.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: use a folio throughout hpage_collapse_scan_file()

Replace the use of pages with folios. Saves a few calls to
compound_head() and removes some uses of obsolete functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: use a folio throughout collapse_file()

Pull folios from the page cache instead of pages. Half of this work had
been done already, but we were still operating on pages for a large chunk
of this function. There is no attempt in this patch to handle large
folios that are smaller than a THP; that will have to wait for a future
patch.

[[email protected]: the unlikely() is embedded in IS_ERR()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: remove hpage from collapse_file()

Use new_folio throughout where we had been using hpage.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: pass a folio to __collapse_huge_page_copy()

Simplify the body of __collapse_huge_page_copy() while I'm looking at
it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: remove hpage from collapse_huge_page()

Work purely in terms of the folio. Removes a call to compound_head()
in put_page().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: convert alloc_charge_hpage to alloc_charge_folio

Both callers want to deal with a folio, so return a folio from this
function.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

khugepaged: inline hpage_collapse_alloc_folio()

Patch series "khugepaged folio conversions".

We've been kind of hacking piecemeal at converting khugepaged to use
folios instead of compound pages, and so this patchset is a little larger
than it should be as I undo some of our wrong moves in the past. In
particular, collapse_file() now consistently uses 'new_folio' for the
freshly allocated folio and 'folio' for the one that's currently in use.

This patch (of 7):

This function has one caller, and the combined function is simpler to
read, reason about and modify.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: mremap_test: use sscanf to parse /proc/self/maps

Enforce consistency across files by avoiding two separate functions to
parse /proc/self/maps, replacing them with a simple sscanf().
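
A minimal sketch of parsing the address range from one maps line with
sscanf().  toy_parse_maps_line() is a hypothetical helper, not the code
added to mremap_test, and it reads only the start and end addresses.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse the "start-end" hexadecimal address range at the beginning of a
 * /proc/self/maps line.  Returns true only if both values were read.
 */
static bool toy_parse_maps_line(const char *line, unsigned long *start,
				unsigned long *end)
{
	return sscanf(line, "%lx-%lx", start, end) == 2;
}
```

Checking that sscanf() returned 2 is what distinguishes a well-formed line
from one where only part of the range could be converted.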

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dev Jain <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: mremap_test: optimize execution time from minutes to seconds using chunkwise memcmp

Mismatch index is currently being checked by a brute force iteration over
the buffer.  Instead, break the comparison into O(sqrt(n)) chunks, each of
size O(sqrt(n)), where n is the size of the buffer.  Do a brute-force
iteration to print to stdout only when the highly optimized memcmp()
library function returns a mismatch in the chunk.  The time complexity of
this algorithm is O(sqrt(n)) * t, where t is the time taken by memcmp();
for our test conditions, it is safe to assume t to be small.
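
The chunked comparison can be sketched as follows.  toy_chunk_size() and
toy_first_mismatch() are hypothetical names, and the sketch returns the
first mismatching index instead of printing it as the test does.

```c
#include <stddef.h>
#include <string.h>

/* Smallest c with c * c >= n, and at least 1: roughly sqrt(n). */
static size_t toy_chunk_size(size_t n)
{
	size_t c = 1;

	while (c * c < n)
		c++;
	return c;
}

/* Return the index of the first differing byte, or n if the buffers match. */
static size_t toy_first_mismatch(const unsigned char *a,
				 const unsigned char *b, size_t n)
{
	size_t chunk = toy_chunk_size(n);

	for (size_t off = 0; off < n; off += chunk) {
		size_t len = n - off < chunk ? n - off : chunk;

		if (memcmp(a + off, b + off, len) == 0)
			continue;	/* whole chunk matches: skip it */

		/* Only now fall back to a byte-by-byte scan of this chunk. */
		for (size_t i = 0; i < len; i++) {
			if (a[off + i] != b[off + i])
				return off + i;
		}
	}
	return n;
}
```

The byte-by-byte loop runs over at most one chunk, so the slow path is
bounded by roughly sqrt(n) comparisons.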

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dev Jain <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: mremap_test: optimize using pre-filled random array and memcpy

Patch series "selftests/mm: mremap_test: Optimizations and style fixes".

The mremap_test, in a worst case controlled by the -t flag, does a for
loop iteration in orders of GB.  Without compromising on the stdout
report, the aim is to reduce this time.

A pre-filled random buffer is allocated based on the seed, replacing
repetitive rand() calls.  The byte pattern in the memory locations is set
through memcpy() from the random buffer.

Replace the loop that prints the mismatch index to stdout with a more
efficient algorithm: break the comparison into chunks, compare each chunk
with the highly optimized memcmp() library function, and do a brute force
iteration only when a mismatch does occur.

Also, use sscanf() to parse /proc/self/maps for consistency across files.

Execution time results (x86 system):
./mremap_test
Original: 3 seconds
After change: 0.8 seconds

./mremap_test -t100
Original: 17 seconds
After change: 2 seconds

./mremap_test -t0 (worst case):
Original: 9:40 minutes
After change: 45 seconds

This patch (of 3):

Allocate a pre-filled random buffer using the seed.  Replace the iterative
copying of the random sequence into buffers with the highly optimized
library function memcpy().
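
A minimal sketch of the idea, assuming a byte pattern drawn from rand().
The helper names make_rand_buf() and fill_region() are hypothetical and do
not come from the selftest.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate len bytes and fill them once from rand() seeded with seed. */
unsigned char *make_rand_buf(unsigned int seed, size_t len)
{
	unsigned char *buf = malloc(len);
	size_t i;

	if (!buf)
		return NULL;
	srand(seed);
	for (i = 0; i < len; i++)
		buf[i] = (unsigned char)rand();
	return buf;
}

/* Stamp the pattern into dst with one memcpy(), not a per-byte loop. */
void fill_region(unsigned char *dst, const unsigned char *pattern, size_t len)
{
	assert(dst != NULL && pattern != NULL);
	memcpy(dst, pattern, len);
}
```

The rand() loop runs once per seed; every region that needs the pattern
afterwards is filled by a single memcpy() call.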

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dev Jain <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

memory: remove the now superfluous sentinel element from ctl_table array

This commit comes at the tail end of a greater effort to remove the empty
elements at the end of the ctl_table arrays (sentinels), which will reduce
the overall build-time size of the kernel and run-time memory bloat by ~64
bytes per sentinel (further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/).

Remove sentinel from all files under mm/ that register a sysctl table.
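
To illustrate what a sentinel costs, here is a userspace sketch that walks
a table in both styles.  struct demo_entry and the helper names are
hypothetical stand-ins; this is not the kernel's struct ctl_table or its
registration API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry, standing in for a sysctl table element. */
struct demo_entry {
	const char *name;
	int value;
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Old style: iterate until the empty terminating element is reached. */
int sum_with_sentinel(const struct demo_entry *t)
{
	int sum = 0;

	assert(t != NULL);
	for (; t->name; t++)
		sum += t->value;
	return sum;
}

/* New style: the caller passes the element count, no terminator needed. */
int sum_with_size(const struct demo_entry *t, size_t size)
{
	int sum = 0;
	size_t i;

	assert(t != NULL);
	for (i = 0; i < size; i++)
		sum += t[i].value;
	return sum;
}
```

Passing DEMO_ARRAY_SIZE(table) as the size lets each array drop its final
empty element, saving one entry per table.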

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Granados <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: rename vma_pgoff_address back to vma_address

With all callers converted, we can use the nice shorter name. Take this
opportunity to reorder the arguments to the logical order (larger object
first).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: remove vma_address()

Convert the three remaining callers to call vma_pgoff_address() directly.
This removes an ambiguity where we'd check just one page if passed a tail
page and all N pages if passed a head page.

Also add better kernel-doc for vma_pgoff_address().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: correct page_mapped_in_vma() for large folios

Patch series "Unify vma_address and vma_pgoff_address".

The current vma_address() pretends that the ambiguity between head & tail
page is an advantage.  If you pass a head page to vma_address(), it will
operate on all pages in the folio, while if you pass a tail page, it will
operate on a single page.  That's not what any of the callers actually
want, so first convert all callers to use vma_pgoff_address() and then
rename vma_pgoff_address() to vma_address().

This patch (of 3):

If 'page' is the first page of a large folio then vma_address() will scan
for any page in the entire folio.  This can lead to page_mapped_in_vma()
returning true if some of the tail pages are mapped and the head page is
not.  This could lead to memory failure choosing to kill a task
unnecessarily.
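
The head/tail ambiguity can be modelled in userspace as follows.  This is a
simplified sketch, not kernel code: struct demo_vma, demo_address(),
DEMO_PAGE_SHIFT and DEMO_BAD_ADDR are hypothetical names, and the error
value is a placeholder.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_BAD_ADDR ((unsigned long)-1)

/* Hypothetical, reduced view of a VMA: address range plus file offset. */
struct demo_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/*
 * Return the address at which the range [pgoff, pgoff + nr_pages) first
 * appears in the VMA, or DEMO_BAD_ADDR if no page of it is covered.
 */
unsigned long demo_address(const struct demo_vma *vma, unsigned long pgoff,
			   unsigned long nr_pages)
{
	unsigned long address;

	assert(nr_pages > 0);
	if (pgoff >= vma->vm_pgoff) {
		address = vma->vm_start +
			  ((pgoff - vma->vm_pgoff) << DEMO_PAGE_SHIFT);
		if (address < vma->vm_start || address >= vma->vm_end)
			address = DEMO_BAD_ADDR;
	} else if (pgoff + nr_pages - 1 >= vma->vm_pgoff) {
		/* A later page of the range is the first one in the VMA. */
		address = vma->vm_start;
	} else {
		address = DEMO_BAD_ADDR;
	}
	return address;
}
```

With nr_pages set to the folio size, a head page lying outside the VMA
still yields a valid address because a tail page is mapped; with nr_pages
set to 1, only the page that was asked about is considered.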

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics

Now that mTHP can also be split or added into the deferred list, add
folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
confusion with PMD mapped THP related statistics.

[[email protected]: check THP earlier in case folio is split, per Lance]
Link: https://lkml.kernel.org/r/b99f8cb14bc85fdb6ab43721d1331cb5ebed2581.1713771041.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: support multi-size THP numa balancing

Anonymous page allocation already supports multi-size THP (mTHP), but
NUMA balancing still prohibits mTHP migration even when the mapping is
exclusive, which is unreasonable.

Allow scanning mTHP:
Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
pages") skips NUMA migration of shared CoW pages to avoid migrating shared
data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to
NUMA-migrate COW pages that have other uses") changed to use page_count()
to avoid migrating GUP pages, which also skips mTHP NUMA scanning.
Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
issue; although there is still a GUP race, the issue seems to have been
resolved by commit 80d47f5de5e3. Meanwhile, use folio_likely_mapped_shared()
to skip shared CoW pages, though this is not a precise sharers count. To
check whether the folio is shared, ideally we want to make sure every page
is mapped to the same process, but doing that seems expensive, and using
the estimated mapcount seems to work when running the autonuma benchmark.

Allow migrating mTHP:
As mentioned in the previous thread[1], large folios (including THP) are
more susceptible to false sharing among threads than 4K base pages,
leading to pages ping-ponging back and forth during NUMA balancing, which
is currently not easy to resolve. Therefore, as a start to supporting mTHP
NUMA balancing, we can follow the PMD mapped THP's strategy: reuse the
2-stage filter in should_numa_migrate_memory() to check whether the mTHP
is being heavily contended among threads (by checking the CPU id and pid
of the last access) to avoid false sharing to some degree. Thus, we can
restore all PTE maps upon the first hint page fault of a large folio, to
follow the PMD mapped THP's strategy. In the future, we can continue to
optimize the NUMA balancing algorithm to avoid the false sharing issue
with large folios as much as possible.

Performance data:
Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
Base: 2024-03-25 mm-unstable branch
Enable mTHP to run autonuma-benchmark

mTHP:16K
                     Base     Patched
numa01               224.70   143.48
numa01_THREAD_ALLOC  118.05   47.43
numa02               13.45    9.29
numa02_SMT           14.80    7.50

mTHP:64K
                     Base     Patched
numa01               216.15   114.40
numa01_THREAD_ALLOC  115.35   47.41
numa02               13.24    9.25
numa02_SMT           14.67    7.34

mTHP:128K
                     Base     Patched
numa01               205.13   144.45
numa01_THREAD_ALLOC  112.93   41.88
numa02               13.16    9.18
numa02_SMT           14.81    7.49

[1] https://lore.kernel.org/all/20231117100745 [email protected]/

[[email protected]: v3]
Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: factor out the numa mapping rebuilding into a new helper

Patch series "support multi-size THP numa balancing", v2.

This patchset adds support for mTHP NUMA balancing.  As a simple starting
point, the NUMA balancing algorithm for mTHP follows the THP strategy as
the basic support.  Please find details in each patch.

This patch (of 2):

To support large folio's numa balancing, factor out the numa mapping
rebuilding into a new helper as a preparation.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback cases

Fallback rates surpassing 90% have been observed on phones utilizing 64KiB
CONT-PTE mTHP.  In these scenarios, when one out of every 16 PTEs fails to
allocate large folios, the remaining 15 PTEs fall back.  Consequently,
invoking vma_thp_gfp_mask seems redundant in such cases.  Furthermore,
abstaining from its use can also contribute to improved code readability.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Itaru Kitayama <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

zram: add max_pages param to recompression

Introduce "max_pages" param to recompress device attribute which sets an
upper limit on the number of entries (pages) zram attempts to recompress
(in this particular recompression call). S/W recompression can be quite
expensive so limiting the number of pages recompress touches can be quite
helpful.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: init_mlocked_on_free_v3

Implements the "init_mlocked_on_free" boot option. When this boot option
is enabled, any mlock'ed pages are zeroed on free. If
the pages are munlock'ed beforehand, no initialization takes place.
This boot option is meant to combat the performance hit of
"init_on_free" as reported in commit 6471384af2a6 ("mm: security:
introduce init_on_alloc=1 and init_on_free=1 boot options"). With
"init_mlocked_on_free=1" only relevant data is zeroed on free while
everything else is left untouched by the kernel. Correspondingly, this patch
introduces no performance hit for unmapping non-mlock'ed memory. The
unmapping overhead for purely mlocked memory was measured to be
approximately 13%. Realistically, most systems mlock only a fraction of
the total memory so the real-world system overhead should be close to
zero.

Optimally, userspace programs clear any key material or other
confidential memory before exit and munlock the corresponding memory
regions. If a program crashes, userspace key managers fail to do this
job. Accordingly, no munlock operations are performed, so the data is
caught and zeroed by the kernel. Should the program not crash, all
memory will ideally be munlocked so no overhead is caused.
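
The well-behaved userspace pattern described above might look like the
following sketch.  The helper names wipe() and use_secret_locked() are
hypothetical; mlock() and munlock() are the standard calls from
<sys/mman.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Zero the buffer through a volatile pointer so the stores are kept. */
static void wipe(volatile unsigned char *p, size_t len)
{
	while (len--)
		*p++ = 0;
}

/*
 * Lock buf, use it, then clear and unlock it.  Returns 1 if the buffer
 * was locked while in use, 0 if mlock() failed (e.g. RLIMIT_MEMLOCK).
 */
int use_secret_locked(unsigned char *buf, size_t len)
{
	int locked;

	assert(buf != NULL);
	locked = (mlock(buf, len) == 0);

	/* ... work with the key material in buf here ... */

	wipe(buf, len);			/* clear before unlocking */
	if (locked)
		munlock(buf, len);	/* munlocked pages are not re-zeroed */
	return locked;
}
```

If the process crashes before reaching wipe(), the pages are still
mlock'ed when they are freed, which is the case the boot option covers.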

CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable
"init_mlocked_on_free" by default.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: York Jasper Niebuhr <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: York Jasper Niebuhr <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftest/mm: ksm_functional_tests: extend test case for ksm fork/exec

This extends test_prctl_fork() and test_prctl_fork_exec() to make sure
that deduplication really happens, instead of only testing that the
MMF_VM_MERGE_ANY flag is set.

[[email protected]: fix spelling mistake in ksft_test_result_skip message]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftest/mm: ksm_functional_tests: refactor mmap_and_merge_range()

In order to extend test_prctl_fork() and test_prctl_fork_exec() to make
sure that deduplication really happens, mmap_and_merge_range() needs to be
refactored.

Firstly, mmap_and_merge_range() will be called without needing to enable
KSM by madvise or prctl.  So, switch the 'bool use_prctl' parameter to
enum ksm_merge_mode.

Secondly, mmap_and_merge_range() will be called in the child process in
the two testcases, where it isn't appropriate to call
ksft_test_result_{fail, skip}, because the global variables
ksft_{fail, skip} aren't consistent with the parent process.  Thus,
convert calls of ksft_test_result_{fail, skip} to ksft_print_msg(), return
a different error for each of the two cases, and rename
mmap_and_merge_range() to __mmap_and_merge_range().  For existing callers,
introduce a new mmap_and_merge_range() to handle the different return
values of __mmap_and_merge_range().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/ksm: fix ksm exec support for prctl

Patch series "mm/ksm: fix ksm exec support for prctl", v4.

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a task calls execve().  However, it doesn't
create the mm_slot, so ksmd will not try to scan this task.  The first
patch fixes the issue.

The second patch refactors to prepare for the third patch.  The third
patch extends the KSM selftests to verify that deduplication really
happens after fork/exec inherits the KSM setting.

This patch (of 3):

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a task calls execve().  However, it doesn't
create the mm_slot, so ksmd will not try to scan this task.

To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init()
when the mm has MMF_VM_MERGE_ANY flag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl")
Signed-off-by: Jinjiang Tu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/x86: add placement guard gap test for shstk

The existing shadow stack test for guard gaps just checks that new
mappings are not placed in an existing mapping's guard gap. Add one that
checks that new mappings are not placed such that preexisting mappings are
in the new mapping's guard gap.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Deepak Gupta <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

x86/mm: care about shadow stack guard gap during placement

When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOW_STACK, VM_GROWSUP and
VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
needs to consider two things:

1. That the new mapping isn't placed in any existing mapping's guard
    gap.
2. That the new mapping isn't placed such that any existing mappings
    are in *its* guard gaps.

The longstanding behavior of mmap() is to ensure 1, but not take any care
around 2.  So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area.  Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.

Now that the vm_flags is passed into the arch get_unmapped_area()'s, and
vm_unmapped_area() is ready to consider it, have VM_SHADOW_STACK mappings
get guard gap consideration for scenario 2.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Deepak Gupta <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>