Git Repo - linux.git/log
linux.git
20 months ago  s390: add pte_free_defer() for pgtables sharing page
Hugh Dickins [Wed, 12 Jul 2023 04:38:35 +0000 (21:38 -0700)]
s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than the others, because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves, with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that a
page is never on the free list if pte_free_defer() was used on either half
(marked by PageActive).  And for simplicity, delay calling RCU until both
halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer() can
result in wasting some amount of memory for pagetables, depending on how
long the allocated fragment stays in use.  In practice, this effect is
expected to be insignificant, and does not justify a far more complex
approach, which might allow adding the fragments back later in
__tlb_remove_table(), where we might not have a stable mm any more.
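
In outline, the new rule amounts to something like the following sketch
(heavily simplified; the real s390 code folds this into its
page->_refcount bit management):

  /* Sketch only: defer freeing one 2K half of a 4K pgtable page */
  void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
  {
          struct page *page = virt_to_page(pgtable);

          SetPageActive(page);    /* never back on the free list */
          pte_free(mm, pgtable);  /* releases this half as usual */
  }

  /* Sketch: in the freeing path, once both 2K halves are free */
  if (PageActive(page))
          call_rcu(&page->rcu_head, pte_free_now);  /* RCU-delayed */
  else
          __free_page(page);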

[[email protected]: Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Gerald Schaefer <[email protected]>
Tested-by: Alexander Gordeev <[email protected]>
Acked-by: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  sparc: add pte_free_defer() for pte_t *pgtable_t
Hugh Dickins [Wed, 12 Jul 2023 04:37:24 +0000 (21:37 -0700)]
sparc: add pte_free_defer() for pte_t *pgtable_t

Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.  So
the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.
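
In outline, such a simple deferred free looks like this sketch (assuming
the straightforward shape described above; exact sparc details may
differ):

  static void pte_free_now(struct rcu_head *head)
  {
          struct page *page = container_of(head, struct page, rcu_head);

          /* convert back from struct page * to pte_t * pgtable_t */
          pte_free(NULL, (pgtable_t)page_address(page));  /* mm unused */
  }

  void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
  {
          call_rcu(&virt_to_page(pgtable)->rcu_head, pte_free_now);
  }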

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  powerpc: add pte_free_defer() for pgtables sharing page
Hugh Dickins [Wed, 12 Jul 2023 04:35:59 +0000 (21:35 -0700)]
powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there use call_rcu() of
pte_free_now() when the last fragment is freed and the page is PageActive.
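
A sketch of that flow (hedged; the actual powerpc code differs in
detail):

  void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
  {
          struct page *page = virt_to_page(pgtable);

          SetPageActive(page);    /* freed fragments are never reused */
          pte_fragment_free((unsigned long *)pgtable, 0);
  }

  /* Sketch: at the end of pte_fragment_free(), for the last fragment */
  if (atomic_dec_and_test(&page->pt_frag_refcount)) {
          if (PageActive(page))
                  call_rcu(&page->rcu_head, pte_free_now);
          else
                  __free_page(page);
  }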

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  powerpc: assert_pte_locked() use pte_offset_map_nolock()
Hugh Dickins [Wed, 12 Jul 2023 04:34:25 +0000 (21:34 -0700)]
powerpc: assert_pte_locked() use pte_offset_map_nolock()

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails.

This change might cause new crashes, which would either expose my
ignorance, indicate issues to be fixed, or limit the usage of
assert_pte_locked().
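
The resulting pattern is roughly this sketch (assuming the usual
pte_offset_map_nolock() calling convention):

  pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
  BUG_ON(!pte);              /* BUG if the page table has vanished */
  assert_spin_locked(ptl);   /* ptl belongs to precisely this pte */
  pte_unmap(pte);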

[[email protected]: assert_pte_locked() still needs the pmd_none() check]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  arm: adjust_pte() use pte_offset_map_nolock()
Hugh Dickins [Wed, 12 Jul 2023 04:33:08 +0000 (21:33 -0700)]
arm: adjust_pte() use pte_offset_map_nolock()

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.
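
The pattern in adjust_pte() then looks roughly like this sketch:

  pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
  if (!pte)
          return 0;       /* no page table here any more */
  spin_lock(ptl);         /* safe: ptl is the lock for this very pte */
  /* ... examine and adjust the pte ... */
  spin_unlock(ptl);
  pte_unmap(pte);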

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/pgtable: add PAE safety to __pte_offset_map()
Hugh Dickins [Wed, 12 Jul 2023 04:32:05 +0000 (21:32 -0700)]
mm/pgtable: add PAE safety to __pte_offset_map()

There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g.  x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

Complement these pmdp_get_lockless_start() and pmdp_get_lockless_end()
helpers, used only locally in __pte_offset_map(), with a
pmdp_get_lockless_sync() synonym for tlb_remove_table_sync_one(), to send
the necessary interrupt at the right moment on those configs which do not
already send it.
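
On the affected configs the start/end helpers reduce to an IRQ-disabled
section, roughly (elsewhere they compile away to nothing):

  static unsigned long pmdp_get_lockless_start(void)
  {
          unsigned long irqflags;

          local_irq_save(irqflags);  /* blocks the TLB-flush interrupt */
          return irqflags;
  }

  static void pmdp_get_lockless_end(unsigned long irqflags)
  {
          local_irq_restore(irqflags);
  }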

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.  It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t is 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but that would need
a little more work, to retry if pmd_low is good for a page table but
pmd_high is non-zero from THP (and that might be making x86-specific
assumptions).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
Hugh Dickins [Wed, 12 Jul 2023 04:30:40 +0000 (21:30 -0700)]
mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

Patch series "mm: free retracted page table by RCU", v3.

Some mmap_lock avoidance i.e.  latency reduction.  Initially just for the
case of collapsing shmem or file pages to THPs: the usefulness of
MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it
currently requires.

Likely to be relied upon later in other contexts e.g.  freeing of empty
page tables (but that's not work I'm doing).  mmap_write_lock avoidance
when collapsing to anon THPs?  Perhaps, but again that's not work I've
done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

This patch (of 13):

Before putting them to use (several commits later), add rcu_read_lock() to
pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
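
In outline, the pairing looks like this sketch (simplified from the real
__pte_offset_map() and pte_unmap()):

  pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
  {
          rcu_read_lock();  /* page table cannot be RCU-freed under us */
          /* ... read the pmd, map and return the pte pointer ... */
  }

  static inline void pte_unmap(pte_t *pte)
  {
          kunmap_local(pte);
          rcu_read_unlock();
  }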

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: drop mas_first_entry()
Peng Zhang [Tue, 11 Jul 2023 03:54:44 +0000 (11:54 +0800)]
maple_tree: drop mas_first_entry()

The internal function mas_first_entry() is no longer used, so drop it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: replace mas_logical_pivot() with mas_safe_pivot()
Peng Zhang [Tue, 11 Jul 2023 03:54:43 +0000 (11:54 +0800)]
maple_tree: replace mas_logical_pivot() with mas_safe_pivot()

Replace mas_logical_pivot() with mas_safe_pivot() and drop
mas_logical_pivot(), since it won't be used anymore.  We can do this
because all nodes now have a node limit pivot (unless the node is full).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: update mt_validate()
Peng Zhang [Tue, 11 Jul 2023 03:54:42 +0000 (11:54 +0800)]
maple_tree: update mt_validate()

Instead of using mas_first_entry() to find the leftmost leaf, use a simple
loop.  Remove an unneeded check for the root node.  To make the error
message more accurate, check pivots first and then slots, because checking
slots depends on the node limit pivot to break the loop.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: make mas_validate_limits() check root node and node limit
Peng Zhang [Tue, 11 Jul 2023 03:54:41 +0000 (11:54 +0800)]
maple_tree: make mas_validate_limits() check root node and node limit

Update mas_validate_limits() to check the root node, check the node limit
pivot if there is enough room for it to exist, and check data_end.  Remove
the check for child existence, as it is done in mas_validate_child_slot().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: fix mas_validate_child_slot() to check last missed slot
Peng Zhang [Tue, 11 Jul 2023 03:54:40 +0000 (11:54 +0800)]
maple_tree: fix mas_validate_child_slot() to check last missed slot

Don't break the loop before checking the last slot.  Also check here
whether non-leaf nodes are missing children.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: make mas_validate_gaps() to check metadata
Peng Zhang [Tue, 11 Jul 2023 03:54:39 +0000 (11:54 +0800)]
maple_tree: make mas_validate_gaps() to check metadata

Make mas_validate_gaps() check whether the offset in the metadata points
to the largest gap.  Along the way, simplify this function.

Add the verification that gaps beyond the node limit are zero.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  maple_tree: don't use MAPLE_ARANGE64_META_MAX to indicate no gap
Peng Zhang [Tue, 11 Jul 2023 03:54:38 +0000 (11:54 +0800)]
maple_tree: don't use MAPLE_ARANGE64_META_MAX to indicate no gap

Patch series "Improve the validation for maple tree and some cleanup", v2.

This patch (of 7):

Do not use a special offset to indicate that there is no gap.  When there
is no gap, offset can point to any valid slot because its gap is 0.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memory: pass folio into do_page_mkwrite()
Sidhartha Kumar [Tue, 11 Jul 2023 05:35:44 +0000 (22:35 -0700)]
mm/memory: pass folio into do_page_mkwrite()

Saves one implicit call to compound_head().

I'm not sure if I should change the name of the function to
do_folio_mkwrite() and update the description comment to reference a
folio, as the vm_op is still called page_mkwrite.
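
The change itself is the usual page-to-folio conversion, roughly (a
sketch, not the exact diff):

  -static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
  +static vm_fault_t do_page_mkwrite(struct vm_fault *vmf, struct folio *folio)
   {
  -       lock_page(vmf->page);
  +       folio_lock(folio);

with the caller computing page_folio(vmf->page) once, instead of each
page_* helper recomputing the head page.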

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Suggested-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: fix race window when trying to get hugetlb folio
Miaohe Lin [Tue, 11 Jul 2023 05:50:16 +0000 (13:50 +0800)]
mm: memory-failure: fix race window when trying to get hugetlb folio

page_folio() is fetched before get_hwpoison_hugetlb_folio() is called,
without hugetlb_lock being held.  So the hugetlb page could be demoted
after page_folio() is fetched but before get_hwpoison_hugetlb_folio()
takes hugetlb_lock.  In that case, get_hwpoison_hugetlb_folio() will hold
an unexpected extra refcount on the hugetlb folio while leaving the
demoted page un-refcounted.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: fetch compound head after extra page refcnt is held
Miaohe Lin [Tue, 11 Jul 2023 05:50:15 +0000 (13:50 +0800)]
mm: memory-failure: fetch compound head after extra page refcnt is held

Page might become a THP or huge page, or be split, after the compound head
is fetched but before the page refcount is bumped.  So hpage might be a
tail page, leading to VM_BUG_ON_PAGE(PageTail(page)) in PageTransHuge().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: minor cleanup for comments and codestyle
Miaohe Lin [Tue, 11 Jul 2023 05:50:14 +0000 (13:50 +0800)]
mm: memory-failure: minor cleanup for comments and codestyle

Fix some wrong function names and grammar errors in comments.  Also remove
an unneeded space after for_each_process.  No functional change intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: remove unneeded header files
Miaohe Lin [Tue, 11 Jul 2023 05:50:13 +0000 (13:50 +0800)]
mm: memory-failure: remove unneeded header files

Remove some unneeded header files. No functional change intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: use local variable huge to check hugetlb page
Miaohe Lin [Tue, 11 Jul 2023 05:50:12 +0000 (13:50 +0800)]
mm: memory-failure: use local variable huge to check hugetlb page

Use the local variable huge to check whether the page is a hugetlb page,
avoiding multiple PageHuge() calls and saving CPU cycles.  PageHuge() will
be stable while the extra page refcount is held.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: don't account hwpoison_filter() filtered pages
Miaohe Lin [Tue, 11 Jul 2023 05:50:11 +0000 (13:50 +0800)]
mm: memory-failure: don't account hwpoison_filter() filtered pages

mf_generic_kill_procs() will return -EOPNOTSUPP when hwpoison_filter()
filters the dax page.  In that case, action_result() isn't expected to be
called to update mf_stats.  This results in inaccurate but benign memory
failure handling statistics.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: ensure moving HWPoison flag to the raw error pages
Miaohe Lin [Tue, 11 Jul 2023 05:50:10 +0000 (13:50 +0800)]
mm: memory-failure: ensure moving HWPoison flag to the raw error pages

If hugetlb_vmemmap_optimized is enabled, folio_clear_hugetlb_hwpoison()
called from try_memory_failure_hugetlb() won't transfer the HWPoison flag
to subpages while the folio's HWPoison flag is cleared.  So when trying to
free this hugetlb page into buddy, folio_clear_hugetlb_hwpoison() is not
called to move the HWPoison flag from the head page to the raw error
pages, even if hugetlb_vmemmap_optimized has now been cleared.  This
results in the HWPoisoned page being used again and a raw_hwp_page leak.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ac5fcde0a96a ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: memory-failure: remove unneeded PageHuge() check
Miaohe Lin [Tue, 11 Jul 2023 05:50:09 +0000 (13:50 +0800)]
mm: memory-failure: remove unneeded PageHuge() check

Patch series "A few fixup and cleanup patches for memory-failure", v2.

This series contains a few fixup patches to fix inaccurate mf_stats, fix
race window when trying to get hugetlb folio and so on.  Also there is
minor cleanup for comments and codestyle.  More details can be found in
the respective changelogs.

This patch (of 8):

PageHuge() check in me_huge_page() is just for potential problems.  Remove
it as it's actually dead code and won't catch anything.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memory_hotplug: document the signal_pending() check in offline_pages()
David Hildenbrand [Tue, 11 Jul 2023 17:40:50 +0000 (19:40 +0200)]
mm/memory_hotplug: document the signal_pending() check in offline_pages()

Let's update the documentation to state that any signal is sufficient, and
add a comment that checking for any signal, not only fatal ones, is
historical baggage: changing it now could break existing user space,
although that is unlikely.

For example, when an app provides a custom SIGALRM handler and triggers
memory offlining, the timeout cmd would no longer stop memory offlining,
because SIGALRM would no longer be considered a fatal signal.

Note that using signal_pending() instead of fatal_signal_pending() is
an anti-pattern, but slowly deprecating that behavior to eventually
change it in the far future is probably not worth the effort.  If this
ever becomes relevant for user-space, we might want to rethink.
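
For reference, the check in offline_pages() is the plain form, roughly:

  if (signal_pending(current)) {  /* any signal, not only fatal ones */
          ret = -EINTR;
          break;
  }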

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  HWPOISON: offline support: fix spelling in Documentation/ABI/
Randy Dunlap [Mon, 10 Jul 2023 05:22:23 +0000 (22:22 -0700)]
HWPOISON: offline support: fix spelling in Documentation/ABI/

Correct spelling problems as identified by codespell.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: facb6011f399 ("HWPOISON: Add soft page offline support")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/mm_init.c: mark check_for_memory() as __init
Haifeng Xu [Mon, 10 Jul 2023 09:37:50 +0000 (09:37 +0000)]
mm/mm_init.c: mark check_for_memory() as __init

The only caller of check_for_memory() is free_area_init(), which is
annotated with __init, so it should be safe to also mark the former as
__init.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  zsmalloc: remove obj_tagged()
Sergey Senozhatsky [Sun, 9 Jul 2023 02:56:26 +0000 (11:56 +0900)]
zsmalloc: remove obj_tagged()

obj_tagged() is not needed at this point, because objects can only have
one tag: OBJ_ALLOCATED_TAG.  We needed obj_tagged() for the zsmalloc LRU
implementation, which has now been removed.  Simplify zsmalloc code and
revert to the previous implementation that was in place before the
zsmalloc LRU series.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  selftests/mm: add uffd unit test for UFFDIO_POISON
Axel Rasmussen [Fri, 7 Jul 2023 21:55:40 +0000 (14:55 -0700)]
selftests/mm: add uffd unit test for UFFDIO_POISON

The test is pretty basic, and exercises UFFDIO_POISON straightforwardly.
We register a region with userfaultfd, in missing fault mode.  For each
fault, we either UFFDIO_COPY a zeroed page (odd pages) or UFFDIO_POISON
(even pages).  We do this mix to test "something like a real use case",
where guest memory would be some mix of poisoned and non-poisoned pages.

We read each page in the region, and assert that the odd pages are zeroed
as expected, and the even pages yield a SIGBUS as expected.

Why UFFDIO_COPY instead of UFFDIO_ZEROPAGE?  Because hugetlb doesn't
support UFFDIO_ZEROPAGE, and we don't want to have special case code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
Axel Rasmussen [Fri, 7 Jul 2023 21:55:39 +0000 (14:55 -0700)]
selftests/mm: refactor uffd_poll_thread to allow custom fault handlers

Previously, we had "one fault handler to rule them all", which used
several branches to deal with all of the scenarios required by all of the
various tests.

In upcoming patches, I plan to add a new test, which has its own slightly
different fault handling logic.  Instead of continuing to add cruft to the
existing fault handler, let's allow tests to define custom ones, separate
from other tests.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: userfaultfd: document and enable new UFFDIO_POISON feature
Axel Rasmussen [Fri, 7 Jul 2023 21:55:38 +0000 (14:55 -0700)]
mm: userfaultfd: document and enable new UFFDIO_POISON feature

Update the userfaultfd API to advertise this feature as part of feature
flags and supported ioctls (returned upon registration).

Add basic documentation describing the new feature.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
Axel Rasmussen [Fri, 7 Jul 2023 21:55:37 +0000 (14:55 -0700)]
mm: userfaultfd: support UFFDIO_POISON for hugetlbfs

The behavior here is the same as it is for anon/shmem.  This is done
separately because hugetlb pte marker handling is a bit different.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: userfaultfd: add new UFFDIO_POISON ioctl: fix
Hugh Dickins [Wed, 12 Jul 2023 01:27:17 +0000 (18:27 -0700)]
mm: userfaultfd: add new UFFDIO_POISON ioctl: fix

Smatch has observed that pte_offset_map_lock() is now allowed to fail, and
then ptl should not be unlocked.  Use -EAGAIN here like elsewhere.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: userfaultfd: add new UFFDIO_POISON ioctl
Axel Rasmussen [Fri, 7 Jul 2023 21:55:36 +0000 (14:55 -0700)]
mm: userfaultfd: add new UFFDIO_POISON ioctl

The basic idea here is to "simulate" memory poisoning for VMs.  A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS).  They expect that
once poisoned, pages can never become "un-poisoned".  So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as
quickly as possible.  So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
  either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
  communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
  so any future accesses will SIGBUS. Because the pte is now "present",
  future accesses won't generate more userfaultfd events, they'll just
  SIGBUS directly.

UFFDIO_POISON does not handle unmapping previously-present PTEs.  This
isn't needed, because during live migration we want to intercept all
accesses with userfaultfd (not just writes, so WP mode isn't useful for
this).  So whether minor or missing mode is being used (or both), the PTE
won't be present in any case, so handling that case isn't needed.

Similarly, UFFDIO_POISON won't replace existing PTE markers.  This might
be okay to do, but it seems to be safer to just refuse to overwrite any
existing entry (like a UFFD_WP PTE marker).
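
From userspace, responding to a fault with UFFDIO_POISON looks roughly
like this sketch (treat details as illustrative):

  struct uffdio_poison args = {
          .range = {
                  .start = fault_addr & ~(page_size - 1),
                  .len   = page_size,
          },
          .mode  = 0,  /* or UFFDIO_POISON_MODE_DONTWAKE */
  };

  if (ioctl(uffd, UFFDIO_POISON, &args))
          perror("UFFDIO_POISON");  /* args.updated reports progress */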

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: userfaultfd: extract file size check out into a helper
Axel Rasmussen [Fri, 7 Jul 2023 21:55:35 +0000 (14:55 -0700)]
mm: userfaultfd: extract file size check out into a helper

This code is already duplicated twice, and UFFDIO_POISON will do the same
check a third time.  So, it's worth extracting into a helper to save
repetitive lines of code.
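
The helper ends up along these lines (a sketch; treat the name and exact
shape as illustrative):

  static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
                                   unsigned long dst_addr)
  {
          struct inode *inode;
          pgoff_t offset, max_off;

          if (!dst_vma->vm_file)
                  return false;

          inode = dst_vma->vm_file->f_inode;
          offset = linear_page_index(dst_vma, dst_addr);
          max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
          return offset >= max_off;  /* destination is past EOF */
  }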

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: userfaultfd: check for start + len overflow in validate_range
Axel Rasmussen [Fri, 7 Jul 2023 21:55:34 +0000 (14:55 -0700)]
mm: userfaultfd: check for start + len overflow in validate_range

Most userfaultfd ioctls take a `start + len` range as an argument.  We
have the validate_range helper to check that such ranges are valid.
However, some (but not all!) ioctls *also* check that `start + len`
doesn't wrap around (overflow).

Just check for this in validate_range.  This saves some repetitive code,
and adds the check to some ioctls which weren't bothering to check for it
before.
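
The centralized check is the classic wrap-around test, roughly:

  /* sketch: reject ranges where start + len wraps the address space */
  if (start + len < start)
          return -EINVAL;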

[[email protected]: call validate_range() on the src range too]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix src/dst validation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm-make-pte_marker_swapin_error-more-general-fix
Andrew Morton [Sun, 9 Jul 2023 01:03:23 +0000 (18:03 -0700)]
mm-make-pte_marker_swapin_error-more-general-fix

fix CONFIG_MMU=n build

Cc: Axel Rasmussen <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: make PTE_MARKER_SWAPIN_ERROR more general
Axel Rasmussen [Fri, 7 Jul 2023 21:55:33 +0000 (14:55 -0700)]
mm: make PTE_MARKER_SWAPIN_ERROR more general

Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
v4.

This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.

This patch (of 8):

Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
UFFDIO_POISON, so make various preparations for that:

First, rename it to just PTE_MARKER_POISONED.  The "SWAPIN" can be
confusing since we're going to re-use it for something not really related
to swap.  This can be particularly confusing for things like hugetlbfs,
which doesn't support swap whatsoever.  Also rename various helper
functions.

Next, fix pte marker copying for hugetlbfs.  Previously, it would WARN on
seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
But, since we're going to re-use it, we want it to go ahead and copy it
just like non-hugetlbfs memory does today.  Since the code to do this is
more complicated now, pull it out into a helper which can be re-used in
both places.  While we're at it, also make it slightly more explicit in
its handling of e.g.  uffd wp markers.

For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
error entry, return VM_FAULT_HWPOISON.  For most cases this change doesn't
matter, e.g.  a userspace program would receive a SIGBUS either way.  But
for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
box, instead of giving a SIGBUS to the hypervisor and requiring it to
somehow inject an MCE.

Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
VM_FAULT_HWPOISON_LARGE in such cases.  Note that this can't happen today
because the lack of swap support means we'll never end up with such a PTE
anyway, but this behavior will be needed once such entries *can* show up
via UFFDIO_POISON.
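
In code, the fault-path handling described above amounts to roughly this
sketch of the marker check:

  if (marker & PTE_MARKER_POISONED)
          return VM_FAULT_HWPOISON;  /* MCE-able, not just SIGBUS */

with the hugetlbfs path returning VM_FAULT_HWPOISON_LARGE for the same
marker.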

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gaosheng Cui <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memcg: minor cleanup for MEM_CGROUP_ID_MAX
Miaohe Lin [Sat, 8 Jul 2023 02:33:04 +0000 (10:33 +0800)]
mm/memcg: minor cleanup for MEM_CGROUP_ID_MAX

MEM_CGROUP_ID_MAX is only used when CONFIG_MEMCG is configured.  So remove
unneeded !CONFIG_MEMCG variant.  Also it's only used in
mem_cgroup_alloc(), so move it from memcontrol.h to memcontrol.c.  And
further define it as:

  #define MEM_CGROUP_ID_MAX ((1UL << MEM_CGROUP_ID_SHIFT) - 1)

so if someone changes MEM_CGROUP_ID_SHIFT in the future, then
MEM_CGROUP_ID_MAX will be updated accordingly, as suggested by Muchun.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memory: convert do_read_fault() to use folios
Sidhartha Kumar [Thu, 6 Jul 2023 16:38:47 +0000 (09:38 -0700)]
mm/memory: convert do_read_fault() to use folios

Saves one implicit call to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memory: convert do_shared_fault() to folios
Sidhartha Kumar [Thu, 6 Jul 2023 16:38:46 +0000 (09:38 -0700)]
mm/memory: convert do_shared_fault() to folios

Saves three implicit calls to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memory: convert wp_page_shared() to use folios
Sidhartha Kumar [Thu, 6 Jul 2023 16:38:45 +0000 (09:38 -0700)]
mm/memory: convert wp_page_shared() to use folios

Saves six implicit calls to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/memory: convert do_page_mkwrite() to use folios
Sidhartha Kumar [Thu, 6 Jul 2023 16:38:44 +0000 (09:38 -0700)]
mm/memory: convert do_page_mkwrite() to use folios

Saves one implicit call to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: ZhangPeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: hugetlb_vmemmap: fix a race between vmemmap pmd split
Muchun Song [Fri, 7 Jul 2023 03:38:59 +0000 (11:38 +0800)]
mm: hugetlb_vmemmap: fix a race between vmemmap pmd split

The local variable @page in __split_vmemmap_huge_pmd(), used to obtain a
pmd page without holding page_table_lock, may possibly get the page table
page instead of a huge pmd page.

The effect may show up in set_pte_at(), since we may pass an invalid page
struct; if set_pte_at() wants to access the page struct (e.g.  when
CONFIG_PAGE_TABLE_CHECK is enabled), it may crash the kernel.

So fix it.  And inline __split_vmemmap_huge_pmd() since it only has one
user.
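
The general shape of such a fix is to re-check the pmd under the lock
before touching its page, as in this hedged sketch:

  spin_lock(&init_mm.page_table_lock);
  if (likely(pmd_leaf(*pmd))) {
          /* still a huge pmd: only now is pmd_page() the head page */
          struct page *head = pmd_page(*pmd);
          /* ... split the huge pmd using head ... */
  }
  spin_unlock(&init_mm.page_table_lock);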

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations")
Signed-off-by: Muchun Song <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm/sparse: remove redundant judgments from macro for_each_present_section_nr
liuq [Fri, 7 Jul 2023 06:05:01 +0000 (14:05 +0800)]
mm/sparse: remove redundant judgments from macro for_each_present_section_nr

next_present_section_nr() has already ensured that
'section_nr<=__highest_present_section_nr', so this check is removed.
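
After the change, the macro reduces to something like this sketch of
mm/sparse.c's for_each_present_section_nr(), with the redundant
upper-bound test dropped:

  #define for_each_present_section_nr(start, section_nr)         \
          for (section_nr = next_present_section_nr(start - 1);  \
               section_nr != -1;                                 \
               section_nr = next_present_section_nr(section_nr))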

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: liuq <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago  mm: compaction: skip the memory hole rapidly when isolating free pages
Baolin Wang [Fri, 7 Jul 2023 08:51:47 +0000 (16:51 +0800)]
mm: compaction: skip the memory hole rapidly when isolating free pages

Just like commit 9721fd82351d ("mm: compaction: skip memory hole
rapidly when isolating migratable pages"), I can see it will also take
more time to skip the larger memory hole (range: 0x1000000000 -
0x1800000000) when isolating free pages on my machine with the memory
layout below.  So, like commit 9721fd82351d, add a new helper to skip the
memory hole rapidly, which can reduce the time consumed from about 70us
to less than 1us.

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x0000000fffffffff]
[    0.000000]   node   0: [mem 0x0000001800000000-0x0000001fa3c7ffff]
[    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7ffffff]

[[email protected]: avoid missing last page block in section after skip offline sections]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/d2ba7e41ee566309b594311207ffca736375fc16.1688715750.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: compaction: use the correct type of list for free pages
Baolin Wang [Fri, 7 Jul 2023 08:51:46 +0000 (16:51 +0800)]
mm: compaction: use the correct type of list for free pages

Use page->buddy_list instead of page->lru to clarify the correct type
of list for free pages.
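
A hedged sketch of the pattern (page->buddy_list and page->lru share
storage in a union in struct page, so behaviour is unchanged; only the
intent becomes explicit):

    /* before: free pages threaded through the generic lru member */
    list_add(&page->lru, &area->free_list[migratetype]);

    /* after: the dedicated alias for buddy-allocator freelists */
    list_add(&page->buddy_list, &area->free_list[migratetype]);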

Link: https://lkml.kernel.org/r/b21cd8e2e32b9a1d9bc9e43ebf8acaf35e87f8df.1688715750.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: fix some kernel-doc comments
Yang Li [Fri, 7 Jul 2023 09:00:34 +0000 (17:00 +0800)]
mm: fix some kernel-doc comments

Add descriptions of @mm_wr_locked and @mm
to silence the warnings:

mm/memory.c:1716: warning: Function parameter or member 'mm_wr_locked' not described in 'unmap_vmas'
mm/memory.c:5110: warning: Function parameter or member 'mm' not described in 'mm_account_fault'

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: correct stale comment of function check_pte
Kemeng Shi [Fri, 7 Jul 2023 15:39:53 +0000 (23:39 +0800)]
mm: correct stale comment of function check_pte

Commit 2aff7a4755bed ("mm: Convert page_vma_mapped_walk to work on PFNs")
replaced the page with pfns in the page_vma_mapped_walk structure and
updated "@pvmw->page" to "@pvmw->pfn" in the comment of
page_vma_mapped_walk().

This patch updates the stale "page" to "pfn" in the comment of
check_pte().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm, netfs, fscache: stop read optimisation when folio removed from pagecache
David Howells [Wed, 28 Jun 2023 10:48:52 +0000 (11:48 +0100)]
mm, netfs, fscache: stop read optimisation when folio removed from pagecache

Fscache has an optimisation by which reads from the cache are skipped
until we know that (a) there's data there to be read and (b) that data
isn't entirely covered by pages resident in the netfs pagecache.  This is
done with two flags manipulated by fscache_note_page_release():

if (...
    test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) &&
    test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);

where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to
indicate that netfslib should download from the server or clear the page
instead.

The fscache_note_page_release() function is intended to be called from
->releasepage() - but that only gets called if PG_private or PG_private_2
is set - and currently the former is at the discretion of the network
filesystem and the latter is only set whilst a page is being written to
the cache, so sometimes we miss clearing the optimisation.

Fix this by following Willy's suggestion[1] and adding an address_space
flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call
->release_folio() if it's set, even if PG_private or PG_private_2 aren't
set.

Note that this would require folio_test_private() and page_has_private()
to become more complicated.  To avoid that, in the places[*] where these
are used to conditionalise calls to filemap_release_folio() and
try_to_release_page(), the tests are removed, those functions are just
jumped to unconditionally and the test is performed there.

[*] There are some exceptions in vmscan.c where the check guards more than
just a call to the releaser.  I've added a function, folio_needs_release()
to wrap all the checks for that.

AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from
fscache and cleared in ->evict_inode() before truncate_inode_pages_final()
is called.

Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared
and the optimisation cancelled if a cachefiles object already contains data
when we open it.

[[email protected]: call folio_mapping() inside folio_needs_release()]
Link: https://github.com/DaveWysochanskiRH/kernel/commit/902c990e311120179fa5de99d68364b2947b79ec
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page")
Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines")
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Dave Wysochanski <[email protected]>
Reported-by: Rohith Surabattula <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Tested-by: SeongJae Park <[email protected]>
Cc: Daire Byrne <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Steve French <[email protected]>
Cc: Shyam Prasad N <[email protected]>
Cc: Rohith Surabattula <[email protected]>
Cc: Dave Wysochanski <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jingbo Xu <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Xiubo Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: merge folio_has_private()/filemap_release_folio() call pairs
David Howells [Wed, 28 Jun 2023 10:48:51 +0000 (11:48 +0100)]
mm: merge folio_has_private()/filemap_release_folio() call pairs

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache.  The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix.  Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
add a check for this.

This patch (of 2):

Make filemap_release_folio() check folio_has_private().  Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vmscan.c where this can't so easily be
done.  In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check whether a folio needs release is provided
for those places in the mm/ directory that still need to do it.  Its
condition will acquire additional parts in a future patch.
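
Roughly the shape the wrapper takes (a hedged sketch; the
mapping_release_always() part only becomes meaningful once the next
patch adds AS_RELEASE_ALWAYS):

    static inline bool folio_needs_release(struct folio *folio)
    {
        struct address_space *mapping = folio_mapping(folio);

        return folio_has_private(folio) ||
               (mapping && mapping_release_always(mapping));
    }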

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Rohith Surabattula <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Steve French <[email protected]>
Cc: Shyam Prasad N <[email protected]>
Cc: Rohith Surabattula <[email protected]>
Cc: Dave Wysochanski <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Xiubo Li <[email protected]>
Cc: Jingbo Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agormap: pass the folio to __page_check_anon_rmap()
Matthew Wilcox (Oracle) [Thu, 6 Jul 2023 19:52:51 +0000 (20:52 +0100)]
rmap: pass the folio to __page_check_anon_rmap()

The lone caller already has the folio, so pass it in instead of deriving
it from the page again.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: cma: print cma name as well in cma_alloc debug
Pintu Kumar [Thu, 6 Jul 2023 18:33:34 +0000 (00:03 +0530)]
mm: cma: print cma name as well in cma_alloc debug

CMA allocation can happen either from global cma or from dedicated cma
region.

Thus it is helpful to print the cma name as well during initial
debugging, to confirm whether cma regions are getting initialized or not.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pintu Kumar <[email protected]>
Signed-off-by: Pintu Agarwal <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: remove obsolete comment above struct per_cpu_pages
Miaohe Lin [Thu, 6 Jul 2023 09:24:41 +0000 (17:24 +0800)]
mm: remove obsolete comment above struct per_cpu_pages

Since commit 01b44456a7aa ("mm/page_alloc: replace local_lock with normal
spinlock"), per_cpu_pages is protected by normal spinlock.  Remove the
obsolete comment as it's not that helpful.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomemory tier: rename destroy_memory_type() to put_memory_type()
Miaohe Lin [Thu, 6 Jul 2023 06:39:05 +0000 (14:39 +0800)]
memory tier: rename destroy_memory_type() to put_memory_type()

It appears that destroy_memory_type() isn't a very good name because we
usually will not free the memory_type here.  So rename it to a more
appropriate name i.e.  put_memory_type().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Suggested-by: Huang, Ying <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Xiao Yang <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftests/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
Jeff Xu [Wed, 5 Jul 2023 06:33:15 +0000 (06:33 +0000)]
selftests/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED

Add a selftest for the case where sysctl vm.memfd_noexec is 2
(MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED):

memfd_create(.., MFD_EXEC) should fail in this case.
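
A minimal userspace probe of the expected behaviour (a hedged sketch,
not the selftest itself; the MFD_EXEC fallback define mirrors
include/uapi/linux/memfd.h):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MFD_EXEC
    #define MFD_EXEC 0x0010U
    #endif

    int main(void)
    {
        /* With vm.memfd_noexec=2 this creation must fail. */
        int fd = memfd_create("exec-probe", MFD_EXEC);

        if (fd < 0)
            perror("memfd_create(MFD_EXEC) rejected as expected");
        else
            printf("unexpected success: fd=%d\n", fd);
        return 0;
    }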

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Dominique Martinet <[email protected]>
Closes: https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/
Signed-off-by: Jeff Xu <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jorge Lucangeli Obes <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
Jeff Xu [Wed, 5 Jul 2023 06:33:14 +0000 (06:33 +0000)]
mm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED

Patch series "mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED", v2.

When sysctl vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED),
memfd_create(.., MFD_EXEC) should fail.

This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is defined -
"memfd_create() without MFD_NOEXEC_SEAL will be rejected"

Thanks to Dominique Martinet <[email protected]> who reported the bug.
see [1] for context.

[1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/

This patch (of 2):

When vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED),
memfd_create(.., MFD_EXEC) should fail.

This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is
defined - "memfd_create() without MFD_NOEXEC_SEAL will be rejected"

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Reported-by: Dominique Martinet <[email protected]>
Closes: https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Jeff Xu <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jorge Lucangeli Obes <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agofs: drop_caches: draining pages before dropping caches
Andrew Yang [Fri, 30 Jun 2023 09:22:02 +0000 (17:22 +0800)]
fs: drop_caches: draining pages before dropping caches

We expect a file page access after dropping caches to be a major fault,
but sometimes it's still a minor fault.  That's because a file page can't
be dropped if it's in a per-cpu pagevec.  Drain all pages from the
per-cpu pagevecs to the LRU list before trying to drop caches.
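
The effect can be observed from userspace with getrusage() (a hedged
demo, not part of the patch; "testfile" is an assumed existing file, and
major faults only show up reliably once the pagevecs have been drained):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        struct rusage before, after;
        int fd = open("testfile", O_RDONLY);
        char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

        /* assume "echo 1 > /proc/sys/vm/drop_caches" ran just before */
        getrusage(RUSAGE_SELF, &before);
        (void)*(volatile char *)p;      /* expect a major fault here */
        getrusage(RUSAGE_SELF, &after);
        printf("major faults: %ld\n",
               after.ru_majflt - before.ru_majflt);
        munmap(p, 4096);
        close(fd);
        return 0;
    }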

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrew Yang <[email protected]>
Cc: Al Viro <[email protected]>
Cc: AngeloGioacchino Del Regno <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Matthias Brugger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomemcg: drop kmem.limit_in_bytes
Michal Hocko [Tue, 4 Jul 2023 11:52:40 +0000 (13:52 +0200)]
memcg: drop kmem.limit_in_bytes

kmem.limit_in_bytes (v1 way to limit kernel memory usage) has been
deprecated since 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") merged in 5.16.  We haven't heard about any serious
users since then but it seems that the mere presence of the file is
causing more harm than good.  We (SUSE) have had several bug reports from
customers where Docker based containers started to fail because a write to
kmem.limit_in_bytes has failed.

This was unexpected because runc code only expects ENOENT (kmem disabled)
or EBUSY (tasks already running within cgroup).  So a new error code was
unexpected and the whole container startup failed.  This has been later
addressed by
https://github.com/opencontainers/runc/commit/52390d68040637dfc77f9fda6bbe70952423d380
so current Docker runtimes do not suffer from the problem anymore.  There
are still older versions of Docker in use, though, and they are likely
hard to get rid of completely.

Address this by wiping out the file completely, effectively getting back
to the pre-4.5 era and the CONFIG_MEMCG_KMEM=n configuration.

I would recommend backporting to stable trees which have picked up
58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes").

[[email protected]: restore _KMEM switch case]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: page_alloc: avoid false page outside zone error info
Miaohe Lin [Tue, 4 Jul 2023 11:18:23 +0000 (19:18 +0800)]
mm: page_alloc: avoid false page outside zone error info

If pfn is outside the zone boundaries in the first round, ret will be set
to 1.  But if pfn is changed to inside the zone boundaries in the zone
span seqretry path, ret is still 1, leading to false "page outside zone"
error info.

This is from code inspection.  The race window should be really small thus
hard to trigger in real world.

[[email protected]: code simplification, per Matthew]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bdc8cb984576 ("[PATCH] memory hotplug locking: zone span seqlock")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftest: add a testcase of ksm zero pages
xu xin [Tue, 13 Jun 2023 03:09:47 +0000 (11:09 +0800)]
selftest: add a testcase of ksm zero pages

Add a function test_unmerge_zero_page() to test the functionality on
unsharing and counting ksm-placed zero pages and counting of this patch
series.

test_unmerge_zero_page() actually contains four test subjects:
(1) whether the count of ksm zero pages updates correctly after merging;
(2) whether the count of ksm zero pages updates correctly after
    unmerging by madvise(...MADV_UNMERGEABLE);
(3) whether the count of ksm zero pages updates correctly after
    unmerging by triggering a write fault;
(4) whether ksm zero pages are really unmerged.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: xu xin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Xiaokai Ran <[email protected]>
Reviewed-by: Yang Yang <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Xuexin Jiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoksm: consider KSM-placed zeropages when calculating KSM profit
xu xin [Tue, 13 Jun 2023 03:09:42 +0000 (11:09 +0800)]
ksm: consider KSM-placed zeropages when calculating KSM profit

When use_zero_pages is enabled, the calculation of ksm profit is not
correct because ksm zero pages are not counted in.  So update the
calculation of KSM profit, including the documentation.
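
Roughly, the documented estimate becomes (a hedged paraphrase of the
updated Documentation/admin-guide/mm/ksm.rst, not a verbatim quote):

    general_profit =~ (pages_sharing + ksm_zero_pages) * sizeof(page) -
                      all_rmap_items * sizeof(rmap_item);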

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: xu xin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Xiaokai Ran <[email protected]>
Cc: Yang Yang <[email protected]>
Cc: Jiang Xuexin <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoksm: add ksm zero pages for each process
xu xin [Tue, 13 Jun 2023 03:09:38 +0000 (11:09 +0800)]
ksm: add ksm zero pages for each process

As the number of ksm zero pages is not included in ksm_merging_pages per
process when use_zero_pages is enabled, it's unclear how many actual
pages are merged by KSM.  To let users accurately estimate their memory
demands when unsharing KSM zero-pages, it's necessary to show KSM
zero-pages per process.  In addition, it helps users to know the actual
KSM profit, because KSM-placed zero pages also benefit from KSM.

Since accurate unsharing of KSM-placed zero pages has been achieved,
tracking the merging and unmerging of empty pages is no longer
difficult.

Since we already have /proc/<pid>/ksm_stat, just add the
'ksm_zero_pages' information to it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: xu xin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Xiaokai Ran <[email protected]>
Reviewed-by: Yang Yang <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Xuexin Jiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoksm: count all zero pages placed by KSM
xu xin [Tue, 13 Jun 2023 03:09:34 +0000 (11:09 +0800)]
ksm: count all zero pages placed by KSM

As pages_sharing and pages_shared don't include the number of zero pages
merged by KSM, we cannot know how many pages are zero pages placed by KSM
when use_zero_pages is enabled, which means KSM is not transparent about
all the pages it actually merges.  In the early days of use_zero_pages,
zero pages could not be unshared by means like MADV_UNMERGEABLE, so it
was hard to count how many times one of those zeropages was then
unmerged.

But now, unsharing KSM-placed zero pages accurately has been achieved, so
we can easily count both how many times a page full of zeroes was merged
with the zero page and how many times one of those pages was then
unmerged.  This helps to estimate memory demands when each and every
shared page could get unshared.

So we add ksm_zero_pages under /sys/kernel/mm/ksm/ to show the number
of all zero pages placed by KSM. Meanwhile, we update the Documentation.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: xu xin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Xuexin Jiang <[email protected]>
Reviewed-by: Xiaokai Ran <[email protected]>
Reviewed-by: Yang Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoksm: support unsharing KSM-placed zero pages
xu xin [Tue, 13 Jun 2023 03:09:28 +0000 (11:09 +0800)]
ksm: support unsharing KSM-placed zero pages

Patch series "ksm: support tracking KSM-placed zero-pages", v10.

The core idea of this patch set is to enable users to perceive the number
of pages merged by KSM, regardless of whether the use_zero_pages switch
has been turned on, so that users can know how much of the free memory
increase is really due to their madvise(MERGEABLE) actions.  The problem
is that when use_zero_pages is enabled, all empty pages are merged with
the kernel zero pages instead of with each other (as they would be with
use_zero_pages disabled), and these zero pages are then no longer
monitored by KSM.

The motivation for this is described at:
https://lore.kernel.org/lkml/202302100915227721315@zte.com.cn/

In one word, we hope to implement support for tracking KSM-placed zero
pages without affecting the use_zero_pages feature, so that app
developers can also benefit from knowing the actual KSM profit, by
getting KSM-placed zero pages, to eventually optimize applications when
/sys/kernel/mm/ksm/use_zero_pages is enabled.

This patch (of 5):

When use_zero_pages of ksm is enabled, madvise(addr, len,
MADV_UNMERGEABLE) and other ways (like writing 2 to /sys/kernel/mm/ksm/run)
of triggering unsharing will *not* actually unshare the shared zeropages
placed by KSM (which is against the MADV_UNMERGEABLE documentation).  As
these KSM-placed zero pages are out of the control of KSM, the related
counts of ksm pages don't expose how many zero pages are placed by KSM
(these special zero pages are different from those initially mapped zero
pages, because the zero pages mapped to MADV_UNMERGEABLE areas are
expected to be complete and unshared pages).

To avoid blindly unsharing all shared zero pages in applicable VMAs, the
patch uses pte_mkdirty (which is architecture-dependent) to mark
KSM-placed zero pages.  Thus, MADV_UNMERGEABLE will only unshare those
KSM-placed zero pages.

In addition, we'll reuse this mechanism to reliably identify KSM-placed
zero pages and properly account for them (e.g., calculating the KSM
profit that includes zeropages) in later patches.

The patch will not degrade the performance of use_zero_pages, as it
doesn't change the way empty pages are merged in the use_zero_pages
feature.
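
A hedged userspace sketch of the flow this enables (assumes
/sys/kernel/mm/ksm/use_zero_pages is 1 and ksmd has scanned the area;
error handling elided):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16 * 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(buf, 0, len);                /* pages full of zeroes  */
        madvise(buf, len, MADV_MERGEABLE);  /* merged with zero page */
        /* ... after ksmd has merged them, this now really unshares
         * the KSM-placed zeropages, as the documentation promises:
         */
        madvise(buf, len, MADV_UNMERGEABLE);
        return 0;
    }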

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: xu xin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Xuexin Jiang <[email protected]>
Reviewed-by: Xiaokai Ran <[email protected]>
Reviewed-by: Yang Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/migrate_device: try to handle swapcache pages
Mika Penttilä [Wed, 7 Jun 2023 17:29:44 +0000 (20:29 +0300)]
mm/migrate_device: try to handle swapcache pages

Migrating file pages and swapcache pages into device memory is not
supported.  Try to get rid of the swap cache, and if successful, go ahead
as with other anonymous pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mika Penttilä <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Ralph Campbell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftests: cgroup: add zswap-memcg unwanted writeback test
Domenico Cerasuolo [Wed, 21 Jun 2023 15:35:48 +0000 (17:35 +0200)]
selftests: cgroup: add zswap-memcg unwanted writeback test

Add a test to verify that when a memcg hits its limit in zswap, it
doesn't trigger an unwanted writeback that would result in pages not
owned by that memcg being sent to disk, even if zswap isn't full.  This
was fixed by commit 0bdf0efa180a ("zswap: do not shrink if cgroup may
not zswap").

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Domenico Cerasuolo <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftests: cgroup: add test_zswap with no kmem bypass test
Domenico Cerasuolo [Wed, 21 Jun 2023 15:35:47 +0000 (17:35 +0200)]
selftests: cgroup: add test_zswap with no kmem bypass test

Add a cgroup selftest that verifies memcg charging in zswap.  The original
issue was that kmem bypass was applied to pages swapped out to zswap by
kswapd, resulting in zswapped memory not being charged.  It was fixed by
commit cd08d80ecdac("mm: correctly charge compressed memory to its
memcg").

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Domenico Cerasuolo <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftests: cgroup: add test_zswap program
Domenico Cerasuolo [Wed, 21 Jun 2023 15:35:46 +0000 (17:35 +0200)]
selftests: cgroup: add test_zswap program

Patch series "selftests: cgroup: add zswap test program".

This series adds 2 zswap-related selftests that verify known and fixed
issues.  A new dedicated test program (test_zswap) is proposed since the
test cases are specific to zswap and the program hosts zswap-specific
helpers.

The first patch adds the (empty) test program, while the other 2 add an
actual test function each.

This patch (of 3):

Add empty cgroup-zswap self test scaffold program, test functions to be
added in the next commits.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Domenico Cerasuolo <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/page_alloc: use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
Sebastian Andrzej Siewior [Fri, 23 Jun 2023 20:15:17 +0000 (22:15 +0200)]
mm/page_alloc: use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().

__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock().  This is troublesome and leads to problems on PREEMPT_RT.
The problem is that the inner spinlock_t becomes a sleeping lock on
PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in one
step.  printk_deferred_enter() has to be invoked in a non-migratable
context to ensure that deferred printing is enabled and disabled on the
same CPU.  This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
local_irq_disable();
printk_deferred_enter();
write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked.  The
only origin of such a printk() can be a lockdep splat because the lockdep
annotation happens after the sequence count is incremented.  This is
exceptional and subject to change.

It was also pointed out that PREEMPT_RT can be affected by the printk
problem since its write_seqlock_irqsave() does not really disable
interrupts.
This isn't the case because PREEMPT_RT's printk implementation differs
from the mainline implementation in two important aspects:

- Printing happens in a dedicated thread and not during the
  invocation of printk().
- In emergency cases where synchronous printing is used, a different
  driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.
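
The shape of the change (a hedged sketch of __build_all_zonelists(), not
the exact diff):

    /* before: two steps, the inner spinlock is taken irqs-off */
    local_irq_save(flags);
    write_seqlock(&zonelist_update_seq);

    /* after: one step, and defer printk only once the lock is held */
    write_seqlock_irqsave(&zonelist_update_seq, flags);
    printk_deferred_enter();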

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: John Ogness <[email protected]>
Cc: Luis Claudio R. Goncalves <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agozsmalloc: remove zs_compact_control
Minchan Kim [Sat, 24 Jun 2023 05:12:16 +0000 (14:12 +0900)]
zsmalloc: remove zs_compact_control

__zs_compact() always puts src_zspage back into the class list after
migrate_zspage().  Thus, we don't need to keep the last position of
src_zspage any more.  Let's remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agozsmalloc: move migration destination zspage inuse check
Sergey Senozhatsky [Sat, 24 Jun 2023 05:12:15 +0000 (14:12 +0900)]
zsmalloc: move migration destination zspage inuse check

The destination zspage fullness check needs to be done after
zs_object_copy() because that's where the source and destination zspages'
fullness changes.  Checking destination zspage fullness before
zs_object_copy() may cause migration to loop through source zspage
sub-pages scanning for allocated objects, just to find out at the end
that the destination zspage is full.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agozsmalloc: do not scan for allocated objects in empty zspage
Sergey Senozhatsky [Sat, 24 Jun 2023 05:12:14 +0000 (14:12 +0900)]
zsmalloc: do not scan for allocated objects in empty zspage

Patch series "zsmalloc: small compaction improvements", v2.

A tiny series that can reduce the number of find_alloced_obj() invocations
(which perform a linear scan of sub-page) during compaction.  Inspired by
Alexey Romanov's findings.

This patch (of 3):

zspage migration can terminate as soon as it moves the last allocated
object from the source zspage.  Add a simple helper zspage_empty() that
tests zspage ->inuse on each migration iteration.
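
The helper is essentially a one-liner (hedged sketch):

    static bool zspage_empty(struct zspage *zspage)
    {
        return get_zspage_inuse(zspage) == 0;
    }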

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Suggested-by: Alexey Romanov <[email protected]>
Reviewed-by: Alexey Romanov <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/mm_init.c: remove obsolete macro HASH_SMALL
Miaohe Lin [Sun, 25 Jun 2023 02:13:23 +0000 (10:13 +0800)]
mm/mm_init.c: remove obsolete macro HASH_SMALL

HASH_SMALL only works when the numentries parameter is 0.  But the sole
caller, futex_init(), never calls alloc_large_system_hash() with
numentries set to 0.  So HASH_SMALL is obsolete; remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Cc: André Almeida <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE
liuq [Sun, 25 Jun 2023 03:16:56 +0000 (11:16 +0800)]
mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE

The current calculation of min_free_kbytes only uses ZONE_DMA and
ZONE_NORMAL pages, but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will
also take a share of min_free_kbytes.  This causes the min watermark of
ZONE_NORMAL to be too small in the presence of ZONE_MOVABLE.

__GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone
pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in
__setup_per_zone_wmarks().
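
A hedged sketch of the resulting logic in __setup_per_zone_wmarks()
(mirroring the existing ZONE_HIGHMEM case; not the exact diff):

    if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
        /* cap pages_min to a small value instead of a
         * proportional share of min_free_kbytes
         */
        unsigned long min_pages;

        min_pages = zone_managed_pages(zone) / 1024;
        min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
        zone->_watermark[WMARK_MIN] = min_pages;
    } else {
        zone->_watermark[WMARK_MIN] = tmp;
    }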

On my testing machine with 16GB of memory (transparent hugepage is turned
off by default, and movablecore=12G is configured), the following is a
comparison of watermark_min:

                 no patch    with patch
ZONE_DMA                1             8
ZONE_DMA32            151           709
ZONE_NORMAL           233          1113
ZONE_MOVABLE         1434           128
min_free_kbytes      7288          7326

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: liuq <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agofs: convert block_commit_write to return void
Bean Huo [Mon, 26 Jun 2023 05:55:18 +0000 (07:55 +0200)]
fs: convert block_commit_write to return void

block_commit_write() always returns 0, this patch changes it to return
void.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Bean Huo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Theodore Ts'o <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Luís Henriques <[email protected]>
Cc: Mark Fasheh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agofs/buffer: clean up block_commit_write
Bean Huo [Mon, 26 Jun 2023 05:55:17 +0000 (07:55 +0200)]
fs/buffer: clean up block_commit_write

Originally, the inode was used to get blksize; after commit 45bce8f3e343
("fs/buffer.c: make block-size be per-page and protected by the page
lock"), __block_commit_write() no longer uses this inode parameter.

[[email protected]: remove now-unused local `inode']
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Bean Huo <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Luís Henriques <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: memory-failure: remove unneeded 'inline' annotation
Miaohe Lin [Mon, 26 Jun 2023 11:43:43 +0000 (19:43 +0800)]
mm: memory-failure: remove unneeded 'inline' annotation

Remove unneeded 'inline' annotation from num_poisoned_pages_inc() and
num_poisoned_pages_sub().  No functional change intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomemory tier: use helper function destroy_memory_type()
Miaohe Lin [Mon, 26 Jun 2023 12:10:53 +0000 (20:10 +0800)]
memory tier: use helper function destroy_memory_type()

Use helper function destroy_memory_type() to release memtype instead of
open code it to help improve code readability a bit.  No functional change
intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Wei Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm: memory-failure: remove unneeded page state check in shake_page()
Miaohe Lin [Wed, 28 Jun 2023 01:49:29 +0000 (09:49 +0800)]
mm: memory-failure: remove unneeded page state check in shake_page()

Remove unneeded PageLRU(p) and is_free_buddy_page(p) check as slab caches
are not shrunk now.  This check can be added back when a lightweight range
based shrinker is available.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomaple_tree: add a fast path case in mas_wr_slot_store()
Peng Zhang [Wed, 28 Jun 2023 07:36:57 +0000 (15:36 +0800)]
maple_tree: add a fast path case in mas_wr_slot_store()

When expanding a range in two directions, only partially overwriting the
previous and next ranges, the number of entries will not be increased, so
we can just update the pivots as a fast path. However, it may introduce
potential risks in RCU mode, because it updates two pivots. We only
enable it in non-RCU mode.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomaple_tree: optimize mas_wr_append(), also improve duplicating VMAs
Peng Zhang [Wed, 28 Jun 2023 07:36:56 +0000 (15:36 +0800)]
maple_tree: optimize mas_wr_append(), also improve duplicating VMAs

When the new range can be completely covered by the original last range
without touching the boundaries on both sides, two new entries can be
appended to the end as a fast path. We update the original last pivot at
the end, and the newly appended two entries will not be accessed before
this, so it is also safe in RCU mode.

This is useful for sequential insertion, which is what we do in
dup_mmap(). Enabling BENCH_FORK in test_maple_tree and just running
bench_forking() gives the following time-consuming numbers:

before:               after:
17,874.83 msec        15,738.38 msec

It shows about a 12% performance improvement for duplicating VMAs.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomaple_tree: add test for expanding range in RCU mode
Peng Zhang [Wed, 28 Jun 2023 07:36:55 +0000 (15:36 +0800)]
maple_tree: add test for expanding range in RCU mode

Add a test for expanding a range in RCU mode.  If we were to use the fast
path of the slot store to expand a range in RCU mode, this test would
fail.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomaple_tree: add test for mas_wr_modify() fast path
Peng Zhang [Wed, 28 Jun 2023 07:36:54 +0000 (15:36 +0800)]
maple_tree: add test for mas_wr_modify() fast path

Patch series "Optimize the fast path of mas_store()", v4.

Add fast paths for mas_wr_append() and mas_wr_slot_store() respectively.
The newly added fast path of mas_wr_append() is used in fork() and how
much it benefits fork() depends on how many VMAs are duplicated.

Thanks Liam for the review.

This patch (of 4):

Add tests for all cases of mas_wr_append() and mas_wr_slot_store().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/filemap.c: fix update prev_pos after one read request done
Haibo Li [Wed, 28 Jun 2023 11:02:20 +0000 (19:02 +0800)]
mm/filemap.c: fix update prev_pos after one read request done

ra->prev_pos tracks the last visited byte in the previous read request.
It is used to check whether it is sequential read in ondemand_readahead
and thus affects the readahead window.

After commit 06c0444290ce ("mm/filemap.c: generic_file_buffered_read() now
uses find_get_pages_contig"), the update logic of prev_pos changed.  It
now updates prev_pos after each return from filemap_get_pages().  But the
read request from the user may not be fully completed at this point.  The
updated prev_pos impacts the subsequent readahead window.

The real problem is a performance drop of fsck_msdos between linux-5.4
and linux-5.15 (also linux-6.4).  Compared to linux-5.4, it spends about
110% of the time and reads 140% of the pages.  The read pattern of
fsck_msdos is not fully sequential.

The simplified read pattern of fsck_msdos looks like below:
1. read at page offset 0xa, size 0x1000
2. read at another page offset like 0x20, size 0x1000
3. read at page offset 0xa, size 0x4000
4. read at page offset 0xe, size 0x1000

Here is the read status on linux-6.4:
1. after read at page offset 0xa, size 0x1000
    ->page ofs 0xa goes into the pagecache
2. after read at page offset 0x20, size 0x1000
    ->page ofs 0x20 goes into the pagecache
3. read at page offset 0xa, size 0x4000
    ->filemap_get_pages reads ofs 0xa from the pagecache and returns
    ->prev_pos is updated to 0xb; go to the next loop
    ->filemap_get_pages tends to read ofs 0xb, size 0x3000
    ->initial_readahead case in ondemand_readahead, since prev_pos is
      the same as the request ofs
    ->reads 8 pages while the async size is 5 pages
      (PageReadahead flag at page 0xe)
4. read at page offset 0xe, size 0x1000
    ->hits page 0xe with the PageReadahead flag set, doubling the
      ra_size; reads 16 pages while the async size is 16 pages
Now it reads 24 pages while actually using 5 pages

On linux-5.4:
1. the same as 6.4
2. the same as 6.4
3. read at page offset 0xa, size 0x4000
    ->reads ofs 0xa from the pagecache
    ->reads ofs 0xb, size 0x3000 using page_cache_sync_readahead,
      reading 3 pages
    ->prev_pos is updated to 0xd before generic_file_buffered_read
      returns
4. read at page offset 0xe, size 0x1000
    ->initial_readahead case in ondemand_readahead, since
      request ofs - prev_pos == 1
    ->reads 4 pages while the async size is 3 pages

Now it reads 7 pages while actually using 5 pages.

In the above demo, the initial_readahead case is triggered by the offset
of the user request on linux-5.4, while it may be triggered by the update
logic of prev_pos on linux-6.4.

To fix the performance drop, update prev_pos after finishing one read
request.
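
A hedged sketch of where the update moves in filemap_read() (not the
exact diff):

    do {
        error = filemap_get_pages(iocb, iter, &fbatch, false);
        /* ... copy folios out to the user buffer ... */
        /* (no per-iteration ra->prev_pos update any more)  */
    } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);

    /* only once the whole read request is done: */
    ra->prev_pos = iocb->ki_pos;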

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haibo Li <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: AngeloGioacchino Del Regno <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matthias Brugger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftests/mm: add gup test matrix in run_vmtests.sh
Peter Xu [Wed, 28 Jun 2023 21:53:10 +0000 (17:53 -0400)]
selftests/mm: add gup test matrix in run_vmtests.sh

Add a matrix for testing gup based on the current gup_test.  Only run the
matrix when -a is specified because it's a bit slow.

It covers:

  - Different types of huge pages: thp, hugetlb, or no huge page
  - Permissions: Write / Read-only
  - Fast-gup, with/without
  - Types of the GUP: pin / gup / longterm pins
  - Shared / Private memories
  - GUP size: 1 / 512 / random page sizes
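
For example, one plausible invocation (hedged; check the script's usage
output for the exact flags, but -t already selects a test category):

  $ sudo ./run_vmtests.sh -a -t gup_test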

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agoselftests/mm: add -a to run_vmtests.sh
Peter Xu [Wed, 28 Jun 2023 21:53:09 +0000 (17:53 -0400)]
selftests/mm: add -a to run_vmtests.sh

Allow specifying optional tests in run_vmtests.sh; the time-consuming
test matrix only runs when the user specifies "-a".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/gup: retire follow_hugetlb_page()
Peter Xu [Wed, 28 Jun 2023 21:53:08 +0000 (17:53 -0400)]
mm/gup: retire follow_hugetlb_page()

Now __get_user_pages() should be well prepared to handle thp completely,
as well as hugetlb gup requests, even without hugetlb's special path.

Time to retire follow_hugetlb_page().

Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/gup: accelerate thp gup even for "pages != NULL"
Peter Xu [Wed, 28 Jun 2023 21:53:07 +0000 (17:53 -0400)]
mm/gup: accelerate thp gup even for "pages != NULL"

The acceleration of THP was done with ctx.page_mask; however, it'll be
ignored if **pages is non-NULL.

The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages").  It didn't explain why
we can't optimize the **pages non-NULL case.  It's possible that at that
time the major goal was mm_populate(), for which this was enough back
then.

Optimize thp for all cases, by properly looping over each subpage, doing
cache flushes, and boost refcounts / pincounts where needed in one go.

This can be verified using gup_test below:

  # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before:    13992.50 ( +-8.75%)
After:       378.50 (+-69.62%)

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/gup: cleanup next_page handling
Peter Xu [Wed, 28 Jun 2023 21:53:06 +0000 (17:53 -0400)]
mm/gup: cleanup next_page handling

The only path that doesn't use the generic "**pages" handling is the gate
vma.  Make it use the same path; meanwhile, move the next_page label up
to cover the "**pages" handling.  This prepares for THP handling of
"**pages".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/hugetlb: add page_mask for hugetlb_follow_page_mask()
Peter Xu [Wed, 28 Jun 2023 21:53:05 +0000 (17:53 -0400)]
mm/hugetlb: add page_mask for hugetlb_follow_page_mask()

follow_page() doesn't need it, but we'll start to need it when unifying
gup for hugetlb.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/hugetlb: prepare hugetlb_follow_page_mask() for FOLL_PIN
Peter Xu [Wed, 28 Jun 2023 21:53:04 +0000 (17:53 -0400)]
mm/hugetlb: prepare hugetlb_follow_page_mask() for FOLL_PIN

follow_page() doesn't use FOLL_PIN, and hugetlb doesn't seem to be a
target of FOLL_WRITE either.  However, add the checks anyway.

Namely, check for either the need to CoW due to a missing write bit, or
proper unsharing of !AnonExclusive pages over R/O pins, to reject the
follow page.  That brings this function closer to follow_hugetlb_page().

So we don't care before, and also for now.  But we'll care if we switch
over slow-gup to use hugetlb_follow_page_mask().  We'll also care when we
properly return -EMLINK, as that's the gup-internal API meaning "we
should unshare".  It's not really needed for the follow page path,
though.

While at it, switch try_grab_page() to use WARN_ON_ONCE(), to make clear
that it just should never fail.  When an error happens, capture the
errno instead of setting page==NULL.
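
A hedged sketch of that error handling (not the exact diff):

    ret = try_grab_page(page, flags);
    if (WARN_ON_ONCE(ret))
        page = ERR_PTR(ret);    /* propagate the errno, not NULL */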

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months agomm/hugetlb: handle FOLL_DUMP well in follow_page_mask()
Peter Xu [Wed, 28 Jun 2023 21:53:03 +0000 (17:53 -0400)]
mm/hugetlb: handle FOLL_DUMP well in follow_page_mask()

Patch series "mm/gup: Unify hugetlb, speed up thp", v4.

Hugetlb has a special path for slow gup in which follow_page_mask() is
actually skipped completely, along with faultin_page().  It's not only
confusing, but it also duplicates a lot of logic that generic gup already
has, making hugetlb slightly special.

This patchset tries to dedup the logic, by first touching up the slow
gup code to handle hugetlb pages correctly with the current follow-page
and faultin routines (we're mostly there already; 10 years ago we did try
to optimize thp, but that was only half done; more below), and then, in
the last patch, dropping the special path, so that hugetlb gup always
goes through the generic routine too, via faultin_page().

Note that hugetlb is still special for gup, mostly due to the pgtable
walking (hugetlb_walk()) that we rely on which is currently per-arch.  But
this is still one small step forward, and the diffstat might be a proof
too that this might be worthwhile.

Then for the "speed up thp" side: as a side effect, when I'm looking at
the chunk of code, I found that thp support is actually partially done.
It doesn't mean that thp won't work for gup, but as long as **pages
pointer passed over, the optimization will be skipped too.  Patch 6 should
address that, so for thp we now get full speed gup.

For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10"
gives me 13992.50us -> 378.50us.  Gup_test is an extreme case, but just to
show how it affects thp gups.

This patch (of 8):

Firstly, no_page_table() is meaningless for hugetlb, where it is a no-op,
because a hugetlb page always satisfies:

  - vma_is_anonymous() == false
  - vma->vm_ops->fault != NULL

So we can already safely remove it in hugetlb_follow_page_mask(), along
with the page* variable.

Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
dump: we try to fault in the page only if the page cache is already
allocated.  Let's do the same here for follow_page_mask() on hugetlb.

So far this should have zero effect on real dumps, because those still go
through follow_hugetlb_page().  But it may start to influence follow_page()
users who mimic a "dump page" scenario, hopefully in a good way.  This
also paves the way for unifying the hugetlb gup-slow path.
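
Concretely, the new behaviour can be sketched as (illustrative, not the
literal patch):

  /* in hugetlb_follow_page_mask(), after the pgtable walk: */
  if (!page && (flags & FOLL_DUMP))
          /* no page cache: leave a hole in the dump instead of faulting */
          page = ERR_PTR(-EFAULT);
  return page;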

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A . Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago arm64: mte: simplify swap tag restoration logic
Peter Collingbourne [Tue, 23 May 2023 00:43:10 +0000 (17:43 -0700)]
arm64: mte: simplify swap tag restoration logic

As a result of the patches "mm: Call arch_swap_restore() from
do_swap_page()" and "mm: Call arch_swap_restore() from unuse_pte()", there
are no circumstances in which a swapped-in page is installed in a page
table without first having arch_swap_restore() called on it.  Therefore,
we no longer need the logic in set_pte_at() that restores the tags, so
remove it.
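
With that, swapped-in tag restoration is handled by the arch hook alone;
roughly (a from-memory sketch of arch/arm64/mm/mteswap.c, where
mte_restore_tags() is the real helper):

  void arch_swap_restore(swp_entry_t entry, struct folio *folio)
  {
          if (system_supports_mte())
                  mte_restore_tags(entry, &folio->page);
  }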

Link: https://lkml.kernel.org/r/[email protected]
Link: https://linux-review.googlesource.com/id/I8ad54476f3b2d0144ccd8ce0c1d7a2963e5ff6f3
Signed-off-by: Peter Collingbourne <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: [email protected]
Cc: kasan-dev <[email protected]>
Cc: "Kuan-Ying Lee (李冠穎)" <[email protected]>
Cc: Qun-Wei Lin <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago mm: call arch_swap_restore() from unuse_pte()
Peter Collingbourne [Tue, 23 May 2023 00:43:09 +0000 (17:43 -0700)]
mm: call arch_swap_restore() from unuse_pte()

We would like to move away from requiring architectures to restore
metadata from swap in the set_pte_at() implementation, as this is not only
error-prone but adds complexity to the arch-specific code.  This requires
us to call arch_swap_restore() before calling swap_free() whenever pages
are restored from swap.  We are currently doing so everywhere except in
unuse_pte(); do so there as well.
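
The required ordering in unuse_pte() can be sketched as (abbreviated,
illustrative):

  /* restore arch metadata (e.g. arm64 MTE tags) while the swap
   * entry is still valid, i.e. before the slot can be reused: */
  arch_swap_restore(entry, folio);
  ...
  set_pte_at(vma->vm_mm, addr, pte, new_pte);
  swap_free(entry);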

Link: https://lkml.kernel.org/r/[email protected]
Link: https://linux-review.googlesource.com/id/I68276653e612d64cde271ce1b5a99ae05d6bbc4f
Signed-off-by: Peter Collingbourne <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: "Huang, Ying" <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: kasan-dev <[email protected]>
Cc: "Kuan-Ying Lee (李冠穎)" <[email protected]>
Cc: Qun-Wei Lin <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago mm: make show_free_areas() static
Kefeng Wang [Fri, 30 Jun 2023 06:22:53 +0000 (14:22 +0800)]
mm: make show_free_areas() static

All callers of show_free_areas() pass 0 and NULL, so we can directly use
show_mem() instead of show_free_areas(0, NULL), allowing show_free_areas()
to become a static function.
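
The per-callsite conversion is mechanical, along these lines:

  -       show_free_areas(0, NULL);
  +       show_mem();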

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago mm: remove arguments of show_mem()
Kefeng Wang [Fri, 30 Jun 2023 06:22:52 +0000 (14:22 +0800)]
mm: remove arguments of show_mem()

All callers of show_mem() pass 0 and NULL, so we can remove the two
arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in
show_mem().
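
So show_mem() becomes a trivial wrapper, roughly:

  static inline void show_mem(void)
  {
          __show_mem(0, NULL, MAX_NR_ZONES - 1);
  }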

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago mm: make MEMFD_CREATE into a selectable config option
Thomas Weißschuh [Fri, 30 Jun 2023 09:08:53 +0000 (11:08 +0200)]
mm: make MEMFD_CREATE into a selectable config option

The memfd_create() syscall, enabled by CONFIG_MEMFD_CREATE, is useful on
its own even when not required by CONFIG_TMPFS or CONFIG_HUGETLBFS.

Split it into its own proper bool option that can be enabled by users.

Move that option into mm/ where the code itself also lies.  Also add
"select" statements to CONFIG_TMPFS and CONFIG_HUGETLBFS so they
automatically enable CONFIG_MEMFD_CREATE as before.
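
In Kconfig terms the result looks roughly like this (abbreviated sketch,
not the full entries):

  config MEMFD_CREATE
          bool "Enable memfd_create() system call"

  config TMPFS
          ...
          select MEMFD_CREATE

  config HUGETLBFS
          ...
          select MEMFD_CREATE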

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Weißschuh <[email protected]>
Tested-by: Zhangjin Wu <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago mm: remove page_rmapping()
ZhangPeng [Sat, 1 Jul 2023 03:28:53 +0000 (11:28 +0800)]
mm: remove page_rmapping()

After converting the last user to folio_raw_mapping(), we can safely
remove the function.
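
The conversion of that last user was along these lines (illustrative
diff; page_rmapping(page) was simply folio_raw_mapping() on the page's
folio):

  -       mapping = page_rmapping(page);
  +       mapping = folio_raw_mapping(page_folio(page));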

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago mm: use a folio in fault_dirty_shared_page()
ZhangPeng [Sat, 1 Jul 2023 03:28:52 +0000 (11:28 +0800)]
mm: use a folio in fault_dirty_shared_page()

We can replace four implicit calls to compound_head() with one by using a
folio.
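
An abbreviated sketch of the shape of the change (not the full function):

  struct folio *folio = page_folio(vmf->page);  /* the one compound_head() */

  dirtied = folio_mark_dirty(folio);    /* was: set_page_dirty(page)   */
  mapping = folio_raw_mapping(folio);   /* was: page_rmapping(page)    */
  folio_unlock(folio);                  /* was: unlock_page(page)      */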

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
20 months ago swap: stop add to avail list if swap is full
Ma Wupeng [Tue, 27 Jun 2023 12:08:33 +0000 (20:08 +0800)]
swap: stop add to avail list if swap is full

Our testing hit a WARN_ON in add_to_avail_list(): on entry, avail_lists is
already on swap_avail_heads, which leads to this WARN_ON.

Here is the simplified calltrace:

------------[ cut here ]------------
Call trace:
 add_to_avail_list+0xb8/0xc0
 swap_range_free+0x110/0x138
 swapcache_free_entries+0x100/0x1c0
 free_swap_slot+0xbc/0xe0
 put_swap_folio+0x1f0/0x2ec
 delete_from_swap_cache+0x6c/0xd0
 folio_free_swap+0xa4/0xe4
 __try_to_reclaim_swap+0x9c/0x190
 free_swap_and_cache+0x84/0x88
 unmap_page_range+0x31c/0x934
 unmap_single_vma.isra.0+0x48/0x84
 unmap_vmas+0x98/0x10c
 exit_mmap+0xa4/0x210
 mmput+0x88/0x158
 do_exit+0x284/0x970
 do_group_exit+0x34/0x90
 post_copy_siginfo_from_user32+0x0/0x1cc
 do_notify_resume+0x15c/0x470
 el0_svc+0x74/0x84
 el0t_64_sync_handler+0xb8/0xbc
 el0t_64_sync+0x190/0x194

During swapoff, try_to_unuse() fails to allocate memory because of the
memory limit, which makes swapoff fail and causes the swap space to be
re-inserted into swap_list.  In _enable_swap_info(), this swap device is
then added to the avail list even though it is full.  Later, when one
entry in this full swap device is released, we try to add the device to
the avail list again and find it is already there.  This triggers the
WARN_ON.

To fix this, don't add a swap device to the avail list if it is full.
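
A minimal sketch of the idea (not the literal patch; the check lives in
mm/swapfile.c):

  /* only put the device back on the avail list while it still
   * has free entries */
  if (si->inuse_pages < si->pages)
          add_to_avail_list(si);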

[[email protected]: coding-style cleanups]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ma Wupeng <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>