Shakeel Butt [Thu, 25 Aug 2022 00:05:04 +0000 (00:05 +0000)]
mm: page_counter: remove unneeded atomic ops for low/min
Patch series "memcg: optimize charge codepath", v2.
Recently the Linux networking stack moved from a very old per-socket
pre-charge cache to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that for network
traffic workloads, the memcg charging codepath can become a bottleneck. The
kernel test robot has also reported this regression[1]. This patch series
tries to improve memcg charging for such workloads.
This patch series implements three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
between usage and high.
(C) Increase the memcg charge batch to 64.
To evaluate the impact of these optimizations, on a 72-CPU machine we
ran the following workload in the root memcg and then compared it with the
scenario where the workload is run in a three-level cgroup hierarchy with
the top level having min and low set up appropriately.
$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally. We can optimize out this atomic operation for one
specific scenario where the workload is using the protection (i.e. min >
0) and the usage is above the protection (i.e. usage > min).
This scenario is actually very common: users want part of their workload
to be protected against external reclaim. The optimization does introduce
a race when the usage is around the protection and concurrent charges and
uncharges trip it over or under the protection. In such cases we might see
lower effective protection, but the subsequent charge/uncharge will
correct it.
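A minimal userspace model of the skipped xchg, for illustration only (the real code lives in mm/page_counter.c, uses atomic_long_t and handles both min and low; the names below are simplified):
#include <stdatomic.h>

struct page_counter_model {
        atomic_long usage;
        long min;               /* protection configured by the user */
        atomic_long min_usage;  /* usage currently protected */
};

/* Called on every charge/uncharge with the updated usage value. */
static void propagate_protected_usage(struct page_counter_model *c, long usage)
{
        long protected = usage < c->min ? usage : c->min;
        long old = atomic_load(&c->min_usage);

        /* Skip the atomic exchange when the protected value has not changed,
         * i.e. the common case where min > 0 and usage > min keeps it pinned
         * at min. */
        if (protected != old)
                atomic_exchange(&c->min_usage, protected);
}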
To evaluate the impact of this optimization, on a 72-CPU machine we ran
the following workload in a three-level cgroup hierarchy with the top level
having min and low set up appropriately, to see if this optimization is
effective for the mentioned case.
$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 14542.5 Mbps (38.7% improvement)
drivers/block/zram/zram_drv.c: do not keep dangling zcomp pointer after zram reset
We do all reset operations under the write lock, so we don't need to save
->disksize and ->comp to stack variables. Another thing is that ->comp is
freed during zram reset, but the comp pointer is not NULLed, so zram keeps
the freed pointer value.
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.
When migration succeeds all pages will have been unpinned so pinning needs
to be retried. Migration can also fail, in which case the pages will also
have been unpinned but the operation should not be retried. If all pages
are in the correct zone nothing will be unpinned and no retry is required.
The logic in check_and_migrate_movable_pages() tracks unnecessary state
and the return codes for each case are difficult to follow. Refactor the
code to clean this up. No behaviour change is intended.
Alistair Popple [Wed, 24 Aug 2022 05:09:51 +0000 (15:09 +1000)]
mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages()
gup_flags is passed to check_and_migrate_movable_pages() so that it can
call either put_page() or unpin_user_page() to drop the page reference.
However check_and_migrate_movable_pages() is only called for
FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass
gup_flags.
mm: skip retry when new limit is not below old one in page_counter_set_max
In page_counter_set_max(), we want to make sure the new limit is not below
the concurrently-changing counter value. We read the counter and check
that the limit is not below the counter before the swap. After the swap,
we read the counter again and retry in case the counter was incremented, as
this may violate the requirement. Even though page_counter_try_charge()
can see the old limit, it is guaranteed that the counter is not above the
old limit after the increment. So if the new limit is not below the
old limit, the counter is guaranteed to be not above the new limit too.
We can skip the retry in this case to optimize a little bit.
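A rough userspace model of the described logic, for illustration only (the actual function is in mm/page_counter.c); the "limit >= old" early exit is the optimization this patch adds:
#include <errno.h>
#include <stdatomic.h>

struct counter_model {
        atomic_long usage;
        atomic_long max;
};

static int counter_set_max(struct counter_model *c, long limit)
{
        for (;;) {
                long usage = atomic_load(&c->usage);
                long old;

                if (usage > limit)
                        return -EBUSY;

                old = atomic_exchange(&c->max, limit);

                /* A racing charge can only have pushed usage up to the old
                 * limit, so if the new limit is not below the old one the
                 * counter cannot be above the new limit either: skip the
                 * retry. */
                if (atomic_load(&c->usage) <= usage || limit >= old)
                        return 0;

                atomic_store(&c->max, old);
        }
}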
Yang Shi [Tue, 16 Aug 2022 18:58:01 +0000 (11:58 -0700)]
mm: memcg: export workingset refault stats for cgroup v1
Workingset refault stats are important and useful metrics to measure how
well the reclaimer and swapping work and how healthy the services are, but
they are only available for cgroup v2. There are still plenty of users on
cgroup v1, so export the stats for cgroup v1.
Baolin Wang [Thu, 18 Aug 2022 07:37:44 +0000 (15:37 +0800)]
mm/damon: replace pmd_huge() with pmd_trans_huge() for THP
pmd_huge() is usually used to indicate a pmd-level hugetlb page. However, a
pmd-mapped huge page can only be a THP in damon_mkold_pmd_entry() or
damon_young_pmd_entry(), so replace pmd_huge() with pmd_trans_huge() in
this case to make the code more readable, according to the discussion [1].
Baolin Wang [Thu, 18 Aug 2022 07:37:43 +0000 (15:37 +0800)]
mm/damon: validate if the pmd entry is present before accessing
pmd_huge() is used to validate whether the pmd entry is mapped by a huge
page, which also includes the case of a non-present (migration or
hwpoisoned) pmd entry on the arm64 and x86 architectures. This means that
pmd_pfn() cannot get the correct pfn number for a non-present pmd entry,
which will cause damon_get_page() to get an incorrect page struct (it may
also be NULL via pfn_to_online_page()), making the access statistics
incorrect.
This means that DAMON may make incorrect decisions based on the incorrect
statistics; for example, DAMON may fail to reclaim a cold page in time
because the cold page was mistakenly regarded as accessed when the
DAMOS_PAGEOUT operation is specified.
Moreover, it makes no sense to waste time getting the page of a non-present
entry. Just treat it as not accessed and skip it, which keeps consistency
with non-present pte-level entries.
So add pmd entry present validation to fix the above issues.
Yin Fengwei [Wed, 10 Aug 2022 06:49:07 +0000 (14:49 +0800)]
mm: release private data before split THP
If there is private data attached to a THP, the refcount of THP will be
increased and will prevent the THP from being split. Attempt to release
any private data attached to the THP before attempting the split to
increase the chance of splitting successfully.
A memory failure issue was hit during HW error injection testing with a
5.18 kernel + xfs as rootfs. The test was killed and a system reboot
was required to re-run the test.
The issue was tracked down to a THP split failure, which caused the memory
failure not to be handled. The page dump showed:
Qi Zheng [Thu, 18 Aug 2022 08:27:48 +0000 (16:27 +0800)]
mm: thp: remove redundant pgtable check in set_huge_zero_page()
When the pgtable is NULL in the set_huge_zero_page(), we should not
increment the count of PTE page table pages by calling mm_inc_nr_ptes().
Otherwise we may receive the following warning when the mm exits:
BUG: non-zero pgtables_bytes on freeing mm
Now we can't observe the above warning since only
do_huge_pmd_anonymous_page() invokes set_huge_zero_page() and the pgtable
can not be NULL.
Therefore, instead of moving mm_inc_nr_ptes() to the non-NULL branch of
pgtable, it is better to remove the redundant pgtable check directly.
We can choose to copy three contiguous tail pages' content to the first
three pages instead of copying one by one to simplify the code and reduce
code size from 229 bytes to 63 bytes. The BUILD_BUG_ON() aims to avoid
out-of-bounds accesses.
Miaohe Lin [Thu, 18 Aug 2022 13:00:16 +0000 (21:00 +0800)]
mm, hwpoison: avoid trying to unpoison reserved page
For reserved pages, the HWPoison flag will be set without increasing the
page refcnt. So we shouldn't even try to unpoison these pages and thus
decrease the page refcnt unexpectedly. Add a PageReserved() check to filter
this case out and remove the now-unneeded zero page check (the zero page is
reserved).
Miaohe Lin [Thu, 18 Aug 2022 13:00:15 +0000 (21:00 +0800)]
mm, hwpoison: kill procs if unmap fails
If try_to_unmap() fails, the hwpoisoned page still resides in the address
space of some processes. We should kill these processes or the hwpoisoned
page might be consumed later. collect_procs() is always called to collect
relevant processes now so they can be killed later if unmap fails.
Miaohe Lin [Tue, 23 Aug 2022 03:23:44 +0000 (11:23 +0800)]
mm, hwpoison: fix possible use-after-free in mf_dax_kill_procs()
After kill_procs(), tk will be freed without being removed from the
to_kill list. In the next iteration, the freed list entry in the to_kill
list will be accessed, leading to a use-after-free issue. Add
list_del() in kill_procs() to fix the issue.
Miaohe Lin [Thu, 18 Aug 2022 13:00:13 +0000 (21:00 +0800)]
mm, hwpoison: fix extra put_page() in soft_offline_page()
When hwpoison_filter() refuses to soft offline a page, the page refcnt
incremented previously by MF_COUNT_INCREASED would have been consumed via
get_hwpoison_page() if ret <= 0. So the put_ref_page() here will put the
extra one. Remove it to fix the issue.
Miaohe Lin [Thu, 18 Aug 2022 13:00:12 +0000 (21:00 +0800)]
mm, hwpoison: fix page refcnt leaking in unpoison_memory()
When free_raw_hwp_pages() fails to do its work, the refcnt of the hugetlb
page would have been incremented if ret > 0. Use put_page() to fix the
refcnt leak in this case.
Miaohe Lin [Thu, 18 Aug 2022 13:00:11 +0000 (21:00 +0800)]
mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb()
Patch series "A few fixup patches for memory-failure", v2.
This series contains a few fixup patches to fix incorrect update of page
refcnt, fix possible use-after-free issue and so on. More details can be
found in the respective changelogs.
This patch (of 6):
When hwpoison_filter() refuses to hwpoison a hugetlb page, the refcnt of
the page would have been incremented if res == 1. Use put_page() to fix
the refcnt leak in this case.
mm: fix use-after free of page_ext after race with memory-offline
Below is one path where a race between page_ext access and offlining of
the respective memory block causes a use-after-free on access of the
page_ext structure.
page_ext access                       memory offline
---------------                       --------------
b) PageBuddy check fails, thus
   proceed to get the page_owner
   information through page_ext
   access:
   page_ext = lookup_page_ext(page);
                                      migrate_pages();
                                      .................
                                      Since all pages are successfully
                                      migrated as part of the offline
                                      operation, send the MEM_OFFLINE
                                      notification, where for page_ext
                                      it calls:
                                        offline_page_ext() -->
                                          __free_page_ext() -->
                                            free_page_ext() -->
                                              vfree(ms->page_ext)
                                        mem_section->page_ext = NULL
c) Checking the PAGE_EXT flags in
   the page_ext->flags access
   results in the use-after-free
   (leading to translation faults).
As mentioned above, there is really no synchronization between page_ext
access and its freeing during memory offline.
The memory offline steps (roughly) on a memory block are as below:
1) Isolate all the pages.
2) while(1)
   try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
3) Delete the pages from this buddy list.
4) Then free page_ext. (Note: the struct page is still alive, as it is
   freed only during hot remove of the memory, which frees the memmap, a
   step the user might not perform.)
This design leads to a state where struct page is alive but struct
page_ext is freed, even though the latter is ideally part of the former and
just represents extra page flags (check [3] for why this design was
chosen).
The above race is just one example __but the problem persists in the
other paths too involving page_ext->flags access (e.g.
page_is_idle())__.
Fix all the paths where offline races with page_ext access by
synchronizing with the RCU lock; this is achieved in 3 steps (a reader-side
sketch follows the list):
1) Invalidate all the page_ext's of the sections of a memory block by
   storing a flag in the LSB of mem_section->page_ext.
2) Wait until all the existing readers finish working with the
   ->page_ext's, using synchronize_rcu(). Any parallel process that starts
   after this call will not get a page_ext, through lookup_page_ext(), for
   the block on which the offline operation is being performed.
3) Now safely free all the sections' ->page_ext's of the block on which the
   offline operation is being performed.
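A reader-side sketch of the pattern this implies, assuming kernel context (the series wraps this in page_ext_get()/page_ext_put()-style helpers; exact code in mm/page_ext.c may differ):
#include <linux/bitops.h>
#include <linux/page_ext.h>
#include <linux/rcupdate.h>

/* Returns true if @flag is set in the page's page_ext, false if the
 * page_ext has already been invalidated by a parallel memory offline. */
static bool page_ext_flag_test(struct page *page, unsigned long flag)
{
        struct page_ext *page_ext;
        bool ret = false;

        rcu_read_lock();
        page_ext = lookup_page_ext(page);       /* NULL once step 1 has run */
        if (page_ext)
                ret = test_bit(flag, &page_ext->flags);
        rcu_read_unlock();                      /* step 2 waits for readers here */
        return ret;
}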
Note: If synchronize_rcu() takes time then optimizations can be done in
this path through call_rcu()[2].
Thanks to David Hildenbrand for his views/suggestions on the initial
discussion[1] and Pavan kondeti for various inputs on this patch.
Matthew Wilcox [Thu, 18 Aug 2022 21:07:41 +0000 (22:07 +0100)]
mm/vmalloc.c: support HIGHMEM pages in vmap_pages_range_noflush()
If the pages being mapped are in HIGHMEM, page_address() returns NULL.
This probably wasn't noticed before because there aren't currently any
architectures with HAVE_ARCH_HUGE_VMALLOC and HIGHMEM, but it's simpler to
call page_to_phys() and futureproofs us against such configurations
existing.
Kefeng Wang [Mon, 15 Aug 2022 11:10:17 +0000 (19:10 +0800)]
mm: kill find_min_pfn_with_active_regions()
find_min_pfn_with_active_regions() is only called from free_area_init().
Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
and kill find_min_pfn_with_active_regions().
Miaohe Lin [Tue, 16 Aug 2022 13:05:53 +0000 (21:05 +0800)]
mm/hugetlb: make detecting shared pte more reliable
If the page tables are shared, we shouldn't copy or take references. Since
src could have unshared and dst could share with another vma, huge_pte_none()
is used to determine whether dst_pte is shared. But this check isn't
reliable: in fact, a shared pte could be pte_none in the page table. The
page count of the ptep page should be checked here instead in order to
reliably determine whether the pte is shared.
Miaohe Lin [Tue, 16 Aug 2022 13:05:52 +0000 (21:05 +0800)]
mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
The sysfs groups per_node_hstate_attr_group and, when h->demote_order != 0,
hstate_demote_attr_group are created in hugetlb_register_node(). But
these sysfs groups are not removed when the node is unregistered, so they
are leaked. Use sysfs_remove_group() to fix this issue.
Miaohe Lin [Tue, 16 Aug 2022 13:05:50 +0000 (21:05 +0800)]
mm/hugetlb: fix missing call to restore_reserve_on_error()
When huge_add_to_page_cache() fails, the page is freed directly without
calling restore_reserve_on_error() to restore the reserve for newly
allocated pages not in the page cache. Fix this by calling
restore_reserve_on_error() when huge_add_to_page_cache() fails.
Miaohe Lin [Tue, 16 Aug 2022 13:05:49 +0000 (21:05 +0800)]
mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
will be set to NULL. It will then be passed to sysfs_create_group() if
h->demote_order != 0, thus triggering the WARN_ON(!kobj) check. Fix this by
making sure hstate_kobjs[hi] != NULL when calling sysfs_create_group().
Miaohe Lin [Tue, 16 Aug 2022 13:05:48 +0000 (21:05 +0800)]
mm/hugetlb: fix incorrect update of max_huge_pages
Patch series "A few fixup patches for hugetlb".
This series contains a few fixup patches to fix incorrect update of
max_huge_pages, fix WARN_ON(!kobj) in sysfs_create_group() and so on.
More details can be found in the respective changelogs.
This patch (of 6):
target_hstate->max_huge_pages should be incremented by
pages_per_huge_page(h) / pages_per_huge_page(target_hstate) when a page is
demoted. Update max_huge_pages accordingly for consistency.
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload- and system-configuration
dependent. So in this patch, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
candidate promotion pages to match the promotion rate limit. If the
hint page fault latency of a page is less than the hot threshold, we will
try to promote the page; such a page is called a candidate promotion
page.
If the number of the candidate promotion pages in the statistics interval
is much more than the promotion rate limit, the hot threshold will be
decreased to reduce the number of the candidate promotion pages.
Otherwise, the hot threshold will be increased to increase the number of
the candidate promotion pages.
To make the above method work, in each statistics interval the total
number of pages to check (on which the hint page faults occur) and the
hot/cold distribution need to be stable. Because the page tables are
scanned linearly in NUMA balancing, but the hot/cold distribution usually
isn't uniform along the address space, the statistics interval should be
larger than the NUMA balancing scan period. So in this patch the max scan
period is used as the statistics interval, and it works well in our tests.
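A simplified model of the feedback loop described above, for illustration only (the function name, the 10% band and the adjustment step are assumptions; the real logic lives in the NUMA balancing code):
/* Run once per statistics interval (the max NUMA balancing scan period). */
static unsigned long adjust_hot_threshold(unsigned long threshold_ms,
                                          unsigned long nr_candidates,
                                          unsigned long rate_limit_pages)
{
        if (nr_candidates > rate_limit_pages * 11 / 10)
                threshold_ms = threshold_ms * 9 / 10;   /* too many: tighten */
        else if (nr_candidates < rate_limit_pages * 9 / 10)
                threshold_ms = threshold_ms * 11 / 10;  /* too few: relax */

        return threshold_ms ? threshold_ms : 1;         /* keep it non-zero */
}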
memory tiering: rate limit NUMA migration throughput
In NUMA balancing memory tiering mode, if there are hot pages in the slow
memory node and cold pages in the fast memory node, we need to
promote/demote hot/cold pages between the fast and slow memory nodes.
One choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by a high promoting/demoting throughput will
hurt the latency of some workloads because of access latency inflation and
contention on slow memory bandwidth.
A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting, but
the workload latency will be better. This is implemented in this patch as
the page promotion rate limit mechanism.
The number of candidate pages to be promoted to the fast memory node
via NUMA balancing is counted; if the count exceeds the limit specified by
the user, NUMA balancing promotion will be stopped until the next
second.
A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
for the users to specify the limit.
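A rough sketch of the per-second rate limiter described above (names are illustrative; the actual implementation accounts candidate pages in per-node state):
#include <stdbool.h>

struct promote_rate_state {
        unsigned long window_start;     /* jiffies at start of the 1s window */
        unsigned long nr_candidates;    /* candidate pages seen this window */
};

/* Called per candidate page; returns true if promotion may proceed. */
static bool promotion_within_rate_limit(struct promote_rate_state *st,
                                        unsigned long now, unsigned long hz,
                                        unsigned long limit_pages_per_sec,
                                        unsigned long nr_pages)
{
        if (now - st->window_start >= hz) {     /* new one-second window */
                st->window_start = now;
                st->nr_candidates = 0;
        }

        st->nr_candidates += nr_pages;

        /* Over the limit: stop promoting until the next second. */
        return st->nr_candidates <= limit_pages_per_sec;
}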
memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.
To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified.
Essentially, the original NUMA balancing implementation selects the most
recently accessed (MRU) pages to promote. But this isn't a perfect
algorithm for identifying hot pages, because pages with quite low
access frequency may eventually be accessed, given that the NUMA balancing
page table scanning period can be quite long (e.g. 60 seconds). So in this
patchset we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and the hint page
fault, which is a kind of most frequently accessed (MFU) algorithm.
In NUMA balancing memory tiering mode, if there are hot pages in the slow
memory node and cold pages in the fast memory node, we need to
promote/demote hot/cold pages between the fast and slow memory nodes.
One choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by a high promoting/demoting throughput will
hurt the latency of some workloads because of access latency inflation and
contention on slow memory bandwidth.
A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting, but
the workload latency will be better. This is implemented in this patchset
as the page promotion rate limit mechanism.
The promotion hot threshold is workload- and system-configuration
dependent. So in this patchset, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
candidate promotion pages to match the promotion rate limit.
We used the pmbench memory accessing benchmark to test the patchset on a
2-socket server system with DRAM and PMEM installed. The test results are
as follows:
                    pmbench score     promote rate
                     (accesses/s)             MB/s
                    -------------     ------------
base                  146887704.1            725.6
hot selection         165695601.2            544.0
rate limit            162814569.8            165.2
auto adjustment       170495294.0            136.9
From the results above:
With the hot page selection patch [1/3], the pmbench score increases by
about 12.8%, and the promote rate (overhead) decreases by about 25.0%,
compared with the base kernel.
With the rate limit patch [2/3], the pmbench score decreases by about 1.7%,
and the promote rate decreases by about 69.6%, compared with the hot page
selection patch.
With the threshold auto adjustment patch [3/3], the pmbench score increases
by about 4.7%, and the promote rate decreases by about 17.1%, compared with
the rate limit patch.
Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).
To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified. Essentially,
the original NUMA balancing implementation selects the most recently
accessed (MRU) pages to promote. But this isn't a perfect algorithm for
identifying hot pages, because pages with quite low access frequency
may eventually be accessed, given that the NUMA balancing page table
scanning period can be quite long (e.g. 60 seconds). The most frequently
accessed (MFU) algorithm is better.
So, in this patch we implement a better hot page selection algorithm,
based on NUMA balancing page table scanning and hint page faults, as
follows:
- When the page tables of the processes are scanned to change PTE/PMD
  to PROT_NONE, the current time is recorded in struct page as the scan
  time.
- When the page is accessed, a hint page fault will occur. The scan
  time is read from the struct page, and the hint page fault
  latency is defined as
      hint page fault time - scan time
The shorter the hint page fault latency of a page, the higher the
probability that its access frequency is high. So the hint page
fault latency is a better estimate of how hot or cold the page is (see the
sketch below).
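A conceptual sketch of the hot-page test in the hint fault path, for illustration only (helper names are assumptions; the real code reuses the cpupid bits described next):
#include <stdbool.h>

/* Hint page fault latency: time between the PROT_NONE scan and the fault. */
static inline unsigned long hint_fault_latency(unsigned long fault_time,
                                               unsigned long scan_time)
{
        return fault_time - scan_time;
}

/* A page is a promotion candidate if it was touched soon after the scan,
 * i.e. its latency is below the hot threshold (1 second by default). */
static inline bool page_is_hot(unsigned long fault_time,
                               unsigned long scan_time,
                               unsigned long hot_threshold)
{
        return hint_fault_latency(fault_time, scan_time) < hot_threshold;
}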
It's hard to find extra space in struct page to hold the scan time.
Fortunately, we can reuse some bits used by the original NUMA balancing.
NUMA balancing uses some bits in struct page to store the accessing
CPU and PID (see page_cpupid_xchg_last()); these bits are used by the
multi-stage node selection algorithm to avoid migrating pages that are
shared-accessed by multiple NUMA nodes back and forth. But for pages in the
slow memory node, even if they are shared-accessed by multiple NUMA nodes,
as long as the pages are hot they need to be promoted to the fast memory
node. So the accessing CPU and PID information is unnecessary for the
slow memory pages, and we can reuse these bits in struct page to record the
scan time. For the fast memory pages, these bits are used as before.
For the hot threshold, the default value is 1 second, which works well in
our performance tests. All pages with a hint page fault latency < hot
threshold will be considered hot.
It's hard for users to determine the hot threshold, so we don't provide a
kernel ABI to set it, just a debugfs interface for advanced users
to experiment with. We will continue to work on a hot threshold automatic
adjustment mechanism.
The downside of the above method is that the response time to a workload
hot spot change may be much longer. For example:
- A previously cold memory area becomes hot.
- A hint page fault will be triggered, but the hint page fault
  latency isn't shorter than the hot threshold, so the pages will
  not be promoted.
- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.
To mitigate this, if there is enough free space in the fast memory node,
the hot threshold is not used and all pages are promoted upon the
hint page fault for fast response.
Thanks to Zhong Jiang for reporting and testing the fix for a bug when
disabling memory tiering mode dynamically.
Kefeng Wang [Tue, 26 Jul 2022 14:54:28 +0000 (22:54 +0800)]
mm/util.c: add warning if __vm_enough_memory fails
If a process does not have enough memory to allocate a new virtual mapping,
we may see various kinds of errors, e.g. fork cannot allocate memory or a
SIGBUS error in shmem, but it is difficult to confirm them. Let's add some
debug information to make it easy to check this scenario when
__vm_enough_memory() fails.
gfp_migratetype() also expects GFP_RECLAIMABLE and
GFP_MOVABLE|GFP_RECLAIMABLE to be shiftable into MIGRATE_* enum values, so
add some more BUILD_BUG_ONs to reflect this assumption.
mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.
When migration succeeds all pages will have been unpinned so pinning needs
to be retried. This is indicated by returning zero. When all pages are
in the correct zone the number of pinned pages is returned.
However migration can also fail, in which case pages are unpinned and
-ENOMEM is returned. However, if the failure was due to not being able
to isolate a page, zero is returned. This leads to indefinite looping in
__gup_longterm_locked().
Fix this by simplifying the return codes such that zero indicates all
pages were successfully pinned in the correct zone while errors indicate
either pages were migrated and pinning should be retried or that migration
has failed and therefore the pinning operation should fail.
Patch series "A few cleanup patches for hugetlb_cgroup", v2.
This series contains a few cleanup patches to remove an unneeded check, use
a helper macro, remove an unneeded return value, and so on. More details
can be found in the respective changelogs.
This patch (of 5):
When the code reaches here, nr_pages must be > 0. Remove the unneeded
nr_pages > 0 check to simplify the code.
Tarun Sahu [Mon, 1 Aug 2022 07:02:31 +0000 (12:32 +0530)]
Kselftests: remove support of libhugetlbfs from kselftests
libhugetlbfs, the userspace utility for working with hugepages, does not
have any active support. There are only 2 selftests, part of
vm/hmm_test.c, that depend on libhugetlbfs.
This patch modifies the tests so that they will not require the
libhugetlbfs library.
Imran Khan [Sun, 14 Aug 2022 19:53:53 +0000 (05:53 +1000)]
kfence: add sysfs interface to disable kfence for selected slabs.
By default a kfence allocation can happen for any slab object whose size is
up to PAGE_SIZE, as long as that allocation is the first allocation after
expiration of the kfence sample interval. But in certain debugging
scenarios we may be interested in debugging corruptions involving some
specific slab objects like dentry or ext4_* etc. In such cases, limiting
kfence to allocations involving only specific slab objects will increase
the probability of catching the issue, since the kfence pool will not be
consumed by other slab objects.
This patch introduces a sysfs interface
'/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific
slabs. Having the interface work in this way does not impact the
current/default behavior of kfence and allows us to use kfence for
specific slabs (when needed) as well. The decision to skip/use kfence is
taken depending on whether kmem_cache.flags has the (newly introduced)
SLAB_SKIP_KFENCE flag set or not.
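A hedged sketch, assuming kernel context, of the gate this flag implies (the actual check in mm/kfence/core.c may look different; SLAB_SKIP_KFENCE is the flag introduced by this patch):
#include <linux/slab.h>

/* Skip the KFENCE pool for caches that opted out via
 * /sys/kernel/slab/<name>/skip_kfence, i.e. have SLAB_SKIP_KFENCE set. */
static inline bool cache_skips_kfence(const struct kmem_cache *s)
{
        return !!(s->flags & SLAB_SKIP_KFENCE);
}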
Haiyue Wang [Fri, 12 Aug 2022 08:49:21 +0000 (16:49 +0800)]
mm: migration: fix the FOLL_GET failure on following huge page
Not all huge page APIs support the FOLL_GET option, so the move_pages()
syscall will fail to get the page node information for some huge pages.
For example, on x86 with Linux 5.19, the 1GB huge page API follow_huge_pud()
will return a NULL page for FOLL_GET when the move_pages() syscall is
called with a NULL 'nodes' parameter; the 'status' parameter has a '-2'
error in the array.
Note: follow_huge_pud() now supports FOLL_GET in linux 6.0. Link: https://lore.kernel.org/all/[email protected]
But these huge page APIs don't support FOLL_GET:
1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
   It will cause a WARN_ON_ONCE for FOLL_GET.
3. follow_huge_pgd() in mm/hugetlb.c
This is a temporary solution to mitigate the side effect of the race
condition fix, by calling follow_page() with FOLL_GET set for huge pages.
After support for following huge pages with FOLL_GET is done, this fix can
be reverted safely.
Yang Yang [Sat, 13 Aug 2022 08:07:58 +0000 (08:07 +0000)]
mm/vmscan: make the annotations of refaults code at the right place
After the patch "mm/workingset: prepare the workingset detection
infrastructure for anon LRU", we can handle refaults of anonymous
pages too. So the annotations of refaults should cover both anonymous
pages and file pages.
Yixuan Cao [Fri, 12 Aug 2022 15:55:15 +0000 (23:55 +0800)]
tools/vm/page_owner_sort: fix -f option
The -f option is meant to filter out the information of blocks whose memory
has not been released; I noticed some blocks should not be filtered out.
Commit 9cc7e96aa846 ("mm/page_owner: record timestamp and pid") records
the allocation timestamp (ts_nsec) of all pages.
Commit 866b48526217 ("mm/page_owner: record the timestamp of all pages
during free") records the free timestamp (free_ts_nsec) of all pages.
When the page is allocated for the first time, the initial value of
free_ts_nsec is 0, and the corresponding time will be recorded when the
page is released. But during reallocation free_ts_nsec is not reset
to 0. In particular, when page migration occurs, these two
timestamps will be the same.
Now page_owner_sort removes all text blocks whose free_ts_nsec is not 0
when using the -f option. However, this can only select pages allocated
for the first time. If a freed page is reallocated, free_ts_nsec will be
less than ts_nsec; if page migration occurs, the two timestamps will be
equal. These cases should be considered as pages that are not released.
So fix the function is_need() to keep text blocks that meet the above
two conditions when using the -f option.
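A sketch of the corrected filter condition described above, for illustration only (is_need() in tools/vm/page_owner_sort.c has more checks than shown here):
#include <stdbool.h>
#include <stdint.h>

/* A block counts as "not released" (and is kept by -f) if its page was
 * never freed, was reallocated after a free, or was migrated. */
static bool block_not_released(uint64_t ts_nsec, uint64_t free_ts_nsec)
{
        return free_ts_nsec == 0 ||             /* allocated, never freed */
               free_ts_nsec < ts_nsec ||        /* freed, then reallocated */
               free_ts_nsec == ts_nsec;         /* page migration */
}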
Li kunyu [Wed, 3 Aug 2022 06:41:18 +0000 (14:41 +0800)]
page_alloc: remove inactive initialization
The table pointer variable is assigned the allocation's address first thing
in the function, so no initialization assignment is required and no invalid
pointer can appear.
Feng Tang [Fri, 5 Aug 2022 00:59:03 +0000 (08:59 +0800)]
mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process
Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in
commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple
preferred nodes"), the policy_nodemask_current()'s semantics for this new
policy has been changed, which returns 'preferred' nodes instead of
'allowed' nodes.
With the changed semantics of policy_nodemask_current(), a task with the
MPOL_PREFERRED_MANY policy could fail to get its reservation even though
it can fall back to other nodes (either defined by cpusets or all online
nodes) for that reservation, failing mmap calls unnecessarily early.
The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
because it, unlike MPOL_BIND, does not pose any actual hard constraint.
Michal suggested that policy_nodemask_current() is only used by hugetlb
and could be moved to hugetlb code with a more explicit name to enforce the
'allowed' semantics, for which only the MPOL_BIND policy matters.
apply_policy_zone() is made extern to be called in hugetlb code and its
return value is changed to bool.
Abel Wu [Thu, 11 Aug 2022 12:41:57 +0000 (20:41 +0800)]
mm/mempolicy: fix lock contention on mems_allowed
The mems_allowed field can be modified by other tasks, so it isn't safe to
access it with alloc_lock unlocked even in the current process context.
Say there are two tasks: A from cpusetA is performing set_mempolicy(2),
and B is changing cpusetA's cpuset.mems:
A (set_mempolicy)                B (echo xx > cpuset.mems)
-------------------------------------------------------
pol = mpol_new();
                                 update_tasks_nodemask(cpusetA) {
                                   foreach t in cpusetA {
                                     cpuset_change_task_nodemask(t) {
mpol_set_nodemask(pol) {
                                       task_lock(t); // t could be A
  new = f(A->mems_allowed);
                                       update t->mems_allowed;
  pol.create(pol, new);
                                       task_unlock(t);
}
                                     }
                                   }
                                 }
task_lock(A);
A->mempolicy = pol;
task_unlock(A);
In this case A's pol->nodes is computed from the old mems_allowed, and
could be inconsistent with A's new mems_allowed.
It is different when replacing vmas' policy: pol->nodes only goes
wild when current_cpuset_is_being_rebound():
A (mbind)                        B (echo xx > cpuset.mems)
-------------------------------------------------------
pol = mpol_new();
mmap_write_lock(A->mm);
                                 cpuset_being_rebound = cpusetA;
                                 update_tasks_nodemask(cpusetA) {
                                   foreach t in cpusetA {
                                     cpuset_change_task_nodemask(t) {
mpol_set_nodemask(pol) {
                                       task_lock(t); // t could be A
  mask = f(A->mems_allowed);
                                       update t->mems_allowed;
  pol.create(pol, mask);
                                       task_unlock(t);
}
                                     }
foreach v in A->mm {
  if (cpuset_being_rebound == cpusetA)
    pol.rebind(pol, cpuset.mems);
  v->vma_policy = pol;
}
mmap_write_unlock(A->mm);
                                     mmap_write_lock(t->mm);
                                     mpol_rebind_mm(t->mm);
                                     mmap_write_unlock(t->mm);
                                   }
                                 }
                                 cpuset_being_rebound = NULL;
In this case, the cpuset.mems, which has already been updated, is finally
used for calculating pol->nodes, rather than A->mems_allowed. So it is OK
to call mpol_set_nodemask() with alloc_lock unlocked when doing mbind(2).
mm/cma_debug: show complete cma name in debugfs directories
Currently only 12 characters of the CMA name are used for the debug
directories, whereas the CMA name can be up to CMA_MAX_NAME (= 64)
characters long. A side problem with this is that two CMA areas sharing
the first 12 characters would end up trying to create directories with the
same name, which fails with -EEXIST and thus can limit the CMA debug
functionality.
The 'cma-' prefix was used initially, when CMA areas didn't have any names
and were represented by simple integer values. Since each CMA area now has
its own name, drop the 'cma-' prefix for the CMA debug directories;
creating them under /sys/kernel/debug/cma/ already makes it evident that
they are for CMA debugging.
pool->size_class array elements can't be NULL, so this check
is not needed.
Throughout the code, we assign pool->size_class[i] values that are
not NULL. Releasing memory for these values occurs in the
zs_destroy_pool() function, which also releases and destroys the pool.
In addition, zs_stats_size_show() and async_free_zspage(), which
iterate over the array in a similar way, don't check it for a NULL
pointer.
Alexey Romanov [Thu, 11 Aug 2022 15:37:54 +0000 (18:37 +0300)]
zsmalloc: zs_object_copy: add clarifying comment
Patch series "tidy up zsmalloc implementation"
This patchset removes some unnecessary checks and adds a clarifying
comment. While analysing the zs_object_copy() function code, I spent some
time understanding what the kunmap_atomic(d_addr) call is for. It seems
that this point is not trivial and is worth a comment.
This patch (of 2):
It's not obvious why the kunmap_atomic(d_addr) call is needed.
Axel Rasmussen [Mon, 8 Aug 2022 17:56:14 +0000 (10:56 -0700)]
selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
This new mode was recently added to the userfaultfd selftest. We want to
exercise both userfaultfd(2) and /dev/userfaultfd, so add both
test cases to the script.
Axel Rasmussen [Mon, 8 Aug 2022 17:56:12 +0000 (10:56 -0700)]
userfaultfd: selftests: modify selftest to use /dev/userfaultfd
We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep
working into the future, so just run the test twice, using each interface.
Instead of always testing both userfaultfd(2) and /dev/userfaultfd, let
the user choose which to test.
As with other test features, change the behavior based on a new command
line flag. Introduce the idea of "test mods", which are generic (not
specific to a test type) modifications to the behavior of the test. This
is sort of borrowed from this RFC patch series [1], but simplified a bit.
The benefit is, in "typical" configurations this test is somewhat slow
(say, 30sec or something). Testing both clearly doubles it, so it may not
always be desirable, as users are likely to use one or the other, but
never both, in the "real world".
Axel Rasmussen [Mon, 8 Aug 2022 17:56:11 +0000 (10:56 -0700)]
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in commit 37cd0575b8 ("userfaultfd: add UFFD_USER_MODE_ONLY")
we changed things so that, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
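A minimal userspace sketch of creating a userfaultfd through the new device, assuming the USERFAULTFD_IOC_NEW ioctl added by this series and new-enough uapi headers:
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Create a userfaultfd through the misc device instead of userfaultfd(2);
 * access is controlled by ordinary file permissions on /dev/userfaultfd. */
static int uffd_open_dev(int flags)
{
        int fd, uffd;

        fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
        if (fd < 0)
                return -1;

        /* The returned userfaultfd can handle kernel faults without
         * CAP_SYS_PTRACE or the unprivileged_userfaultfd sysctl. */
        uffd = ioctl(fd, USERFAULTFD_IOC_NEW, flags);
        close(fd);
        return uffd;
}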
Axel Rasmussen [Mon, 8 Aug 2022 17:56:10 +0000 (10:56 -0700)]
selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
Patch series "userfaultfd: add /dev/userfaultfd for fine grained access
control", v7.
Why not ...?
============
- Why not /proc/[pid]/userfaultfd? Two main points (additional discussion [1]):
- /proc/[pid]/* files are all owned by the user/group of the process, and
they don't really support chmod/chown. So, without extending procfs it
doesn't solve the problem this series is trying to solve.
- The main argument *for* this was to support creating UFFDs for remote
processes. But, that use case clearly calls for CAP_SYS_PTRACE, so to
support this we could just use the UFFD syscall as-is.
- Why not use a syscall? Access to syscalls is generally controlled by
capabilities. We don't have a capability which is used for userfaultfd access
without also granting more / other permissions as well, and adding a new
capability was rejected [2].
- It's possible a LSM could be used to control access instead, but I have
some concerns. I don't think this approach would be as easy to use,
particularly if we were to try to solve this with something heavyweight
like SELinux. Maybe we could pursue adding a new LSM specifically for
this user case, but it may be too narrow of a case to justify that.
This not being included was just a simple oversight. There are certain
features (like minor fault support) which are only enabled on shared
mappings, so without including hugetlb_shared we actually lose a
significant amount of test coverage.
Kenneth Lee [Mon, 8 Aug 2022 22:00:19 +0000 (15:00 -0700)]
mm/damon/dbgfs: use kmalloc for allocating only one element
Use kmalloc(...) rather than kmalloc_array(1, ...) because the number of
elements we are specifying in this case is 1, kmalloc would accomplish the
same thing and we can simplify.
Since commit 5d1fd5dc877b ("mm,hwpoison: introduce MF_MSG_UNSPLIT_THP"),
the action_result(,MF_MSG_UNSPLIT_THP,) call shows the memory error event
in memory_failure(), so the pr_info() in try_to_split_thp_page() is only
needed in soft_offline_in_use_page().
Meanwhile this also fixes the unexpected prefix for "thp split failed"
caused by commit 96f96763de26 ("mm: memory-failure: convert to pr_fmt()").
Rik van Riel [Tue, 9 Aug 2022 18:24:57 +0000 (14:24 -0400)]
mm: align larger anonymous mappings on THP boundaries
Align larger anonymous memory mappings on THP boundaries by going through
thp_get_unmapped_area if THPs are enabled for the current process.
With this patch, larger anonymous mappings are now THP aligned. When a
malloc library allocates a 2MB or larger arena, that arena can now be
mapped with THPs right from the start, which can result in better TLB hit
rates and execution time.
selftests/vm: add selftest to verify multi THP collapse
Add support to allocate and verify collapse of multiple hugepage-sized
regions into multiple THPs.
Add "nr" argument to check_huge() that instructs check_huge() to check for
exactly "nr_hpages" THPs. This has the added benefit of now being able to
check for exactly 0 THPs, and so callsites that previously checked the
negation of exactly 1 THP are now more correct.
->collapse struct collapse_context hook has been expanded with a
"nr_hpages" argument to collapse "nr_hpages" hugepages. The
collapse_full() test has been repurposed to collapse 4 THPs at once. It
is expected more tests will want to test multi THP collapse (e.g.
file/shmem).
This is of particular benefit to madvise collapse context given that it
may do many THP collapses during a single syscall.
selftests/vm: add MADV_COLLAPSE collapse context to selftests
Add madvise collapse context to hugepage collapse selftests. This context
is tested with /sys/kernel/mm/transparent_hugepage/enabled set to "never"
in order to avoid unwanted interaction with khugepaged during testing.
Also, refactor updates to sysfs THP settings using a stack so that the THP
settings from nested callers can be restored.
Modularize the collapse action of khugepaged collapse selftests by
introducing a struct collapse_context which specifies how to collapse a
given memory range and the expected semantics of the collapse. This can
be reused later to test other collapse contexts.
Additionally, all tests have logic that checks if a collapse occurred via
reading /proc/self/smaps, and report if this is different than expected.
Move this logic into the per-context ->collapse() hook instead of
repeating it in every test.
mm/madvise: add MADV_COLLAPSE to process_madvise()
Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting collapse of its own memory.
This is useful for the development of userspace agents that seek to
optimize THP utilization system-wide by using userspace signals to
prioritize what memory is most deserving of being THP-backed.
mm/khugepaged: rename prefix of shared collapse functions
The following functions are shared between khugepaged and madvise collapse
contexts. Replace the "khugepaged_" prefix with generic "hpage_collapse_"
prefix in such cases:
Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
a synchronous collapse of memory at their own expense.
The benefits of this approach are:
* CPU is charged to the process that wants to spend the cycles for the
THP
* Avoid unpredictable timing of khugepaged collapse
Semantics
This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
from the others. This implies a hugepage cannot cross a VMA boundary. If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of the memory specified.
The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range. The memory ranges must span at least one
hugepage-sized region.
All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage. Unmapped pages will have their data directly
initialized to 0 in the new hugepage. However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).
Allocation for the new hugepage may enter direct reclaim and/or
compaction, regardless of VMA flags. When the system has multiple NUMA
nodes, the hugepage will be allocated from the node providing the most
native pages. This operation acts on the current state of the
specified process and makes no persistent changes or guarantees on how
pages will be mapped, constructed, or faulted in the future.
Return Value
If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful. On success, process_madvise(2)
returns the number of bytes advised, and madvise(2) returns 0. Else, -1
is returned and errno is set to indicate the error for the most-recently
attempted hugepage collapse. Note that many failures might have occurred,
since the operation may continue to collapse in the event a single
hugepage-sized/aligned region fails.
ENOMEM Memory allocation failed or VMA not found
EBUSY Memcg charging failed
EAGAIN Required resource temporarily unavailable. Try again
might succeed.
EINVAL Other error: no PMD found, subpage doesn't have the Present
       bit set, "special" page not backed by struct page, VMA
       incorrectly sized, address not page-aligned, ...
Most notable here is ENOMEM and EBUSY (new to madvise) which are intended
to provide the caller with actionable feedback so they may take an
appropriate fallback measure.
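A minimal usage sketch, assuming headers new enough to define MADV_COLLAPSE (the fallback value below follows the asm-generic uapi definition as I understand it and is an assumption; check your headers):
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* assumed asm-generic value */
#endif

/* Ask the kernel to synchronously collapse [addr, addr + len) into THPs. */
static int try_collapse(void *addr, size_t len)
{
        /* The range must be page-aligned and span at least one hugepage-
         * sized region that is already partly backed by memory. */
        if (madvise(addr, len, MADV_COLLAPSE)) {
                perror("madvise(MADV_COLLAPSE)");
                return -1;
        }
        return 0;       /* all covered regions are now (or already were) PMD-mapped */
}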
Use Cases
An immediate user of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED;
zapping the pmd. Later, when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
coverage and dTLB performance. TCMalloc is such an implementation that
could benefit from this[2].
Only privately-mapped anon memory is supported for now, but additional
support for file, shmem, and HugeTLB high-granularity mappings[2] is
expected. File and tmpfs/shmem support would permit:
* Backing executable text by THPs. Current support provided by
CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
might impair services from serving at their full rated load after
(re)starting. Tricks like mremap(2)'ing text onto anonymous memory to
immediately realize iTLB performance prevents page sharing and demand
paging, both of which increase steady state memory footprint. With
MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
and lower RAM footprints.
* Backing guest memory by hugepages after the memory contents have been
migrated in native-page-sized chunks to a new host, in a
userfaultfd-based live-migration stack.
mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages. This is required by MADV_COLLAPSE which necessarily needs to know
what hugepage-aligned/sized regions are already pmd-mapped.
In order to determine if a pmd already maps a hugepage, refactor
mm_find_pmd():
Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
fix buggy race with THP fault") behavior. ksm was the only caller that
explicitly wanted a pte-mapping pmd, so open-code the pte-mapping logic
there (pmd_present() and pmd_trans_huge() checks).
Undo revert change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy
race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
and use mm_find_pmd() instead.
mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
hugepage_vma_check() is the authority on determining if a VMA is eligible
for THP allocation/collapse, and currently enforces the sysfs THP
settings. Add a flag to disable these checks. For now, only apply this
arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled.
We can expand this to shmem, which uses
/sys/kernel/transparent_hugepage/shmem_enabled, later.
Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check
THP flag in hugepage_vma_check()", this check also didn't check "never"
THP mode. As such, this restores the previous behavior of
collapse_pte_mapped_thp() where sysfs THP settings are ignored. See
comment in code for justification why this is OK.
mm/khugepaged: add flag to predicate khugepaged-only behavior
Add .is_khugepaged flag to struct collapse_control so khugepaged-specific
behavior can be elided by MADV_COLLAPSE context.
Start by protecting khugepaged-specific heuristics by this flag. In
MADV_COLLAPSE, the user presumably has reason to believe the collapse will
be beneficial and khugepaged heuristics shouldn't prevent the user from
doing so:
mm/khugepaged: propagate enum scan_result codes back to callers
Propagate enum scan_result codes back through return values of
functions downstream of khugepaged_scan_file() and
khugepaged_scan_pmd() to inform callers if the operation was
successful, and if not, why.
Since khugepaged_scan_pmd()'s return value already has a specific meaning
(whether mmap_lock was unlocked or not), add a bool* argument to
khugepaged_scan_pmd() to retrieve this information.
Change khugepaged to take action based on the return values of
khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting deep
within the collapsing functions themselves.
hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be more
consistent with enum scan_result propagation.
Remove dependency on error pointers to communicate to khugepaged that
allocation failed and it should sleep; instead just use the result of the
scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
The following code, currently duplicated in collapse_huge_page() and
collapse_file(), handles the allocation and memcg charge:
        new_page = khugepaged_alloc_page(hpage, gfp, node);
        if (!new_page) {
                result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                goto out;
        }
        if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                result = SCAN_CGROUP_CHARGE_FAIL;
                goto out;
        }
        count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
Also, "node" is passed as an argument to both collapse_huge_page() and
collapse_file() and obtained the same way, via
khugepaged_find_target_node().
Move all this into a new helper, alloc_charge_hpage(), and remove the
duplicate code from collapse_huge_page() and collapse_file(). Also,
simplify khugepaged_alloc_page() by returning a bool indicating allocation
success instead of a copy of the allocated struct page *.
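A hedged sketch of what the consolidated helper might look like, based only on the description above (the signature, the gfp handling and the khugepaged_find_target_node(cc) call are assumptions; the actual code in mm/khugepaged.c may differ):
static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
                              struct collapse_control *cc)
{
        gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
        int node = khugepaged_find_target_node(cc);

        /* Replaces the duplicated block quoted earlier. */
        if (!khugepaged_alloc_page(hpage, gfp, node))
                return SCAN_ALLOC_HUGE_PAGE_FAIL;
        if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
                return SCAN_CGROUP_CHARGE_FAIL;
        count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
        return SCAN_SUCCEED;
}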
Modularize hugepage collapse by introducing struct collapse_control. This
structure serves to describe the properties of the requested collapse, as
well as serve as a local scratch pad to use during the collapse itself.
Start by moving the global per-node khugepaged statistics into this new
structure. Note that this structure is still statically allocated since
CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating a
MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
Yang Shi [Wed, 6 Jul 2022 23:59:20 +0000 (16:59 -0700)]
mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
Patch series "mm: userspace hugepage collapse", v7.
Introduction
--------------------------------
This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.
This idea was introduced by David Rientjes[5].
Interface
--------------------------------
The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.
process_madvise(2)
Performs a synchronous collapse of the native pages
mapped by the list of iovecs into transparent hugepages.
This operation is independent of the system THP sysfs settings,
but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.
THP allocation may enter direct reclaim and/or compaction.
When a range spans multiple VMAs, the semantics of the collapse
over each VMA are independent from the others.
Caller must have CAP_SYS_ADMIN if not acting on self.
Return value follows existing process_madvise(2) conventions. A
“success” indicates that all hugepage-sized/aligned regions
covered by the provided range were either successfully
collapsed, or were already pmd-mapped THPs.
madvise(2)
Equivalent to process_madvise(2) on self, with 0 returned on
“success”.
Current Use-Cases
--------------------------------
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system which might impair services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevents
page sharing and demand paging, both of which increase steady state
memory footprint. With MADV_COLLAPSE, we get the best of both
worlds: Peak upfront performance and lower RAM footprints. Note
that subsequent support for file-backed memory is required here.
(2) malloc() implementations that manage memory in hugepage-sized
chunks, but sometimes subrelease memory back to the system in
native-sized chunks via MADV_DONTNEED; zapping the pmd. Later,
when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
hugepage coverage and dTLB performance. TCMalloc is such an
implementation that could benefit from this[6]. A prior study of
Google internal workloads during evaluation of Temeraire, a
hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
all cpu cycles were spent in dTLB stalls, and that increasing
hugepage coverage by even small amount can help with that[7].
(3) userfaultfd-based live migration of virtual machines satisfy UFFD
faults by fetching native-sized pages over the network (to avoid
latency of transferring an entire hugepage). However, after guest
memory has been fully copied to the new host, MADV_COLLAPSE can
be used to immediately increase guest performance. Note that
subsequent support for file/shmem-backed memory is required here.
(4) HugeTLB high-granularity mapping allows a HugeTLB page to
    be mapped at different levels in the page tables[8]. As it's not
    "transparent" like THP, HugeTLB high-granularity mappings require
    an explicit user API. It is intended that MADV_COLLAPSE be co-opted
    for this use case[9]. Note that subsequent support for HugeTLB
    memory is required here.
Future work
--------------------------------
Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.
One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps. For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].
This patch (of 17):
khugepaged has an optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated-but-failed-to-collapse huge page to
the next loop. CONFIG_NUMA doesn't do so since the next loop may try to
collapse a huge page from a different node, so it doesn't make too much
sense to carry it.
But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, which means a huge page may be allocated
even though there is no suitable range for collapsing. The page would then
just be freed if khugepaged has already made enough progress. This
could make a NUMA=n run have 5 times as many thp_collapse_alloc events as a
NUMA=y run. The many pointless THP allocations actually make things worse
and render the optimization pointless.
This could be fixed by carrying the huge page across scans, but it will
complicate the code further and the huge page may be carried indefinitely.
But if we take one step back, the optimization itself seems not worth
keeping nowadays since:
* Not too many users build NUMA=n kernels nowadays, even though the kernel
is actually running on a non-NUMA machine. Some small devices may run a
NUMA=n kernel, but I don't think they actually use THP.
* Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists"), THP could be cached by pcp. This actually
somehow does the job done by the optimization.
Binyi Han [Sun, 4 Sep 2022 07:46:47 +0000 (00:46 -0700)]
mm: fix dereferencing possible ERR_PTR
The Smatch checker complains that 'secretmem_mnt' may be dereferencing a
possible ERR_PTR(). Let the function return if 'secretmem_mnt' is an
ERR_PTR, to avoid dereferencing it.
The recent folio conversion changed the VM_BUG_ON() to dump the folio
we're storing instead of the entry we retrieved. This was a mistake;
the entry we retrieved is the more interesting page to dump.
mm/damon/dbgfs: fix memory leak when using debugfs_lookup()
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time. Fix this up by properly calling
dput().
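A small kernel-context sketch of the pattern, for illustration (the actual call sites are in mm/damon/dbgfs.c):
#include <linux/dcache.h>
#include <linux/debugfs.h>

static void dbgfs_use_and_release(const char *name, struct dentry *parent)
{
        struct dentry *dentry = debugfs_lookup(name, parent);

        if (!dentry)
                return;
        /* ... use the looked-up dentry ... */
        dput(dentry);   /* drop the reference debugfs_lookup() took */
}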
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte the dirty bit is supposed to be carried over to
the underlying page to prevent it being lost.
Currently migrate_vma_*() can only be used for private anonymous mappings.
That means loss of the dirty bit usually doesn't result in data loss
because these pages are typically not file-backed. However pages may be
backed by swap storage which can result in data loss if an attempt is made
to migrate a dirty page that doesn't yet have the PageDirty flag set.
In this case migration will fail due to unexpected references but the
dirty pte bit will be lost. If the page is subsequently reclaimed data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.
Prevent this by copying the dirty bit to the page when removing the pte to
match what try_to_migrate_one() does.
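A hedged sketch of the fix described above, in kernel context while the PTL is held (the helper name is illustrative; the exact code in mm/migrate_device.c may differ):
#include <linux/mm.h>
#include <linux/pgtable.h>

/* Clear a migrating PTE without losing its dirty bit; PTL must be held. */
static void clear_migrating_pte(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, struct page *page)
{
        pte_t pte = ptep_get_and_clear(mm, addr, ptep);

        /* Transfer the hardware dirty bit to the page so a later reclaim
         * writes the data back to swap, matching try_to_migrate_one(). */
        if (pte_dirty(pte))
                folio_mark_dirty(page_folio(page));
}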
When clearing a PTE the TLB should be flushed whilst still holding the PTL
to avoid a potential race with madvise/munmap/etc. For example consider
the following sequence:
CPU0                                 CPU1
----                                 ----
migrate_vma_collect_pmd()
pte_unmap_unlock()
                                     madvise(MADV_DONTNEED)
                                     -> zap_pte_range()
                                        pte_offset_map_lock()
                                        [ PTE not present, TLB not flushed ]
                                        pte_unmap_unlock()
                                     [ page is still accessible via stale TLB ]
flush_tlb_range()
In this case the page may still be accessed via the stale TLB entry after
madvise returns. Fix this by flushing the TLB while holding the PTL.
Naohiro Aota [Wed, 24 Aug 2022 08:47:26 +0000 (17:47 +0900)]
x86/mm: disable instrumentations of mm/pgprot.c
Commit 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform")
moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c. As
a result, the accesses are now targets of KASAN (and other
instrumentations), leading to the crash during the boot process.
Disable the instrumentations for pgprot.c like commit 67bb8e999e0a
("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and
mm/tlb.c").
Before this patch, my AMD machine could not boot since v6.0-rc1 with KASAN
enabled, with nothing printed. After the change, it boots up
successfully.
Dan Williams [Fri, 26 Aug 2022 17:18:14 +0000 (10:18 -0700)]
mm/memory-failure: fall back to vma_address() when ->notify_failure() fails
In the case where a filesystem is polled to take over the memory failure
and receives -EOPNOTSUPP it indicates that page->index and page->mapping
are valid for reverse mapping the failure address. Introduce
FSDAX_INVALID_PGOFF to distinguish when add_to_kill() is being called from
mf_dax_kill_procs() by a filesystem vs the typical memory_failure() path.
Otherwise, vma_pgoff_address() is called with an invalid fsdax_pgoff which
then trips this failing signature:
kernel BUG at mm/memory-failure.c:319!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 13 PID: 1262 Comm: dax-pmd Tainted: G OE N 6.0.0-rc2+ #62
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:add_to_kill.cold+0x19d/0x209
[..]
Call Trace:
<TASK>
collect_procs.part.0+0x2c4/0x460
memory_failure+0x71b/0xba0
? _printk+0x58/0x73
do_madvise.part.0.cold+0xaf/0xc5
Dan Williams [Fri, 26 Aug 2022 17:17:54 +0000 (10:17 -0700)]
xfs: quiet notify_failure EOPNOTSUPP cases
Patch series "mm, xfs, dax: Fixes for memory_failure() handling".
I failed to run the memory error injection section of the ndctl test suite
on linux-next prior to the merge window and as a result some bugs were
missed. While the new enabling targeted reflink enabled XFS filesystems
the bugs cropped up in the surrounding cases of DAX error injection on
ext4-fsdax and device-dax.
One new assumption / clarification in this set is the notion that if a
filesystem's ->notify_failure() handler returns -EOPNOTSUPP, then it must
be the case that the fsdax usage of page->index and page->mapping are
valid. I am fairly certain this is true for xfs_dax_notify_failure(), but
would appreciate another set of eyes.
This patch (of 4):
XFS always registers dax_holder_operations regardless of whether the
filesystem is capable of handling the notifications. The expectation is
that if the notify_failure handler cannot run then there are no scenarios
where it needs to run. In other words the expected semantic is that
page->index and page->mapping are valid for memory_failure() when the
conditions that cause -EOPNOTSUPP in xfs_dax_notify_failure() are present.
A fallback to the generic memory_failure() path is expected so do not warn
when that happens.