Shakeel Butt [Thu, 25 Aug 2022 00:05:04 +0000 (00:05 +0000)]
mm: page_counter: remove unneeded atomic ops for low/min
Patch series "memcg: optimize charge codepath", v2.
Recently the Linux networking stack moved from a very old per-socket
pre-charge cache to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that for network
traffic workloads, the memcg charging codepath can become a bottleneck. The
kernel test robot has also reported this regression[1]. This patch series
tries to improve memcg charging for such workloads.
This patch series implements three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
between usage and high.
(C) Increase the memcg charge batch to 64.
To evaluate the impact of these optimizations, on a 72-CPU machine we
ran the following workload in the root memcg and then compared it with the
scenario where the workload is run in a three-level cgroup hierarchy with
the top level having min and low set up appropriately.
$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally. We can optimize out this atomic operation for one
specific scenario where the workload is using the protection (i.e. min >
0) and the usage is above the protection (i.e. usage > min).
This scenario is actually very common: users want part of their workload
to be protected against external reclaim. The optimization does introduce
a race when the usage is around the protection and concurrent charges and
uncharges trip it over or under the protection. In such cases we might see
lower effective protection, but the subsequent charge/uncharge will
correct it.
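A minimal userspace model of the skipped xchg, for illustration only (the real code lives in mm/page_counter.c, uses atomic_long_t and handles both min and low; the names below are simplified):
#include <stdatomic.h>

struct page_counter_model {
        atomic_long usage;
        long min;               /* protection configured by the user */
        atomic_long min_usage;  /* usage currently protected */
};

/* Called on every charge/uncharge with the updated usage value. */
static void propagate_protected_usage(struct page_counter_model *c, long usage)
{
        long protected = usage < c->min ? usage : c->min;
        long old = atomic_load(&c->min_usage);

        /* Skip the atomic exchange when the protected value has not changed,
         * i.e. the common case where min > 0 and usage > min keeps it pinned
         * at min. */
        if (protected != old)
                atomic_exchange(&c->min_usage, protected);
}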
To evaluate the impact of this optimization, on a 72-CPU machine we ran
the following workload in a three-level cgroup hierarchy with the top level
having min and low set up appropriately, to see if this optimization is
effective for the mentioned case.
$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 14542.5 Mbps (38.7% improvement)
drivers/block/zram/zram_drv.c: do not keep dangling zcomp pointer after zram reset
We do all reset operations under the write lock, so we don't need to save
->disksize and ->comp to stack variables. Another thing is that ->comp is
freed during zram reset, but the comp pointer is not NULLed, so zram keeps
the freed pointer value.
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.
When migration succeeds all pages will have been unpinned so pinning needs
to be retried. Migration can also fail, in which case the pages will also
have been unpinned but the operation should not be retried. If all pages
are in the correct zone nothing will be unpinned and no retry is required.
The logic in check_and_migrate_movable_pages() tracks unnecessary state
and the return codes for each case are difficult to follow. Refactor the
code to clean this up. No behaviour change is intended.
Alistair Popple [Wed, 24 Aug 2022 05:09:51 +0000 (15:09 +1000)]
mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages()
gup_flags is passed to check_and_migrate_movable_pages() so that it can
call either put_page() or unpin_user_page() to drop the page reference.
However check_and_migrate_movable_pages() is only called for
FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass
gup_flags.
mm: skip retry when new limit is not below old one in page_counter_set_max
In page_counter_set_max(), we want to make sure the new limit is not below
the concurrently-changing counter value. We read the counter and check
that the limit is not below the counter before the swap. After the swap,
we read the counter again and retry in case the counter was incremented, as
this may violate the requirement. Even though page_counter_try_charge()
can see the old limit, it is guaranteed that the counter is not above the
old limit after the increment. So if the new limit is not below the
old limit, the counter is guaranteed to be not above the new limit too.
We can skip the retry in this case to optimize a little bit.
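A rough userspace model of the described logic, for illustration only (the actual function is in mm/page_counter.c); the "limit >= old" early exit is the optimization this patch adds:
#include <errno.h>
#include <stdatomic.h>

struct counter_model {
        atomic_long usage;
        atomic_long max;
};

static int counter_set_max(struct counter_model *c, long limit)
{
        for (;;) {
                long usage = atomic_load(&c->usage);
                long old;

                if (usage > limit)
                        return -EBUSY;

                old = atomic_exchange(&c->max, limit);

                /* A racing charge can only have pushed usage up to the old
                 * limit, so if the new limit is not below the old one the
                 * counter cannot be above the new limit either: skip the
                 * retry. */
                if (atomic_load(&c->usage) <= usage || limit >= old)
                        return 0;

                atomic_store(&c->max, old);
        }
}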
Yang Shi [Tue, 16 Aug 2022 18:58:01 +0000 (11:58 -0700)]
mm: memcg: export workingset refault stats for cgroup v1
Workingset refault stats are important and useful metrics to measure how
well the reclaimer and swapping work and how healthy the services are, but
they are only available for cgroup v2. There are still plenty of users on
cgroup v1, so export the stats for cgroup v1.
Baolin Wang [Thu, 18 Aug 2022 07:37:44 +0000 (15:37 +0800)]
mm/damon: replace pmd_huge() with pmd_trans_huge() for THP
pmd_huge() is usually used to indicate a pmd-level hugetlb page. However, a
pmd-mapped huge page can only be a THP in damon_mkold_pmd_entry() or
damon_young_pmd_entry(), so replace pmd_huge() with pmd_trans_huge() in
this case to make the code more readable, according to the discussion [1].
Baolin Wang [Thu, 18 Aug 2022 07:37:43 +0000 (15:37 +0800)]
mm/damon: validate if the pmd entry is present before accessing
pmd_huge() is used to validate whether the pmd entry is mapped by a huge
page, which also includes the case of a non-present (migration or
hwpoisoned) pmd entry on the arm64 and x86 architectures. This means that
pmd_pfn() cannot get the correct pfn number for a non-present pmd entry,
which will cause damon_get_page() to get an incorrect page struct (it may
also be NULL via pfn_to_online_page()), making the access statistics
incorrect.
This means that DAMON may make incorrect decisions based on the incorrect
statistics; for example, DAMON may fail to reclaim a cold page in time
because the cold page was mistakenly regarded as accessed when the
DAMOS_PAGEOUT operation is specified.
Moreover, it makes no sense to waste time getting the page of a non-present
entry. Just treat it as not accessed and skip it, which keeps consistency
with non-present pte-level entries.
So add pmd entry present validation to fix the above issues.
Yin Fengwei [Wed, 10 Aug 2022 06:49:07 +0000 (14:49 +0800)]
mm: release private data before split THP
If there is private data attached to a THP, the refcount of THP will be
increased and will prevent the THP from being split. Attempt to release
any private data attached to the THP before attempting the split to
increase the chance of splitting successfully.
A memory failure issue was hit during HW error injection testing with a
5.18 kernel + xfs as rootfs. The test was killed and a system reboot
was required to re-run the test.
The issue was tracked down to a THP split failure, which caused the memory
failure not to be handled. The page dump showed:
Qi Zheng [Thu, 18 Aug 2022 08:27:48 +0000 (16:27 +0800)]
mm: thp: remove redundant pgtable check in set_huge_zero_page()
When the pgtable is NULL in the set_huge_zero_page(), we should not
increment the count of PTE page table pages by calling mm_inc_nr_ptes().
Otherwise we may receive the following warning when the mm exits:
BUG: non-zero pgtables_bytes on freeing mm
Now we can't observe the above warning since only
do_huge_pmd_anonymous_page() invokes set_huge_zero_page() and the pgtable
can not be NULL.
Therefore, instead of moving mm_inc_nr_ptes() to the non-NULL branch of
pgtable, it is better to remove the redundant pgtable check directly.
We can choose to copy three contiguous tail pages' content to the first
three pages instead of copying one by one to simplify the code and reduce
code size from 229 bytes to 63 bytes. The BUILD_BUG_ON() aims to avoid
out-of-bounds accesses.
Miaohe Lin [Thu, 18 Aug 2022 13:00:16 +0000 (21:00 +0800)]
mm, hwpoison: avoid trying to unpoison reserved page
For reserved pages, the HWPoison flag will be set without increasing the
page refcnt. So we shouldn't even try to unpoison these pages and thus
decrease the page refcnt unexpectedly. Add a PageReserved() check to filter
this case out and remove the now-unneeded zero page check (the zero page is
reserved).
Miaohe Lin [Thu, 18 Aug 2022 13:00:15 +0000 (21:00 +0800)]
mm, hwpoison: kill procs if unmap fails
If try_to_unmap() fails, the hwpoisoned page still resides in the address
space of some processes. We should kill these processes or the hwpoisoned
page might be consumed later. collect_procs() is always called to collect
relevant processes now so they can be killed later if unmap fails.
Miaohe Lin [Tue, 23 Aug 2022 03:23:44 +0000 (11:23 +0800)]
mm, hwpoison: fix possible use-after-free in mf_dax_kill_procs()
After kill_procs(), tk will be freed without being removed from the
to_kill list. In the next iteration, the freed list entry in the to_kill
list will be accessed, leading to a use-after-free issue. Add
list_del() in kill_procs() to fix the issue.
Miaohe Lin [Thu, 18 Aug 2022 13:00:13 +0000 (21:00 +0800)]
mm, hwpoison: fix extra put_page() in soft_offline_page()
When hwpoison_filter() refuses to soft offline a page, the page refcnt
incremented previously by MF_COUNT_INCREASED would have been consumed via
get_hwpoison_page() if ret <= 0. So the put_ref_page() here will put the
extra one. Remove it to fix the issue.
Miaohe Lin [Thu, 18 Aug 2022 13:00:12 +0000 (21:00 +0800)]
mm, hwpoison: fix page refcnt leaking in unpoison_memory()
When free_raw_hwp_pages() fails to do its work, the refcnt of the hugetlb
page would have been incremented if ret > 0. Use put_page() to fix the
refcnt leak in this case.
Miaohe Lin [Thu, 18 Aug 2022 13:00:11 +0000 (21:00 +0800)]
mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb()
Patch series "A few fixup patches for memory-failure", v2.
This series contains a few fixup patches to fix incorrect update of page
refcnt, fix possible use-after-free issue and so on. More details can be
found in the respective changelogs.
This patch (of 6):
When hwpoison_filter() refuses to hwpoison a hugetlb page, the refcnt of
the page would have been incremented if res == 1. Use put_page() to fix
the refcnt leak in this case.
mm: fix use-after free of page_ext after race with memory-offline
Below is one path where a race between page_ext access and offlining of
the respective memory block causes a use-after-free on access of the
page_ext structure.
page_ext access                       memory offline
---------------                       --------------
b) PageBuddy check fails, thus
   proceed to get the page_owner
   information through page_ext
   access:
   page_ext = lookup_page_ext(page);
                                      migrate_pages();
                                      .................
                                      Since all pages are successfully
                                      migrated as part of the offline
                                      operation, send the MEM_OFFLINE
                                      notification, where for page_ext
                                      it calls:
                                        offline_page_ext() -->
                                          __free_page_ext() -->
                                            free_page_ext() -->
                                              vfree(ms->page_ext)
                                        mem_section->page_ext = NULL
c) Checking the PAGE_EXT flags in
   the page_ext->flags access
   results in the use-after-free
   (leading to translation faults).
As mentioned above, there is really no synchronization between page_ext
access and its freeing during memory offline.
The memory offline steps (roughly) on a memory block are as below:
1) Isolate all the pages.
2) while(1)
   try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
3) Delete the pages from this buddy list.
4) Then free page_ext. (Note: the struct page is still alive, as it is
   freed only during hot remove of the memory, which frees the memmap, a
   step the user might not perform.)
This design leads to a state where struct page is alive but struct
page_ext is freed, even though the latter is ideally part of the former and
just represents extra page flags (check [3] for why this design was
chosen).
The above race is just one example __but the problem persists in the
other paths too involving page_ext->flags access (e.g.
page_is_idle())__.
Fix all the paths where offline races with page_ext access by
synchronizing with the RCU lock; this is achieved in 3 steps (a reader-side
sketch follows the list):
1) Invalidate all the page_ext's of the sections of a memory block by
   storing a flag in the LSB of mem_section->page_ext.
2) Wait until all the existing readers finish working with the
   ->page_ext's, using synchronize_rcu(). Any parallel process that starts
   after this call will not get a page_ext, through lookup_page_ext(), for
   the block on which the offline operation is being performed.
3) Now safely free all the sections' ->page_ext's of the block on which the
   offline operation is being performed.
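A reader-side sketch of the pattern this implies, assuming kernel context (the series wraps this in page_ext_get()/page_ext_put()-style helpers; exact code in mm/page_ext.c may differ):
#include <linux/bitops.h>
#include <linux/page_ext.h>
#include <linux/rcupdate.h>

/* Returns true if @flag is set in the page's page_ext, false if the
 * page_ext has already been invalidated by a parallel memory offline. */
static bool page_ext_flag_test(struct page *page, unsigned long flag)
{
        struct page_ext *page_ext;
        bool ret = false;

        rcu_read_lock();
        page_ext = lookup_page_ext(page);       /* NULL once step 1 has run */
        if (page_ext)
                ret = test_bit(flag, &page_ext->flags);
        rcu_read_unlock();                      /* step 2 waits for readers here */
        return ret;
}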
Note: If synchronize_rcu() takes time then optimizations can be done in
this path through call_rcu()[2].
Thanks to David Hildenbrand for his views/suggestions on the initial
discussion[1] and Pavan kondeti for various inputs on this patch.
Matthew Wilcox [Thu, 18 Aug 2022 21:07:41 +0000 (22:07 +0100)]
mm/vmalloc.c: support HIGHMEM pages in vmap_pages_range_noflush()
If the pages being mapped are in HIGHMEM, page_address() returns NULL.
This probably wasn't noticed before because there aren't currently any
architectures with HAVE_ARCH_HUGE_VMALLOC and HIGHMEM, but it's simpler to
call page_to_phys() and futureproofs us against such configurations
existing.
Kefeng Wang [Mon, 15 Aug 2022 11:10:17 +0000 (19:10 +0800)]
mm: kill find_min_pfn_with_active_regions()
find_min_pfn_with_active_regions() is only called from free_area_init().
Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
and kill find_min_pfn_with_active_regions().
Miaohe Lin [Tue, 16 Aug 2022 13:05:53 +0000 (21:05 +0800)]
mm/hugetlb: make detecting shared pte more reliable
If the page tables are shared, we shouldn't copy or take references. Since
src could have unshared and dst could share with another vma, huge_pte_none()
is used to determine whether dst_pte is shared. But this check isn't
reliable: in fact, a shared pte could be pte_none in the page table. The
page count of the ptep page should be checked here instead in order to
reliably determine whether the pte is shared.
Miaohe Lin [Tue, 16 Aug 2022 13:05:52 +0000 (21:05 +0800)]
mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
The sysfs groups per_node_hstate_attr_group and, when h->demote_order != 0,
hstate_demote_attr_group are created in hugetlb_register_node(). But
these sysfs groups are not removed when the node is unregistered, so they
are leaked. Use sysfs_remove_group() to fix this issue.
Miaohe Lin [Tue, 16 Aug 2022 13:05:50 +0000 (21:05 +0800)]
mm/hugetlb: fix missing call to restore_reserve_on_error()
When huge_add_to_page_cache() fails, the page is freed directly without
calling restore_reserve_on_error() to restore the reserve for newly
allocated pages not in the page cache. Fix this by calling
restore_reserve_on_error() when huge_add_to_page_cache() fails.
Miaohe Lin [Tue, 16 Aug 2022 13:05:49 +0000 (21:05 +0800)]
mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
will be set to NULL. It will then be passed to sysfs_create_group() if
h->demote_order != 0, thus triggering the WARN_ON(!kobj) check. Fix this by
making sure hstate_kobjs[hi] != NULL when calling sysfs_create_group().
Miaohe Lin [Tue, 16 Aug 2022 13:05:48 +0000 (21:05 +0800)]
mm/hugetlb: fix incorrect update of max_huge_pages
Patch series "A few fixup patches for hugetlb".
This series contains a few fixup patches to fix incorrect update of
max_huge_pages, fix WARN_ON(!kobj) in sysfs_create_group() and so on.
More details can be found in the respective changelogs.
This patch (of 6):
target_hstate->max_huge_pages should be incremented by
pages_per_huge_page(h) / pages_per_huge_page(target_hstate) when a page is
demoted. Update max_huge_pages accordingly for consistency.
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload- and system-configuration
dependent. So in this patch, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
candidate promotion pages to match the promotion rate limit. If the
hint page fault latency of a page is less than the hot threshold, we will
try to promote the page; such a page is called a candidate promotion
page.
If the number of the candidate promotion pages in the statistics interval
is much more than the promotion rate limit, the hot threshold will be
decreased to reduce the number of the candidate promotion pages.
Otherwise, the hot threshold will be increased to increase the number of
the candidate promotion pages.
To make the above method work, in each statistics interval the total
number of pages to check (on which the hint page faults occur) and the
hot/cold distribution need to be stable. Because the page tables are
scanned linearly in NUMA balancing, but the hot/cold distribution usually
isn't uniform along the address space, the statistics interval should be
larger than the NUMA balancing scan period. So in this patch the max scan
period is used as the statistics interval, and it works well in our tests.
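A simplified model of the feedback loop described above, for illustration only (the function name, the 10% band and the adjustment step are assumptions; the real logic lives in the NUMA balancing code):
/* Run once per statistics interval (the max NUMA balancing scan period). */
static unsigned long adjust_hot_threshold(unsigned long threshold_ms,
                                          unsigned long nr_candidates,
                                          unsigned long rate_limit_pages)
{
        if (nr_candidates > rate_limit_pages * 11 / 10)
                threshold_ms = threshold_ms * 9 / 10;   /* too many: tighten */
        else if (nr_candidates < rate_limit_pages * 9 / 10)
                threshold_ms = threshold_ms * 11 / 10;  /* too few: relax */

        return threshold_ms ? threshold_ms : 1;         /* keep it non-zero */
}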
memory tiering: rate limit NUMA migration throughput
In NUMA balancing memory tiering mode, if there are hot pages in the slow
memory node and cold pages in the fast memory node, we need to
promote/demote hot/cold pages between the fast and slow memory nodes.
One choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by a high promoting/demoting throughput will
hurt the latency of some workloads because of access latency inflation and
contention on slow memory bandwidth.
A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting, but
the workload latency will be better. This is implemented in this patch as
the page promotion rate limit mechanism.
The number of candidate pages to be promoted to the fast memory node
via NUMA balancing is counted; if the count exceeds the limit specified by
the user, NUMA balancing promotion will be stopped until the next
second.
A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
for the users to specify the limit.
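A rough sketch of the per-second rate limiter described above (names are illustrative; the actual implementation accounts candidate pages in per-node state):
#include <stdbool.h>

struct promote_rate_state {
        unsigned long window_start;     /* jiffies at start of the 1s window */
        unsigned long nr_candidates;    /* candidate pages seen this window */
};

/* Called per candidate page; returns true if promotion may proceed. */
static bool promotion_within_rate_limit(struct promote_rate_state *st,
                                        unsigned long now, unsigned long hz,
                                        unsigned long limit_pages_per_sec,
                                        unsigned long nr_pages)
{
        if (now - st->window_start >= hz) {     /* new one-second window */
                st->window_start = now;
                st->nr_candidates = 0;
        }

        st->nr_candidates += nr_pages;

        /* Over the limit: stop promoting until the next second. */
        return st->nr_candidates <= limit_pages_per_sec;
}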
memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.
To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified.
Essentially, the original NUMA balancing implementation selects the most
recently accessed (MRU) pages to promote. But this isn't a perfect
algorithm for identifying hot pages, because pages with quite low
access frequency may eventually be accessed, given that the NUMA balancing
page table scanning period can be quite long (e.g. 60 seconds). So in this
patchset we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and the hint page
fault, which is a kind of most frequently accessed (MFU) algorithm.
In NUMA balancing memory tiering mode, if there are hot pages in the slow
memory node and cold pages in the fast memory node, we need to
promote/demote hot/cold pages between the fast and slow memory nodes.
One choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by a high promoting/demoting throughput will
hurt the latency of some workloads because of access latency inflation and
contention on slow memory bandwidth.
A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting, but
the workload latency will be better. This is implemented in this patchset
as the page promotion rate limit mechanism.
The promotion hot threshold is workload- and system-configuration
dependent. So in this patchset, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
candidate promotion pages to match the promotion rate limit.
We used the pmbench memory accessing benchmark to test the patchset on a
2-socket server system with DRAM and PMEM installed. The test results are
as follows:
                    pmbench score     promote rate
                     (accesses/s)             MB/s
                    -------------     ------------
base                  146887704.1            725.6
hot selection         165695601.2            544.0
rate limit            162814569.8            165.2
auto adjustment       170495294.0            136.9
From the results above:
With the hot page selection patch [1/3], the pmbench score increases by
about 12.8%, and the promote rate (overhead) decreases by about 25.0%,
compared with the base kernel.
With the rate limit patch [2/3], the pmbench score decreases by about 1.7%,
and the promote rate decreases by about 69.6%, compared with the hot page
selection patch.
With the threshold auto adjustment patch [3/3], the pmbench score increases
by about 4.7%, and the promote rate decreases by about 17.1%, compared with
the rate limit patch.
Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).
To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified. Essentially,
the original NUMA balancing implementation selects the most recently
accessed (MRU) pages to promote. But this isn't a perfect algorithm for
identifying hot pages, because pages with quite low access frequency
may eventually be accessed, given that the NUMA balancing page table
scanning period can be quite long (e.g. 60 seconds). The most frequently
accessed (MFU) algorithm is better.
So, in this patch we implement a better hot page selection algorithm,
based on NUMA balancing page table scanning and hint page faults, as
follows:
- When the page tables of the processes are scanned to change PTE/PMD
  to PROT_NONE, the current time is recorded in struct page as the scan
  time.
- When the page is accessed, a hint page fault will occur. The scan
  time is read from the struct page, and the hint page fault
  latency is defined as
      hint page fault time - scan time
The shorter the hint page fault latency of a page, the higher the
probability that its access frequency is high. So the hint page
fault latency is a better estimate of how hot or cold the page is (see the
sketch below).
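A conceptual sketch of the hot-page test in the hint fault path, for illustration only (helper names are assumptions; the real code reuses the cpupid bits described next):
#include <stdbool.h>

/* Hint page fault latency: time between the PROT_NONE scan and the fault. */
static inline unsigned long hint_fault_latency(unsigned long fault_time,
                                               unsigned long scan_time)
{
        return fault_time - scan_time;
}

/* A page is a promotion candidate if it was touched soon after the scan,
 * i.e. its latency is below the hot threshold (1 second by default). */
static inline bool page_is_hot(unsigned long fault_time,
                               unsigned long scan_time,
                               unsigned long hot_threshold)
{
        return hint_fault_latency(fault_time, scan_time) < hot_threshold;
}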
It's hard to find extra space in struct page to hold the scan time.
Fortunately, we can reuse some bits used by the original NUMA balancing.
NUMA balancing uses some bits in struct page to store the accessing
CPU and PID (see page_cpupid_xchg_last()); these bits are used by the
multi-stage node selection algorithm to avoid migrating pages that are
shared-accessed by multiple NUMA nodes back and forth. But for pages in the
slow memory node, even if they are shared-accessed by multiple NUMA nodes,
as long as the pages are hot they need to be promoted to the fast memory
node. So the accessing CPU and PID information is unnecessary for the
slow memory pages, and we can reuse these bits in struct page to record the
scan time. For the fast memory pages, these bits are used as before.
For the hot threshold, the default value is 1 second, which works well in
our performance tests. All pages with a hint page fault latency < hot
threshold will be considered hot.
It's hard for users to determine the hot threshold, so we don't provide a
kernel ABI to set it, just a debugfs interface for advanced users
to experiment with. We will continue to work on a hot threshold automatic
adjustment mechanism.
The downside of the above method is that the response time to a workload
hot spot change may be much longer. For example:
- A previously cold memory area becomes hot.
- A hint page fault will be triggered, but the hint page fault
  latency isn't shorter than the hot threshold, so the pages will
  not be promoted.
- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.
To mitigate this, if there is enough free space in the fast memory node,
the hot threshold is not used and all pages are promoted upon the
hint page fault for fast response.
Thanks to Zhong Jiang for reporting and testing the fix for a bug when
disabling memory tiering mode dynamically.
Kefeng Wang [Tue, 26 Jul 2022 14:54:28 +0000 (22:54 +0800)]
mm/util.c: add warning if __vm_enough_memory fails
If a process does not have enough memory to allocate a new virtual mapping,
we may see various kinds of errors, e.g. fork cannot allocate memory or a
SIGBUS error in shmem, but it is difficult to confirm them. Let's add some
debug information to make it easy to check this scenario when
__vm_enough_memory() fails.
gfp_migratetype() also expects GFP_RECLAIMABLE and
GFP_MOVABLE|GFP_RECLAIMABLE to be shiftable into MIGRATE_* enum values, so
add some more BUILD_BUG_ONs to reflect this assumption.
mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.
When migration succeeds all pages will have been unpinned so pinning needs
to be retried. This is indicated by returning zero. When all pages are
in the correct zone the number of pinned pages is returned.
However migration can also fail, in which case pages are unpinned and
-ENOMEM is returned. However, if the failure was due to not being able
to isolate a page, zero is returned. This leads to indefinite looping in
__gup_longterm_locked().
Fix this by simplifying the return codes such that zero indicates all
pages were successfully pinned in the correct zone while errors indicate
either pages were migrated and pinning should be retried or that migration
has failed and therefore the pinning operation should fail.
Patch series "A few cleanup patches for hugetlb_cgroup", v2.
This series contains a few cleanup patches to remove an unneeded check, use
a helper macro, remove an unneeded return value, and so on. More details
can be found in the respective changelogs.
This patch (of 5):
When the code reaches here, nr_pages must be > 0. Remove the unneeded
nr_pages > 0 check to simplify the code.
Tarun Sahu [Mon, 1 Aug 2022 07:02:31 +0000 (12:32 +0530)]
Kselftests: remove support of libhugetlbfs from kselftests
libhugetlbfs, the userspace utility for working with hugepages, does not
have any active support. There are only 2 selftests, part of
vm/hmm_test.c, that depend on libhugetlbfs.
This patch modifies the tests so that they will not require the
libhugetlbfs library.
Imran Khan [Sun, 14 Aug 2022 19:53:53 +0000 (05:53 +1000)]
kfence: add sysfs interface to disable kfence for selected slabs.
By default a kfence allocation can happen for any slab object whose size is
up to PAGE_SIZE, as long as that allocation is the first allocation after
expiration of the kfence sample interval. But in certain debugging
scenarios we may be interested in debugging corruptions involving some
specific slab objects like dentry or ext4_* etc. In such cases, limiting
kfence to allocations involving only specific slab objects will increase
the probability of catching the issue, since the kfence pool will not be
consumed by other slab objects.
This patch introduces a sysfs interface
'/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific
slabs. Having the interface work in this way does not impact the
current/default behavior of kfence and allows us to use kfence for
specific slabs (when needed) as well. The decision to skip/use kfence is
taken depending on whether kmem_cache.flags has the (newly introduced)
SLAB_SKIP_KFENCE flag set or not.
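A hedged sketch, assuming kernel context, of the gate this flag implies (the actual check in mm/kfence/core.c may look different; SLAB_SKIP_KFENCE is the flag introduced by this patch):
#include <linux/slab.h>

/* Skip the KFENCE pool for caches that opted out via
 * /sys/kernel/slab/<name>/skip_kfence, i.e. have SLAB_SKIP_KFENCE set. */
static inline bool cache_skips_kfence(const struct kmem_cache *s)
{
        return !!(s->flags & SLAB_SKIP_KFENCE);
}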
Haiyue Wang [Fri, 12 Aug 2022 08:49:21 +0000 (16:49 +0800)]
mm: migration: fix the FOLL_GET failure on following huge page
Not all huge page APIs support the FOLL_GET option, so the move_pages()
syscall will fail to get the page node information for some huge pages.
For example, on x86 with Linux 5.19, the 1GB huge page API follow_huge_pud()
will return a NULL page for FOLL_GET when the move_pages() syscall is
called with a NULL 'nodes' parameter; the 'status' parameter has a '-2'
error in the array.
Note: follow_huge_pud() now supports FOLL_GET in linux 6.0. Link: https://lore.kernel.org/all/[email protected]
But these huge page APIs don't support FOLL_GET:
1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
   It will cause a WARN_ON_ONCE for FOLL_GET.
3. follow_huge_pgd() in mm/hugetlb.c
This is a temporary solution to mitigate the side effect of the race
condition fix, by calling follow_page() with FOLL_GET set for huge pages.
After support for following huge pages with FOLL_GET is done, this fix can
be reverted safely.
Yang Yang [Sat, 13 Aug 2022 08:07:58 +0000 (08:07 +0000)]
mm/vmscan: make the annotations of refaults code at the right place
After the patch "mm/workingset: prepare the workingset detection
infrastructure for anon LRU", we can handle refaults of anonymous
pages too. So the annotations of refaults should cover both anonymous
pages and file pages.
Yixuan Cao [Fri, 12 Aug 2022 15:55:15 +0000 (23:55 +0800)]
tools/vm/page_owner_sort: fix -f option
The -f option is meant to filter out the information of blocks whose memory
has not been released; I noticed some blocks should not be filtered out.
Commit 9cc7e96aa846 ("mm/page_owner: record timestamp and pid") records
the allocation timestamp (ts_nsec) of all pages.
Commit 866b48526217 ("mm/page_owner: record the timestamp of all pages
during free") records the free timestamp (free_ts_nsec) of all pages.
When the page is allocated for the first time, the initial value of
free_ts_nsec is 0, and the corresponding time will be recorded when the
page is released. But during reallocation free_ts_nsec is not reset
to 0. In particular, when page migration occurs, these two
timestamps will be the same.
Now page_owner_sort removes all text blocks whose free_ts_nsec is not 0
when using the -f option. However, this can only select pages allocated
for the first time. If a freed page is reallocated, free_ts_nsec will be
less than ts_nsec; if page migration occurs, the two timestamps will be
equal. These cases should be considered as pages that are not released.
So fix the function is_need() to keep text blocks that meet the above
two conditions when using the -f option.
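A sketch of the corrected filter condition described above, for illustration only (is_need() in tools/vm/page_owner_sort.c has more checks than shown here):
#include <stdbool.h>
#include <stdint.h>

/* A block counts as "not released" (and is kept by -f) if its page was
 * never freed, was reallocated after a free, or was migrated. */
static bool block_not_released(uint64_t ts_nsec, uint64_t free_ts_nsec)
{
        return free_ts_nsec == 0 ||             /* allocated, never freed */
               free_ts_nsec < ts_nsec ||        /* freed, then reallocated */
               free_ts_nsec == ts_nsec;         /* page migration */
}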
Li kunyu [Wed, 3 Aug 2022 06:41:18 +0000 (14:41 +0800)]
page_alloc: remove inactive initialization
The table pointer variable is assigned the allocation's address first thing
in the function, so no initialization assignment is required and no invalid
pointer can appear.
Feng Tang [Fri, 5 Aug 2022 00:59:03 +0000 (08:59 +0800)]
mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process
Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in
commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple
preferred nodes"), the policy_nodemask_current()'s semantics for this new
policy has been changed, which returns 'preferred' nodes instead of
'allowed' nodes.
With the changed semantics of policy_nodemask_current(), a task with the
MPOL_PREFERRED_MANY policy could fail to get its reservation even though
it can fall back to other nodes (either defined by cpusets or all online
nodes) for that reservation, failing mmap calls unnecessarily early.
The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
because it, unlike MPOL_BIND, does not pose any actual hard constraint.
Michal suggested that policy_nodemask_current() is only used by hugetlb
and could be moved to hugetlb code with a more explicit name to enforce the
'allowed' semantics, for which only the MPOL_BIND policy matters.
apply_policy_zone() is made extern to be called in hugetlb code and its
return value is changed to bool.
Abel Wu [Thu, 11 Aug 2022 12:41:57 +0000 (20:41 +0800)]
mm/mempolicy: fix lock contention on mems_allowed
The mems_allowed field can be modified by other tasks, so it isn't safe to
access it with alloc_lock unlocked even in the current process context.
Say there are two tasks: A from cpusetA is performing set_mempolicy(2),
and B is changing cpusetA's cpuset.mems:
A (set_mempolicy)                B (echo xx > cpuset.mems)
-------------------------------------------------------
pol = mpol_new();
                                 update_tasks_nodemask(cpusetA) {
                                   foreach t in cpusetA {
                                     cpuset_change_task_nodemask(t) {
mpol_set_nodemask(pol) {
                                       task_lock(t); // t could be A
  new = f(A->mems_allowed);
                                       update t->mems_allowed;
  pol.create(pol, new);
                                       task_unlock(t);
}
                                     }
                                   }
                                 }
task_lock(A);
A->mempolicy = pol;
task_unlock(A);
In this case A's pol->nodes is computed from the old mems_allowed, and
could be inconsistent with A's new mems_allowed.
It is different when replacing vmas' policy: pol->nodes only goes
wild when current_cpuset_is_being_rebound():
A (mbind)                        B (echo xx > cpuset.mems)
-------------------------------------------------------
pol = mpol_new();
mmap_write_lock(A->mm);
                                 cpuset_being_rebound = cpusetA;
                                 update_tasks_nodemask(cpusetA) {
                                   foreach t in cpusetA {
                                     cpuset_change_task_nodemask(t) {
mpol_set_nodemask(pol) {
                                       task_lock(t); // t could be A
  mask = f(A->mems_allowed);
                                       update t->mems_allowed;
  pol.create(pol, mask);
                                       task_unlock(t);
}
                                     }
foreach v in A->mm {
  if (cpuset_being_rebound == cpusetA)
    pol.rebind(pol, cpuset.mems);
  v->vma_policy = pol;
}
mmap_write_unlock(A->mm);
                                     mmap_write_lock(t->mm);
                                     mpol_rebind_mm(t->mm);
                                     mmap_write_unlock(t->mm);
                                   }
                                 }
                                 cpuset_being_rebound = NULL;
In this case, the cpuset.mems, which has already been updated, is finally
used for calculating pol->nodes, rather than A->mems_allowed. So it is OK
to call mpol_set_nodemask() with alloc_lock unlocked when doing mbind(2).
mm/cma_debug: show complete cma name in debugfs directories
Currently only 12 characters of the CMA name are used for the debug
directories, whereas the CMA name can be up to CMA_MAX_NAME (= 64)
characters long. A side problem with this is that two CMA areas sharing
the first 12 characters would end up trying to create directories with the
same name, which fails with -EEXIST and thus can limit the CMA debug
functionality.
The 'cma-' prefix was used initially, when CMA areas didn't have any names
and were represented by simple integer values. Since each CMA area now has
its own name, drop the 'cma-' prefix for the CMA debug directories;
creating them under /sys/kernel/debug/cma/ already makes it evident that
they are for CMA debugging.
pool->size_class array elements can't be NULL, so this check
is not needed.
Throughout the code, we assign pool->size_class[i] values that are
not NULL. Releasing memory for these values occurs in the
zs_destroy_pool() function, which also releases and destroys the pool.
In addition, zs_stats_size_show() and async_free_zspage(), which
iterate over the array in a similar way, don't check it for a NULL
pointer.
Alexey Romanov [Thu, 11 Aug 2022 15:37:54 +0000 (18:37 +0300)]
zsmalloc: zs_object_copy: add clarifying comment
Patch series "tidy up zsmalloc implementation"
This patchset removes some unnecessary checks and adds a clarifying
comment. While analysing the zs_object_copy() function code, I spent some
time understanding what the kunmap_atomic(d_addr) call is for. It seems
that this point is not trivial and is worth a comment.
This patch (of 2):
It's not obvious why the kunmap_atomic(d_addr) call is needed.
Axel Rasmussen [Mon, 8 Aug 2022 17:56:14 +0000 (10:56 -0700)]
selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
This new mode was recently added to the userfaultfd selftest. We want to
exercise both userfaultfd(2) and /dev/userfaultfd, so add both
test cases to the script.
Axel Rasmussen [Mon, 8 Aug 2022 17:56:12 +0000 (10:56 -0700)]
userfaultfd: selftests: modify selftest to use /dev/userfaultfd
We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep
working into the future, so just run the test twice, using each interface.
Instead of always testing both userfaultfd(2) and /dev/userfaultfd, let
the user choose which to test.
As with other test features, change the behavior based on a new command
line flag. Introduce the idea of "test mods", which are generic (not
specific to a test type) modifications to the behavior of the test. This
is sort of borrowed from this RFC patch series [1], but simplified a bit.
The benefit is, in "typical" configurations this test is somewhat slow
(say, 30sec or something). Testing both clearly doubles it, so it may not
always be desirable, as users are likely to use one or the other, but
never both, in the "real world".
Axel Rasmussen [Mon, 8 Aug 2022 17:56:11 +0000 (10:56 -0700)]
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in commit 37cd0575b8 ("userfaultfd: add UFFD_USER_MODE_ONLY")
we changed things so that, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
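A minimal userspace sketch of creating a userfaultfd through the new device, assuming the USERFAULTFD_IOC_NEW ioctl added by this series and new-enough uapi headers:
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Create a userfaultfd through the misc device instead of userfaultfd(2);
 * access is controlled by ordinary file permissions on /dev/userfaultfd. */
static int uffd_open_dev(int flags)
{
        int fd, uffd;

        fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
        if (fd < 0)
                return -1;

        /* The returned userfaultfd can handle kernel faults without
         * CAP_SYS_PTRACE or the unprivileged_userfaultfd sysctl. */
        uffd = ioctl(fd, USERFAULTFD_IOC_NEW, flags);
        close(fd);
        return uffd;
}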
Axel Rasmussen [Mon, 8 Aug 2022 17:56:10 +0000 (10:56 -0700)]
selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
Patch series "userfaultfd: add /dev/userfaultfd for fine grained access
control", v7.
Why not ...?
============
- Why not /proc/[pid]/userfaultfd? Two main points (additional discussion [1]):
- /proc/[pid]/* files are all owned by the user/group of the process, and
they don't really support chmod/chown. So, without extending procfs it
doesn't solve the problem this series is trying to solve.
- The main argument *for* this was to support creating UFFDs for remote
processes. But, that use case clearly calls for CAP_SYS_PTRACE, so to
support this we could just use the UFFD syscall as-is.
- Why not use a syscall? Access to syscalls is generally controlled by
capabilities. We don't have a capability which is used for userfaultfd access
without also granting more / other permissions as well, and adding a new
capability was rejected [2].
- It's possible a LSM could be used to control access instead, but I have
some concerns. I don't think this approach would be as easy to use,
particularly if we were to try to solve this with something heavyweight
like SELinux. Maybe we could pursue adding a new LSM specifically for
this user case, but it may be too narrow of a case to justify that.
This not being included was just a simple oversight. There are certain
features (like minor fault support) which are only enabled on shared
mappings, so without including hugetlb_shared we actually lose a
significant amount of test coverage.
Kenneth Lee [Mon, 8 Aug 2022 22:00:19 +0000 (15:00 -0700)]
mm/damon/dbgfs: use kmalloc for allocating only one element
Use kmalloc(...) rather than kmalloc_array(1, ...) because the number of
elements we are specifying in this case is 1, kmalloc would accomplish the
same thing and we can simplify.
Since commit 5d1fd5dc877b ("mm,hwpoison: introduce MF_MSG_UNSPLIT_THP"),
the action_result(,MF_MSG_UNSPLIT_THP,) call shows the memory error event
in memory_failure(), so the pr_info() in try_to_split_thp_page() is only
needed in soft_offline_in_use_page().
Meanwhile this also fixes the unexpected prefix for "thp split failed"
caused by commit 96f96763de26 ("mm: memory-failure: convert to pr_fmt()").
Rik van Riel [Tue, 9 Aug 2022 18:24:57 +0000 (14:24 -0400)]
mm: align larger anonymous mappings on THP boundaries
Align larger anonymous memory mappings on THP boundaries by going through
thp_get_unmapped_area if THPs are enabled for the current process.
With this patch, larger anonymous mappings are now THP aligned. When a
malloc library allocates a 2MB or larger arena, that arena can now be
mapped with THPs right from the start, which can result in better TLB hit
rates and execution time.
selftests/vm: add selftest to verify multi THP collapse
Add support to allocate and verify collapse of multiple hugepage-sized
regions into multiple THPs.
Add "nr" argument to check_huge() that instructs check_huge() to check for
exactly "nr_hpages" THPs. This has the added benefit of now being able to
check for exactly 0 THPs, and so callsites that previously checked the
negation of exactly 1 THP are now more correct.
->collapse struct collapse_context hook has been expanded with a
"nr_hpages" argument to collapse "nr_hpages" hugepages. The
collapse_full() test has been repurposed to collapse 4 THPs at once. It
is expected more tests will want to test multi THP collapse (e.g.
file/shmem).
This is of particular benefit to madvise collapse context given that it
may do many THP collapses during a single syscall.
selftests/vm: add MADV_COLLAPSE collapse context to selftests
Add madvise collapse context to hugepage collapse selftests. This context
is tested with /sys/kernel/mm/transparent_hugepage/enabled set to "never"
in order to avoid unwanted interaction with khugepaged during testing.
Also, refactor updates to sysfs THP settings using a stack so that the THP
settings from nested callers can be restored.
Modularize the collapse action of khugepaged collapse selftests by
introducing a struct collapse_context which specifies how to collapse a
given memory range and the expected semantics of the collapse. This can
be reused later to test other collapse contexts.
Additionally, all tests have logic that checks if a collapse occurred via
reading /proc/self/smaps, and report if this is different than expected.
Move this logic into the per-context ->collapse() hook instead of
repeating it in every test.
mm/madvise: add MADV_COLLAPSE to process_madvise()
Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting collapse of its own memory.
This is useful for the development of userspace agents that seek to
optimize THP utilization system-wide by using userspace signals to
prioritize what memory is most deserving of being THP-backed.
mm/khugepaged: rename prefix of shared collapse functions
The following functions are shared between khugepaged and madvise collapse
contexts. Replace the "khugepaged_" prefix with generic "hpage_collapse_"
prefix in such cases:
Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
a synchronous collapse of memory at their own expense.
The benefits of this approach are:
* CPU is charged to the process that wants to spend the cycles for the
THP
* Avoid unpredictable timing of khugepaged collapse
Semantics
This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
from the others. This implies a hugepage cannot cross a VMA boundary. If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of the memory specified.
The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range. The memory ranges must span at least one
hugepage-sized region.
All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage. Unmapped pages will have their data directly
initialized to 0 in the new hugepage. However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).
Allocation for the new hugepage may enter direct reclaim and/or
compaction, regardless of VMA flags. When the system has multiple NUMA
nodes, the hugepage will be allocated from the node providing the most
native pages. This operation acts on the current state of the
specified process and makes no persistent changes or guarantees on how
pages will be mapped, constructed, or faulted in the future.
Return Value
If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful. On success, process_madvise(2)
returns the number of bytes advised, and madvise(2) returns 0. Else, -1
is returned and errno is set to indicate the error for the most-recently
attempted hugepage collapse. Note that many failures might have occurred,
since the operation may continue to collapse in the event a single
hugepage-sized/aligned region fails.
ENOMEM Memory allocation failed or VMA not found
EBUSY Memcg charging failed
EAGAIN Required resource temporarily unavailable. Try again
might succeed.
EINVAL Other error: no PMD found, subpage doesn't have the Present
       bit set, "special" page not backed by struct page, VMA
       incorrectly sized, address not page-aligned, ...
Most notable here is ENOMEM and EBUSY (new to madvise) which are intended
to provide the caller with actionable feedback so they may take an
appropriate fallback measure.
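A minimal usage sketch, assuming headers new enough to define MADV_COLLAPSE (the fallback value below follows the asm-generic uapi definition as I understand it and is an assumption; check your headers):
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* assumed asm-generic value */
#endif

/* Ask the kernel to synchronously collapse [addr, addr + len) into THPs. */
static int try_collapse(void *addr, size_t len)
{
        /* The range must be page-aligned and span at least one hugepage-
         * sized region that is already partly backed by memory. */
        if (madvise(addr, len, MADV_COLLAPSE)) {
                perror("madvise(MADV_COLLAPSE)");
                return -1;
        }
        return 0;       /* all covered regions are now (or already were) PMD-mapped */
}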
Use Cases
An immediate user of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED;
zapping the pmd. Later, when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
coverage and dTLB performance. TCMalloc is such an implementation that
could benefit from this[2].
Only privately-mapped anon memory is supported for now, but additional
support for file, shmem, and HugeTLB high-granularity mappings[2] is
expected. File and tmpfs/shmem support would permit:
* Backing executable text by THPs. Current support provided by
CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
might impair services from serving at their full rated load after
(re)starting. Tricks like mremap(2)'ing text onto anonymous memory to
immediately realize iTLB performance prevents page sharing and demand
paging, both of which increase steady state memory footprint. With
MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
and lower RAM footprints.
* Backing guest memory by hugepages after the memory contents have been
migrated in native-page-sized chunks to a new host, in a
userfaultfd-based live-migration stack.
mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages. This is required by MADV_COLLAPSE which necessarily needs to know
what hugepage-aligned/sized regions are already pmd-mapped.
In order to determine if a pmd already maps a hugepage, refactor
mm_find_pmd():
Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
fix buggy race with THP fault") behavior. ksm was the only caller that
explicitly wanted a pte-mapping pmd, so open-code the pte-mapping logic
there (pmd_present() and pmd_trans_huge() checks).
Undo revert change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy
race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
and use mm_find_pmd() instead.
mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
hugepage_vma_check() is the authority on determining if a VMA is eligible
for THP allocation/collapse, and currently enforces the sysfs THP
settings. Add a flag to disable these checks. For now, only apply this
arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled.
We can expand this to shmem, which uses
/sys/kernel/transparent_hugepage/shmem_enabled, later.
Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check
THP flag in hugepage_vma_check()", this check also didn't check "never"
THP mode. As such, this restores the previous behavior of
collapse_pte_mapped_thp() where sysfs THP settings are ignored. See
comment in code for justification why this is OK.
mm/khugepaged: add flag to predicate khugepaged-only behavior
Add .is_khugepaged flag to struct collapse_control so khugepaged-specific
behavior can be elided by MADV_COLLAPSE context.
Start by protecting khugepaged-specific heuristics by this flag. In
MADV_COLLAPSE, the user presumably has reason to believe the collapse will
be beneficial and khugepaged heuristics shouldn't prevent the user from
doing so:
mm/khugepaged: propagate enum scan_result codes back to callers
Propagate enum scan_result codes back through return values of
functions downstream of khugepaged_scan_file() and
khugepaged_scan_pmd() to inform callers if the operation was
successful, and if not, why.
Since khugepaged_scan_pmd()'s return value already has a specific meaning
(whether mmap_lock was unlocked or not), add a bool* argument to
khugepaged_scan_pmd() to retrieve this information.
Change khugepaged to take action based on the return values of
khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting deep
within the collapsing functions themselves.
hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be more
consistent with enum scan_result propagation.
Remove dependency on error pointers to communicate to khugepaged that
allocation failed and it should sleep; instead just use the result of the
scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
The following code, currently duplicated in collapse_huge_page() and
collapse_file(), handles the allocation and memcg charge:
        new_page = khugepaged_alloc_page(hpage, gfp, node);
        if (!new_page) {
                result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                goto out;
        }
        if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                result = SCAN_CGROUP_CHARGE_FAIL;
                goto out;
        }
        count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
Also, "node" is passed as an argument to both collapse_huge_page() and
collapse_file() and obtained the same way, via
khugepaged_find_target_node().
Move all this into a new helper, alloc_charge_hpage(), and remove the
duplicate code from collapse_huge_page() and collapse_file(). Also,
simplify khugepaged_alloc_page() by returning a bool indicating allocation
success instead of a copy of the allocated struct page *.
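A hedged sketch of what the consolidated helper might look like, based only on the description above (the signature, the gfp handling and the khugepaged_find_target_node(cc) call are assumptions; the actual code in mm/khugepaged.c may differ):
static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
                              struct collapse_control *cc)
{
        gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
        int node = khugepaged_find_target_node(cc);

        /* Replaces the duplicated block quoted earlier. */
        if (!khugepaged_alloc_page(hpage, gfp, node))
                return SCAN_ALLOC_HUGE_PAGE_FAIL;
        if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
                return SCAN_CGROUP_CHARGE_FAIL;
        count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
        return SCAN_SUCCEED;
}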
Modularize hugepage collapse by introducing struct collapse_control. This
structure serves to describe the properties of the requested collapse, as
well as serve as a local scratch pad to use during the collapse itself.
Start by moving the global per-node khugepaged statistics into this new
structure. Note that this structure is still statically allocated since
CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating a
MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
Yang Shi [Wed, 6 Jul 2022 23:59:20 +0000 (16:59 -0700)]
mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
Patch series "mm: userspace hugepage collapse", v7.
Introduction
--------------------------------
This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.
This idea was introduced by David Rientjes[5].
Interface
--------------------------------
The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.
process_madvise(2)
Performs a synchronous collapse of the native pages
mapped by the list of iovecs into transparent hugepages.
This operation is independent of the system THP sysfs settings,
but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.
THP allocation may enter direct reclaim and/or compaction.
When a range spans multiple VMAs, the semantics of the collapse
over each VMA are independent from the others.
Caller must have CAP_SYS_ADMIN if not acting on self.
Return value follows existing process_madvise(2) conventions. A
“success” indicates that all hugepage-sized/aligned regions
covered by the provided range were either successfully
collapsed, or were already pmd-mapped THPs.
madvise(2)
Equivalent to process_madvise(2) on self, with 0 returned on
“success”.
Current Use-Cases
--------------------------------
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system which might impair services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevents
page sharing and demand paging, both of which increase steady state
memory footprint. With MADV_COLLAPSE, we get the best of both
worlds: Peak upfront performance and lower RAM footprints. Note
that subsequent support for file-backed memory is required here.
(2) malloc() implementations that manage memory in hugepage-sized
chunks, but sometimes subrelease memory back to the system in
native-sized chunks via MADV_DONTNEED; zapping the pmd. Later,
when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
hugepage coverage and dTLB performance. TCMalloc is such an
implementation that could benefit from this[6]. A prior study of
Google internal workloads during evaluation of Temeraire, a
hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
all cpu cycles were spent in dTLB stalls, and that increasing
hugepage coverage by even small amount can help with that[7].
(3) userfaultfd-based live migration of virtual machines satisfy UFFD
faults by fetching native-sized pages over the network (to avoid
latency of transferring an entire hugepage). However, after guest
memory has been fully copied to the new host, MADV_COLLAPSE can
be used to immediately increase guest performance. Note that
subsequent support for file/shmem-backed memory is required here.
(4) HugeTLB high-granularity mapping allows a HugeTLB page to
    be mapped at different levels in the page tables[8]. As it's not
    "transparent" like THP, HugeTLB high-granularity mappings require
    an explicit user API. It is intended that MADV_COLLAPSE be co-opted
    for this use case[9]. Note that subsequent support for HugeTLB
    memory is required here.
Future work
--------------------------------
Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.
One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps. For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].
This patch (of 17):
khugepaged has an optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated-but-failed-to-collapse huge page to
the next loop. CONFIG_NUMA doesn't do so since the next loop may try to
collapse a huge page from a different node, so it doesn't make too much
sense to carry it.
But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, which means a huge page may be allocated
even though there is no suitable range for collapsing. The page would then
just be freed if khugepaged has already made enough progress. This
could make a NUMA=n run have 5 times as many thp_collapse_alloc events as a
NUMA=y run. The many pointless THP allocations actually make things worse
and render the optimization pointless.
This could be fixed by carrying the huge page across scans, but it will
complicate the code further and the huge page may be carried indefinitely.
But if we take one step back, the optimization itself seems not worth
keeping nowadays since:
* Not too many users build NUMA=n kernels nowadays, even though the kernel
is actually running on a non-NUMA machine. Some small devices may run a
NUMA=n kernel, but I don't think they actually use THP.
* Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists"), THP could be cached by pcp. This actually
somehow does the job done by the optimization.
Binyi Han [Sun, 4 Sep 2022 07:46:47 +0000 (00:46 -0700)]
mm: fix dereferencing possible ERR_PTR
The Smatch checker complains that 'secretmem_mnt' may be dereferencing a
possible ERR_PTR(). Let the function return if 'secretmem_mnt' is an
ERR_PTR, to avoid dereferencing it.
The recent folio conversion changed the VM_BUG_ON() to dump the folio
we're storing instead of the entry we retrieved. This was a mistake;
the entry we retrieved is the more interesting page to dump.
mm/damon/dbgfs: fix memory leak when using debugfs_lookup()
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time. Fix this up by properly calling
dput().
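A small kernel-context sketch of the pattern, for illustration (the actual call sites are in mm/damon/dbgfs.c):
#include <linux/dcache.h>
#include <linux/debugfs.h>

static void dbgfs_use_and_release(const char *name, struct dentry *parent)
{
        struct dentry *dentry = debugfs_lookup(name, parent);

        if (!dentry)
                return;
        /* ... use the looked-up dentry ... */
        dput(dentry);   /* drop the reference debugfs_lookup() took */
}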
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte the dirty bit is supposed to be carried over to
the underlying page to prevent it being lost.
Currently migrate_vma_*() can only be used for private anonymous mappings.
That means loss of the dirty bit usually doesn't result in data loss
because these pages are typically not file-backed. However pages may be
backed by swap storage which can result in data loss if an attempt is made
to migrate a dirty page that doesn't yet have the PageDirty flag set.
In this case migration will fail due to unexpected references but the
dirty pte bit will be lost. If the page is subsequently reclaimed data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.
Prevent this by copying the dirty bit to the page when removing the pte to
match what try_to_migrate_one() does.
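A hedged sketch of the fix described above, in kernel context while the PTL is held (the helper name is illustrative; the exact code in mm/migrate_device.c may differ):
#include <linux/mm.h>
#include <linux/pgtable.h>

/* Clear a migrating PTE without losing its dirty bit; PTL must be held. */
static void clear_migrating_pte(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, struct page *page)
{
        pte_t pte = ptep_get_and_clear(mm, addr, ptep);

        /* Transfer the hardware dirty bit to the page so a later reclaim
         * writes the data back to swap, matching try_to_migrate_one(). */
        if (pte_dirty(pte))
                folio_mark_dirty(page_folio(page));
}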
When clearing a PTE the TLB should be flushed whilst still holding the PTL
to avoid a potential race with madvise/munmap/etc. For example consider
the following sequence:
CPU0                                 CPU1
----                                 ----
migrate_vma_collect_pmd()
pte_unmap_unlock()
                                     madvise(MADV_DONTNEED)
                                     -> zap_pte_range()
                                        pte_offset_map_lock()
                                        [ PTE not present, TLB not flushed ]
                                        pte_unmap_unlock()
                                     [ page is still accessible via stale TLB ]
flush_tlb_range()
In this case the page may still be accessed via the stale TLB entry after
madvise returns. Fix this by flushing the TLB while holding the PTL.
Naohiro Aota [Wed, 24 Aug 2022 08:47:26 +0000 (17:47 +0900)]
x86/mm: disable instrumentations of mm/pgprot.c
Commit 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform")
moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c. As
a result, the accesses are now targets of KASAN (and other
instrumentations), leading to the crash during the boot process.
Disable the instrumentations for pgprot.c like commit 67bb8e999e0a
("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and
mm/tlb.c").
Before this patch, my AMD machine could not boot since v6.0-rc1 with KASAN
enabled, with nothing printed. After the change, it boots up
successfully.
Dan Williams [Fri, 26 Aug 2022 17:18:14 +0000 (10:18 -0700)]
mm/memory-failure: fall back to vma_address() when ->notify_failure() fails
In the case where a filesystem is polled to take over the memory failure
and receives -EOPNOTSUPP it indicates that page->index and page->mapping
are valid for reverse mapping the failure address. Introduce
FSDAX_INVALID_PGOFF to distinguish when add_to_kill() is being called from
mf_dax_kill_procs() by a filesystem vs the typical memory_failure() path.
Otherwise, vma_pgoff_address() is called with an invalid fsdax_pgoff which
then trips this failing signature:
kernel BUG at mm/memory-failure.c:319!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 13 PID: 1262 Comm: dax-pmd Tainted: G OE N 6.0.0-rc2+ #62
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:add_to_kill.cold+0x19d/0x209
[..]
Call Trace:
<TASK>
collect_procs.part.0+0x2c4/0x460
memory_failure+0x71b/0xba0
? _printk+0x58/0x73
do_madvise.part.0.cold+0xaf/0xc5
Dan Williams [Fri, 26 Aug 2022 17:17:54 +0000 (10:17 -0700)]
xfs: quiet notify_failure EOPNOTSUPP cases
Patch series "mm, xfs, dax: Fixes for memory_failure() handling".
I failed to run the memory error injection section of the ndctl test suite
on linux-next prior to the merge window and as a result some bugs were
missed. While the new enabling targeted reflink enabled XFS filesystems
the bugs cropped up in the surrounding cases of DAX error injection on
ext4-fsdax and device-dax.
One new assumption / clarification in this set is the notion that if a
filesystem's ->notify_failure() handler returns -EOPNOTSUPP, then it must
be the case that the fsdax usage of page->index and page->mapping are
valid. I am fairly certain this is true for xfs_dax_notify_failure(), but
would appreciate another set of eyes.
This patch (of 4):
XFS always registers dax_holder_operations regardless of whether the
filesystem is capable of handling the notifications. The expectation is
that if the notify_failure handler cannot run then there are no scenarios
where it needs to run. In other words the expected semantic is that
page->index and page->mapping are valid for memory_failure() when the
conditions that cause -EOPNOTSUPP in xfs_dax_notify_failure() are present.
A fallback to the generic memory_failure() path is expected so do not warn
when that happens.