Lorenzo Stoakes [Wed, 11 Oct 2023 17:04:28 +0000 (18:04 +0100)]
mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.
mprotect() and other functions which change VMA parameters over a range
each employ a pattern of:-
1. Attempt to merge the range with adjacent VMAs.
2. If this fails, and the range spans a subset of the VMA, split it
accordingly.
This is open-coded and duplicated in each case. Also in each case most of
the parameters passed to vma_merge() remain the same.
Create a new function, vma_modify(), which abstracts this operation,
accepting only those parameters which can be changed.
To avoid the mess of invoking each function call with unnecessary
parameters, create inline wrapper functions for each of the modify
operations, parameterised only by what is required to perform the action.
We can also significantly simplify the logic - by returning the VMA if we
split (or merged VMA if we do not) we no longer need specific handling for
merge/split cases in any of the call sites.
Note that the userfaultfd_release() case works even though it does not
split VMAs - since start is set to vma->vm_start and end is set to
vma->vm_end, the split logic does not trigger.
In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start
- vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this
instance, this invocation will remain unchanged.
We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
vma_merge() correctly ensures that flags remain the same, something that is
already checked in is_mergeable_vma() and elsewhere, and in any case is not
specific to mprotect().
Lorenzo Stoakes [Wed, 11 Oct 2023 17:04:27 +0000 (18:04 +0100)]
mm: move vma_policy() and anon_vma_name() decls to mm_types.h
Patch series "Abstract vma_merge() and split_vma()", v4.
The vma_merge() interface is very confusing and its implementation has led
to numerous bugs as a result of that confusion.
In addition there is duplication both in invocation of vma_merge(), but
also in the common mprotect()-style pattern of attempting a merge, then if
this fails, splitting the portion of a VMA about to have its attributes
changed.
This pattern has been copy/pasted around the kernel in each instance where
such an operation has been required, each very slightly modified from the
last to make it even harder to decipher what is going on.
Simplify the whole thing by dividing the actual uses of vma_merge() and
split_vma() into specific and abstracted functions and de-duplicate the
vma_merge()/split_vma() pattern altogether.
Doing so also opens the door to changing how vma_merge() is implemented -
by knowing precisely what cases a caller is invoking rather than having a
central interface where anything might happen we can untangle the brittle
and confusing vma_merge() implementation into something more workable.
For mprotect()-like cases we introduce vma_modify() which performs the
vma_merge()/split_vma() pattern, returning a pointer to either the merged
or split VMA or an ERR_PTR(err) if the splits fail.
We provide a number of inline helper functions to make things even clearer:-
* vma_modify_flags() - Prepare to modify the VMA's flags.
* vma_modify_flags_name() - Prepare to modify the VMA's flags/anon_vma_name
* vma_modify_policy() - Prepare to modify the VMA's mempolicy.
* vma_modify_flags_uffd() - Prepare to modify the VMA's flags/uffd context.
For cases where a new VMA is attempted to be merged with adjacent VMAs we
add:-
* vma_merge_new_vma() - Prepare to merge a new VMA.
* vma_merge_extend() - Prepare to extend the end of a new VMA.
This patch (of 5):
The vma_policy() define is a helper specifically for a VMA field so it
makes sense to host it in the memory management types header.
The anon_vma_name(), anon_vma_name_alloc() and anon_vma_name_free()
functions are a little out of place in mm_inline.h as they define external
functions, and so it makes sense to locate them in mm_types.h.
The purpose of these relocations is to make it possible to abstract static
inline wrappers which invoke both of these helpers.
The original problem of the overly long list of waiters on a locked page
was solved properly by commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated"). In the meantime,
using bookmarks for the writeback bit can cause livelocks, so we need to
stop using them.
Lorenzo Stoakes [Tue, 10 Oct 2023 07:25:02 +0000 (08:25 +0100)]
mm/mprotect: allow unfaulted VMAs to be unaccounted on mprotect()
When mprotect() is used to make unwritable VMAs writable, they have the
VM_ACCOUNT flag applied and memory accounted accordingly.
If the VMA has had no pages faulted in and is then made unwritable once
again, it will remain accounted for, despite not being capable of
extending memory usage.
The first mprotect() splits the range into 3 VMAs and the second fails to
merge the three as the middle VMA has VM_ACCOUNT set and the others do
not, rendering them unmergeable.
This is unnecessary, since no pages have actually been allocated and the
middle VMA is not capable of utilising more memory, thereby introducing
unnecessary VMA fragmentation (and accounting for more memory than is
necessary).
Since we cannot efficiently determine which pages map to an anonymous VMA,
we have to be very conservative - determining whether any pages at all
have been faulted in, by checking whether vma->anon_vma is NULL.
We can see that the lack of anon_vma implies that no anonymous pages are
present as evidenced by vma_needs_copy() utilising this on fork to
determine whether page tables need to be copied.
The only place where anon_vma is set NULL explicitly is on fork with
VM_WIPEONFORK set, however since this flag is intended to cause the child
process to not CoW on a given memory range, it is right to interpret this
as indicating the VMA has no faulted-in anonymous memory mapped.
If the VMA was forked without VM_WIPEONFORK set, then anon_vma_fork() will
have ensured that a new anon_vma is assigned (and correctly related to its
parent anon_vma) should any pages be CoW-mapped.
The overall operation is safe against races as we hold a write lock against
mm->mmap_lock.
If we could efficiently look up the VMA's faulted-in pages then we would
unaccount all those pages not yet faulted in. However as the original
comment alludes this simply isn't currently possible, so we are
conservative and account all pages or none at all.
merged "can we have THPs in this VMA?" logic that was previously done
separately by fault-path, khugepaged, and smaps "THPeligible" checks.
During the process, the semantics of the fault path check changed in two
ways:
1) A VM_NO_KHUGEPAGED check was introduced (also added to smaps path).
2) We no longer checked if non-anonymous memory had a vm_ops->huge_fault
handler that could satisfy the fault. Previously, this check had been
done in create_huge_pud() and create_huge_pmd() routines, but after
the changes, we never reach those routines.
During the review of the above commits, it was determined that in-tree
users weren't affected by the change; most notably, since the only
relevant user (in terms of THP) of VM_MIXEDMAP or ->huge_fault is DAX,
which is explicitly approved early in approval logic. However, this was a
bad assumption to make as it assumes the only reason to support
->huge_fault was for DAX (which is not true in general).
Remove the VM_NO_KHUGEPAGED check when not in collapse path and give any
->huge_fault handler a chance to handle the fault. Note that we don't
validate the file mode or mapping alignment, which is consistent with the
behavior before the aforementioned commits.
Nhat Pham [Fri, 6 Oct 2023 18:46:28 +0000 (11:46 -0700)]
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Nhat Pham [Fri, 6 Oct 2023 18:46:27 +0000 (11:46 -0700)]
memcontrol: only transfer the memcg data for migration
For most migration use cases, only transfer the memcg data from the old
folio to the new folio, and clear the old folio's memcg data. No charging
and uncharging will be done.
This shaves off some work on the migration path, and avoids the temporary
double charging of a folio during its migration.
The only exception is replace_page_cache_folio(), which will use the old
mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
context, the isolation of the old page isn't quite as thorough as with
migration, so we cannot use our new implementation directly.
This patch is the result of the following discussion on the new hugetlb
memcg accounting behavior:
Nhat Pham [Fri, 6 Oct 2023 18:46:26 +0000 (11:46 -0700)]
memcontrol: add helpers for hugetlb memcg accounting
Patch series "hugetlb memcg accounting", v4.
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etcetera fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.
This patch (of 4):
This patch exposes charge committing and cancelling as parts of the memory
controller interface. These functionalities are useful when the
try_charge() and commit_charge() stages have to be separated by other
actions in between (which can fail). One such example is the new hugetlb
accounting behavior in the following patch.
The patch also adds a helper function to obtain a reference to the
current task's memcg.
Originally, hugetlb_cgroup was the only hugetlb user of tail page
structure fields. So, the code defined and checked against
HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use.
However, by now, tail page #2 is used to store hugetlb hwpoison and
subpool information as well. In other words, without that tail page
hugetlb doesn't work.
Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and
checks against it. Instead, just check for the minimum viable page order
at hstate creation time.
mm: use folio_xor_flags_has_waiters() in folio_end_writeback()
Match how folio_unlock() works by combining the test for PG_waiters with
the clearing of PG_writeback. This should have a small performance win,
and removes the last user of folio_wake().
Rather than check the result of test-and-clear, just check that we have
the writeback bit set at the start. This wouldn't catch every case, but
it's good enough (and enables the next patch).
Optimise folio_end_read() by setting the uptodate bit at the same time we
clear the unlock bit. This saves at least one memory barrier and one
write-after-write hazard.
powerpc: implement arch_xor_unlock_is_negative_byte on 32-bit
Simply remove the ifdef. The assembly is identical to that in the
non-optimised case of test_and_clear_bits() on PPC32, and it's not clear
to me how the PPC32 optimisation works, nor whether it would work for
arch_xor_unlock_is_negative_byte(). If that optimisation would work,
someone can implement it later, but this is more efficient than the
implementation in filemap.c.
Using EOR to clear the guaranteed-to-be-set lock bit will test the
negative flag just like the x86 implementation. This should be more
efficient than the generic implementation in filemap.c. It would be
better if m68k had __GCC_ASM_FLAG_OUTPUTS__.
Coldfire doesn't have a byte-sized EOR, so we test bit 7 after the EOR,
which is a second memory access, but it's slightly better than the current
C code.
Replace clear_bit_and_unlock_is_negative_byte() with
xor_unlock_is_negative_byte(). We have a few places that like to lock a
folio, set a flag and unlock it again. Allow for the possibility of
combining the latter two operations for efficiency. We are guaranteed
that the caller holds the lock, so it is safe to unlock it with the xor.
The caller must guarantee that nobody else will set the flag without
holding the lock; it is not safe to do this with the PG_dirty flag, for
example.
iomap: hold state_lock over call to ifs_set_range_uptodate()
Patch series "Add folio_end_read", v2.
The core of this patchset is the new folio_end_read() call which
filesystems can use when finishing a page cache read instead of separate
calls to mark the folio uptodate and unlock it. As an illustration of its
use, I converted ext4, iomap & mpage; more can be converted.
I think that's useful by itself, but the interesting optimisation is that
we can implement that with a single XOR instruction that sets the uptodate
bit, clears the lock bit, tests the waiter bit and provides a write memory
barrier. That removes one memory barrier and one atomic instruction from
each page read, which seems worth doing. That's in patch 15.
The last two patches could be a separate series, but basically we can do
the same thing with the writeback flag that we do with the unlock flag;
clear it and test the waiters bit at the same time.
This patch (of 17):
This is really preparation for the next patch, but it lets us call
folio_mark_uptodate() in just one place instead of two.
Breno Leitao [Thu, 5 Oct 2023 16:39:22 +0000 (09:39 -0700)]
selftests/mm: add a new test for madv and hugetlb
Create a selftest that exercises the race between page faults and
madvise(MADV_DONTNEED) in the same huge page. Do it by running two
threads that touches the huge page and madvise(MADV_DONTNEED) at the same
time.
In case of a SIGBUS coming at pagefault, the test should fail, since we
hit the bug.
The test doesn't have a signal handler, and if it fails, it fails like
the following
----------------------------------
running ./hugetlb_fault_after_madv
----------------------------------
./run_vmtests.sh: line 186: 595563 Bus error (core dumped) "$@"
[FAIL]
This selftest goes together with the fix of the bug[1] itself.
Breno Leitao [Thu, 5 Oct 2023 16:39:21 +0000 (09:39 -0700)]
selftests/mm: export get_free_hugepages()
Patch series "New selftest for mm", v2.
This is a simple test case that reproduces an mm problem[1], where a page
fault races with madvise(), and it is not trivial to reproduce and debug.
This test-case aims to avoid such race problems from happening again,
impacting workloads that leverages external allocators, such as tcmalloc,
jemalloc, etc.
Pankaj Raghav [Fri, 6 Oct 2023 11:01:20 +0000 (13:01 +0200)]
filemap: call filemap_get_folios_tag() from filemap_get_folios()
filemap_get_folios() is filemap_get_folios_tag() with XA_PRESENT as the
tag that is being matched. Return filemap_get_folios_tag() with
XA_PRESENT as the tag instead of duplicating the code in
filemap_get_folios().
Kemeng Shi [Wed, 27 Sep 2023 10:35:13 +0000 (18:35 +0800)]
mm/page_alloc: remove unnecessary check in break_down_buddy_pages
Patch series "Two minor cleanups to break_down_buddy_pages", v2.
Two minor cleanups to break_down_buddy_pages.
This patch (of 2):
1. We always have target in range started with next_page and full free
range started with current_buddy.
2. The last split range size is 1 << low and low should be >= 0, then
size >= 1. So page + size != page is always true (because size > 0).
As summary, current_page will not equal to target page.
When tracing through the code in vma_merge(), it was not completely
clear why the error return to a dup_anon_vma() call would not overwrite
a previous attempt to the same function. This commit adds a comment
specifying why it is safe.
Lorenzo Stoakes [Mon, 2 Oct 2023 23:14:54 +0000 (00:14 +0100)]
mm/gup: adapt get_user_page_vma_remote() to never return NULL
get_user_pages_remote() will never return 0 except in the case of
FOLL_NOWAIT being specified, which we explicitly disallow.
This simplifies error handling for the caller and avoids the awkwardness
of dealing with both errors and failing to pin. Failing to pin here is an
error.
Lorenzo Stoakes [Mon, 2 Oct 2023 23:14:51 +0000 (00:14 +0100)]
mm: make __access_remote_vm() static
Patch series "various improvements to the GUP interface", v2.
A series of fixes to simplify and improve the GUP interface with an eye to
providing groundwork to future improvements:-
* __access_remote_vm() and access_remote_vm() are functionally identical,
so make the former static such that in future we can potentially change
the external-facing implementation details of this function.
* Extend is_valid_gup_args() to cover the missing FOLL_TOUCH case, and
simplify things by defining INTERNAL_GUP_FLAGS to check against.
* Adjust __get_user_pages_locked() to explicitly treat a failure to pin any
pages as an error in all circumstances other than FOLL_NOWAIT being
specified, bringing it in line with the nommu implementation of this
function.
* (With many thanks to Arnd who suggested this in the first instance)
Update get_user_page_vma_remote() to explicitly only return a page or an
error, simplifying the interface and avoiding the questionable
IS_ERR_OR_NULL() pattern.
This patch (of 4):
access_remote_vm() passes through parameters to __access_remote_vm()
directly, so remove the __access_remote_vm() function from mm.h and use
access_remote_vm() in the one caller that needs it (ptrace_access_vm()).
This allows future adjustments to the GUP-internal __access_remote_vm()
function while keeping the access_remote_vm() function stable.
memory: move exclusivity detection in do_wp_page() into wp_can_reuse_anon_folio()
Let's clean up do_wp_page() a bit, removing two labels and making it a
easier to read.
wp_can_reuse_anon_folio() now only operates on the whole folio. Move the
SetPageAnonExclusive() out into do_wp_page(). No need to do this under
page lock -- the page table lock is sufficient.
mm/rmap: move SetPageAnonExclusive() out of page_move_anon_rmap()
Patch series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()".
Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the
callers handle PageAnonExclusive. I'm including cleanup patch #3 because
it fits into the picture and can be done cleaner by the conversion.
This patch (of 3):
Let's move it into the caller: there is a difference between whether an
anon folio can only be mapped by one process (e.g., into one VMA), and
whether it is truly exclusive (e.g., no references -- including GUP --
from other processes).
Further, for large folios the page might not actually be pointing at the
head page of the folio, so it better be handled in the caller. This is a
preparation for converting page_move_anon_rmap() to consume a folio.
mm: handle write faults to RO pages under the VMA lock
I think this is a pretty rare occurrence, but for consistency handle
faults with the VMA lock held the same way that we handle other faults
with the VMA lock held.
Most file-backed faults are already handled through ->map_pages(), but if
we need to do I/O we'll come this way. Since filemap_fault() is now safe
to be called under the VMA lock, we can handle these faults under the VMA
lock now.
If the page is not currently present in the page tables, we need to call
the page fault handler to find out which page we're supposed to COW, so we
need to both check that there is already an anon_vma and that the fault
handler doesn't need the mmap_lock.
There are many implementations of ->fault and some of them depend on
mmap_lock being held. All vm_ops that implement ->map_pages() end up
calling filemap_fault(), which I have audited to be sure it does not rely
on mmap_lock. So (for now) key off ->map_pages existing as a flag to
indicate that it's safe to call ->fault while only holding the vma lock.
It is usually safe to call wp_page_copy() under the VMA lock. The only
unsafe situation is when no anon_vma has been allocated for this VMA, and
we have to look at adjacent VMAs to determine if their anon_vma can be
shared. Since this happens only for the first COW of a page in this VMA,
the majority of calls to wp_page_copy() do not need to fall back to the
mmap_sem.
Add vmf_anon_prepare() as an alternative to anon_vma_prepare() which will
return RETRY if we currently hold the VMA lock and need to allocate an
anon_vma. This lets us drop the check in do_wp_page().
mm: make lock_folio_maybe_drop_mmap() VMA lock aware
Patch series "Handle more faults under the VMA lock", v2.
At this point, we're handling the majority of file-backed page faults
under the VMA lock, using the ->map_pages entry point. This patch set
attempts to expand that for the following siutations:
- We have to do a read. This could be because we've hit the point in
the readahead window where we need to kick off the next readahead,
or because the page is simply not present in cache.
- We're handling a write fault. Most applications don't do I/O by writes
to shared mmaps for very good reasons, but some do, and it'd be nice
to not make that slow unnecessarily.
- We're doing a COW of a private mapping (both PTE already present
and PTE not-present). These are two different codepaths and I handle
both of them in this patch set.
There is no support in this patch set for drivers to mark themselves as
being VMA lock friendly; they could implement the ->map_pages
vm_operation, but if they do, they would be the first. This is probably
something we want to change at some point in the future, and I've marked
where to make that change in the code.
There is very little performance change in the benchmarks we've run;
mostly because the vast majority of page faults are handled through the
other paths. I still think this patch series is useful for workloads that
may take these paths more often, and just for cleaning up the fault path
in general (it's now clearer why we have to retry in these cases).
This patch (of 6):
Drop the VMA lock instead of the mmap_lock if that's the one which
is held.
Hugh Dickins [Thu, 12 Oct 2023 04:40:09 +0000 (21:40 -0700)]
percpu_counter: extend _limited_add() to negative amounts
Though tmpfs does not need it, percpu_counter_limited_add() can be twice
as useful if it works sensibly with negative amounts (subs) - typically
decrements towards a limit of 0 or nearby: as suggested by Dave Chinner.
And in the course of that reworking, skip the percpu counter sum if it is
already obvious that the limit would be passed: as suggested by Tim Chen.
Extend the comment above __percpu_counter_limited_add(), defining the
behaviour with positive and negative amounts, allowing negative limits,
but not bothering about overflow beyond S64_MAX.
Percpu counter's compare and add are separate functions: without locking
around them (which would defeat their purpose), it has been possible to
overflow the intended limit. Imagine all the other CPUs fallocating tmpfs
huge pages to the limit, in between this CPU's compare and its add.
I have not seen reports of that happening; but tmpfs's recent addition of
dquot_alloc_block_nodirty() in between the compare and the add makes it
even more likely, and I'd be uncomfortable to leave it unfixed.
Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
I believe this implementation is correct, and slightly more efficient than
the combination of compare and add (taking the lock once rather than twice
when nearing full - the last 128MiB of a tmpfs volume on a machine with
128 CPUs and 4KiB pages); but it does beg for a better design - when
nearing full, there is no new batching, but the costly percpu counter sum
across CPUs still has to be done, while locked.
Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well
as cpu_online_mask: but shouldn't __percpu_counter_compare() and
__percpu_counter_limited_add() then be adding a num_dying_cpus() to
num_online_cpus(), when they calculate the maximum which could be held
across CPUs? But the times when it matters would be vanishingly rare.
shmem: _add_to_page_cache() before shmem_inode_acct_blocks()
There has been a recurring problem, that when a tmpfs volume is being
filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or
EFAULT) even though all allocations were within the permitted size.
This was a problem since early days, but magnified and complicated by the
addition of huge pages. We have often worked around it by adding some
slop to the tmpfs size, but it's hard to say how much is needed, and some
users prefer not to do that e.g. keeping sparse files in a tightly
tailored tmpfs helps to prevent accidental writing to holes.
This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).
Concurrent tasks allocating at the same position could deplete the size
allowance and fail. Doing vm_enough_memory and size checks before the
folio allocation was intentional (to limit the load on the page allocator
from this source) and still has some virtue; but memory cgroup never did
that, so I think it's better reordered to favour predictable behaviour.
1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.
The folio lock held from allocation onwards ensures that the !uptodate
folio cannot be used by others, and can safely be deleted from the cache
if checks 6 or 7 subsequently fail (and those waiting on folio lock
already check that the folio was not truncated once they get the lock);
and the early addition to page cache ensures that racers find it before
they try to duplicate the accounting.
Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
which can be combined inside the new shmem_alloc_and_add_folio(): doing 2
splits twice (once huge, once nonhuge) is not exactly equivalent to trying
5 splits (and giving up early on huge), but let's keep it simple unless
more complication proves necessary.
Userfaultfd is a foreign country: they do things differently there, and
for good reason - to avoid mmap_lock deadlock. Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.
shmem: move memcg charge out of shmem_add_to_page_cache()
Extract shmem's memcg charging out of shmem_add_to_page_cache(): it's
misleading done there, because many calls are dealing with a swapcache
page, whose memcg is nowadays always remembered while swapped out, then
the charge re-levied when it's brought back into swapcache.
Temporarily move it back up to the shmem_get_folio_gfp() level, where the
memcg was charged before v5.8; but the next commit goes on to move it back
down to a new home.
In making this change, it becomes clear that shmem_swapin_folio() does not
need to know the vma, just the fault mm (if any): call it fault_mm rather
than charge_mm - let mem_cgroup_charge() decide whom to charge.
shmem: shmem_acct_blocks() and shmem_inode_acct_blocks()
By historical accident, shmem_acct_block() and shmem_inode_acct_block()
were never pluralized when the pages argument was added, despite their
complements being shmem_unacct_blocks() and shmem_inode_unacct_blocks()
all along. It has been an irritation: fix their naming at last.
shmem: trivial tidyups, removing extra blank lines, etc
Mostly removing a few superfluous blank lines, joining short arglines,
imposing some 80-column observance, correcting a couple of comments. None
of it more interesting than deleting a repeated INIT_LIST_HEAD().
shmem: factor shmem_falloc_wait() out of shmem_fault()
That Trinity livelock shmem_falloc avoidance block is unlikely, and a
distraction from the proper business of shmem_fault(): separate it out.
(This used to help compilers save stack on the fault path too, but both
gcc and clang nowadays seem to make better choices anyway.)
shmem: shrink shmem_inode_info: dir_offsets in a union
Patch series "shmem,tmpfs: general maintenance".
Mostly just cosmetic mods in mm/shmem.c, but the last two enforcing the
"size=" limit better. 8/8 goes into percpu counter territory, and could
stand alone.
This patch (of 8):
Shave 32 bytes off (the 64-bit) shmem_inode_info. There was a 4-byte
pahole after stop_eviction, better filled by fsflags. And the 24-byte
dir_offsets can only be used by directories, whereas shrinklist and
swaplist only by shmem_mapping() inodes (regular files or long symlinks):
so put those into a union. No change in mm/shmem.c is required for this.
Lorenzo Stoakes [Sat, 30 Sep 2023 23:10:29 +0000 (00:10 +0100)]
mm/filemap: clarify filemap_fault() comments for not uptodate case
The existing comments in filemap_fault() suggest that, after either a
minor fault has occurred and filemap_get_folio() found a folio in the page
cache, or a major fault arose and __filemap_get_folio(FGP_CREATE...) did
the job (having relied on do_sync_mmap_readahead() or filemap_read_folio()
to read in the folio), the only possible reason it could not be uptodate
is because of an error.
This is not so, as if, for instance, the fault occurred within a VMA which
had the VM_RAND_READ flag set (via madvise() with the MADV_RANDOM flag
specified), this would cause even synchronous readahead to fail to read in
the folio.
I confirmed this by dropping page caches and faulting in memory
madvise()'d this way, observing that this code path was reached on each
occasion.
Clarify the comments to include this case, and additionally update the
comment recently added around the invalidate lock logic to make it clear
the comment explicitly refers to the minor fault case.
In addition, while we're here, refer to folios rather than pages.
radix tree test suite: fix allocation calculation in kmem_cache_alloc_bulk()
The bulk allocation is iterating through an array and storing enough
memory for the entire bulk allocation instead of a single array entry.
Only allocate an array element of the size set in the kmem_cache.
fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag
Adding fast code paths to handle specifically only get and/or clear
operation of PAGE_IS_WRITTEN, increases its performance by 0-35%. The
results of some test cases are given below:
Test-case-1
t1 = (Get + WP) time
t2 = WP time
t1 t2
Without this patch: 140-170mcs 90-115mcs
With this patch: 110mcs 80mcs
Worst case diff: 35% faster 30% faster
Test-case-2
t3 = atomic Get and WP
t3
Without this patch: 120-140mcs
With this patch: 100-110mcs
Worst case diff: 21% faster
fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally
clear the info about page table entries. The following operations are
supported in this IOCTL:
- Scan the address range and get the memory ranges matching the provided
criteria. This is performed when the output buffer is specified.
- Write-protect the pages. The PM_SCAN_WP_MATCHING is used to write-protect
the pages of interest. The PM_SCAN_CHECK_WPASYNC aborts the operation if
non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING``
can be used with or without PM_SCAN_CHECK_WPASYNC.
- Both of those operations can be combined into one atomic operation where
we can get and write protect the pages as well.
Following flags about pages are currently supported:
- PAGE_IS_WPALLOWED - Page has async-write-protection enabled
- PAGE_IS_WRITTEN - Page has been written to from the time it was write protected
- PAGE_IS_FILE - Page is file backed
- PAGE_IS_PRESENT - Page is present in the memory
- PAGE_IS_SWAPPED - Page is in swapped
- PAGE_IS_PFNZERO - Page has zero PFN
- PAGE_IS_HUGE - Page is THP or Hugetlb backed
This IOCTL can be extended to get information about more PTE bits. The
entire address range passed by user [start, end) is scanned until either
the user provided buffer is full or max_pages have been found.
Peter Xu [Mon, 21 Aug 2023 14:15:13 +0000 (19:15 +0500)]
userfaultfd: UFFD_FEATURE_WP_ASYNC
Patch series "Implement IOCTL to get and optionally clear info about
PTEs", v33.
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() and ResetWriteWatch() syscalls [1]. The GetWriteWatch()
retrieves the addresses of the pages that are written to in a region of
virtual memory.
This syscall is used in Windows applications and games etc. This syscall
is being emulated in pretty slow manner in userspace. Our purpose is to
enhance the kernel such that we translate it efficiently in a better way.
Currently some out of tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these
patches. So the whole gaming on Linux can effectively get benefit from
this. It means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read or/and reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to by-pass this short coming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed if we can revert these patches. But we could not reach to any
conclusion. So at this point, I made couple of tries to solve this whole
VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had potential to cause
regression. We left it behind.
* [8] Keep a list of soft-dirty part of a VMA across splits and merges. I
got the reply don't increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty considering it is too much delicate and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on userfaultfd wp feature where
kernel resolves the faults itself when WP_ASYNC feature is used. It was
straight forward to add WP_ASYNC feature in userfautlfd. Now we get only
those pages dirty or written-to which are really written in reality. (PS
There is another WP_UNPOPULATED userfautfd feature is required which is
needed to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added on the request of CRIU devs to create
interface more generic and better.
Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows
userfaultfd wr-protect faults to be resolved by the kernel directly.
It can be used like a high accuracy version of soft-dirty, without vma
modifications during tracking, and also with ranged support by default
rather than for a whole mm when reset the protections due to existence of
ioctl(UFFDIO_WRITEPROTECT).
Several goals of such a dirty tracking interface:
1. All types of memory should be supported and tracable. This is nature
for soft-dirty but should mention when the context is userfaultfd,
because it used to only support anon/shmem/hugetlb. The problem is for
a dirty tracking purpose these three types may not be enough, and it's
legal to track anything e.g. any page cache writes from mmap.
2. Protections can be applied to partial of a memory range, without vma
split/merge fuss. The hope is that the tracking itself should not
affect any vma layout change. It also helps when reset happens because
the reset will not need mmap write lock which can block the tracee.
3. Accuracy needs to be maintained. This means we need pte markers to work
on any type of VMA.
One could question that, the whole concept of async dirty tracking is not
really close to fundamentally what userfaultfd used to be: it's not "a
fault to be serviced by userspace" anymore. However, using userfaultfd-wp
here as a framework is convenient for us in at least:
1. VM_UFFD_WP vma flag, which has a very good name to suite something like
this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
feature bit to identify from a sync version of uffd-wp registration.
2. PTE markers logic can be leveraged across the whole kernel to maintain
the uffd-wp bit as long as an arch supports, this also applies to this
case where uffd-wp bit will be a hint to dirty information and it will
not go lost easily (e.g. when some page cache ptes got zapped).
3. Reuse ioctl(UFFDIO_WRITEPROTECT) interface for either starting or
resetting a range of memory, while there's no counterpart in the old
soft-dirty world, hence if this is wanted in a new design we'll need a
new interface otherwise.
We can somehow understand that commonality because uffd-wp was
fundamentally a similar idea of write-protecting pages just like
soft-dirty.
This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
far WP_ASYNC seems to not usable if without WP_UNPOPULATE. This also
gives us chance to modify impl of WP_ASYNC just in case it could be not
depending on WP_UNPOPULATED anymore in the future kernels. It's also fine
to imply that because both features will rely on PTE_MARKER_UFFD_WP config
option, so they'll show up together (or both missing) in an UFFDIO_API
probe.
vma_can_userfault() now allows any VMA if the userfaultfd registration is
only about async uffd-wp. So we can track dirty for all kinds of memory
including generic file systems (like XFS, EXT4 or BTRFS).
One trick worth mention in do_wp_page() is that we need to manually update
vmf->orig_pte here because it can be used later with a pte_same() check -
this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
The major defect of this approach of dirty tracking is we need to populate
the pgtables when tracking starts. Soft-dirty doesn't do it like that.
It's unwanted in the case where the range of memory to track is huge and
unpopulated (e.g., tracking updates on a 10G file with mmap() on top,
without having any page cache installed yet). One way to improve this is
to allow pte markers exist for larger than PTE level for PMD+. That will
not change the interface if to implemented, so we can leave that for
later.
Yosry Ahmed [Fri, 22 Sep 2023 17:57:40 +0000 (17:57 +0000)]
mm: memcg: normalize the value passed into memcg_rstat_updated()
memcg_rstat_updated() uses the value of the state update to keep track of
the magnitude of pending updates, so that we only do a stats flush when
it's worth the work. Most values passed into memcg_rstat_updated() are in
pages, however, a few of them are actually in bytes or KBs.
To put this into perspective, a 512 byte slab allocation today would look
the same as allocating 512 pages. This may result in premature flushes,
which means unnecessary work and latency.
Normalize all the state values passed into memcg_rstat_updated() to pages.
Round up non-zero sub-page to 1 page, because memcg_rstat_updated()
ignores 0 page updates.
Yosry Ahmed [Fri, 22 Sep 2023 17:57:39 +0000 (17:57 +0000)]
mm: memcg: refactor page state unit helpers
Patch series "mm: memcg: fix tracking of pending stats updates values", v2.
While working on adjacent code [1], I realized that the values passed into
memcg_rstat_updated() to keep track of the magnitude of pending updates is
consistent. It is mostly in pages, but sometimes it can be in bytes or
KBs. Fix that.
Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch 2
to check and normalize the units of state updates.
memcg_page_state_unit() is currently used to identify the unit of a memcg
state item so that all stats in memory.stat are in bytes. However, it
lies about the units of WORKINGSET_* stats. These stats actually
represent pages, but we present them to userspace as a scalar number of
events. In retrospect, maybe those stats should have been memcg "events"
rather than memcg "state".
In preparation for using memcg_page_state_unit() for other purposes that
need to know the truthful units of different stat items, break it down
into two helpers:
- memcg_page_state_unit() retuns the actual unit of the item.
- memcg_page_state_output_unit() returns the unit used for output.
Use the latter instead of the former in memcg_page_state_output() and
lruvec_page_state_output(). While we are at it, let's show cgroup v1 some
love and add memcg_page_state_local_output() for consistency.
mm/memcg: annotate struct mem_cgroup_threshold_ary with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct
mem_cgroup_threshold_ary.
Mike Kravetz [Tue, 17 Oct 2023 02:55:49 +0000 (19:55 -0700)]
hugetlb: check for hugetlb folio before vmemmap_restore
In commit d8f5f7e445f0 ("hugetlb: set hugetlb page flag before
optimizing vmemmap") checks were added to print a warning if
hugetlb_vmemmap_restore was called on a non-hugetlb page.
This was mostly due to ordering issues in the hugetlb page set up and tear
down sequencees. One place missed was the routine
dissolve_free_huge_page.
Naoya Horiguchi noted: "I saw that VM_WARN_ON_ONCE() in
hugetlb_vmemmap_restore is triggered when memory_failure() is called on a
free hugetlb page with vmemmap optimization disabled (the warning is not
triggered if vmemmap optimization is enabled). I think that we need check
folio_test_hugetlb() before dissolve_free_huge_page() calls
hugetlb_vmemmap_restore_folio()."
Tiezhu Yang [Tue, 11 Jul 2023 09:13:34 +0000 (17:13 +0800)]
selftests/clone3: Fix broken test under !CONFIG_TIME_NS
When execute the following command to test clone3 under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
we can see the following error info:
# [7538] Trying clone3() with flags 0x80 (size 0)
# Invalid argument - Failed to create new process
# [7538] clone3() with flags says: -22 expected 0
not ok 18 [7538] Result (-22) is different than expected (0)
...
# Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0
This is because if CONFIG_TIME_NS is not set, but the flag
CLONE_NEWTIME (0x80) is used to clone a time namespace, it
will return -EINVAL in copy_time_ns().
If kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
will be not exist, and then we should skip clone3() test with
CLONE_NEWTIME.
With this patch under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
...
# Time namespaces are not supported
ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
...
# Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0
Liam R. Howlett [Thu, 12 Oct 2023 15:52:33 +0000 (11:52 -0400)]
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
Users complained about OOM errors during fork without triggering
compaction. This can be fixed by modifying the flags used in
mas_expected_entries() so that the compaction will be triggered in low
memory situations. Since mas_expected_entries() is only used during fork,
the extra argument does not need to be passed through.
Additionally, the two test_maple_tree test cases and one benchmark test
were altered to use the correct locking type so that allocations would not
trigger sleeping and thus fail. Testing was completed with lockdep atomic
sleep detection.
The additional locking change requires rwsem support additions to the
tools/ directory through the use of pthreads pthread_rwlock_t. With this
change test_maple_tree works in userspace, as a module, and in-kernel.
Users may notice that the system gave up early on attempting to start new
processes instead of attempting to reclaim memory.
selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier
Definition for MREMAP_DONTUNMAP is not present in glibc older than 2.32
thus throwing an undeclared error when running make on mm. Including
linux/mman.h solves the build error for people having older glibc.
Oleksij Rempel [Wed, 11 Oct 2023 11:25:19 +0000 (13:25 +0200)]
mailmap: correct email aliasing for Oleksij Rempel
Ensure the current work email addresses for Oleksij Rempel are preserved
and not overridden by private address. Alias the alternate work email to
the primary work email address.
mailmap: map Bartosz's old address to the current one
I no longer work for BayLibre but many DT bindings have my BL address in
the maintainers entries. Map it to the email address I use for kernel
development.
SeongJae Park [Sat, 7 Oct 2023 20:04:32 +0000 (20:04 +0000)]
mm/damon/sysfs: check DAMOS regions update progress from before_terminate()
DAMON_SYSFS can receive DAMOS tried regions update request while kdamond
is already out of the main loop and before_terminate callback
(damon_sysfs_before_terminate() in this case) is not yet called. And
damon_sysfs_handle_cmd() can further be finished before the callback is
invoked. Then, damon_sysfs_before_terminate() unlocks damon_sysfs_lock,
which is not locked by anyone. This happens because the callback function
assumes damon_sysfs_cmd_request_callback() should be called before it.
Check if the assumption was true before doing the unlock, to avoid this
problem.
Arnd Bergmann [Mon, 16 Oct 2023 20:08:38 +0000 (22:08 +0200)]
kasan: disable kasan_non_canonical_hook() for HW tags
On arm64, building with CONFIG_KASAN_HW_TAGS now causes a compile-time
error:
mm/kasan/report.c: In function 'kasan_non_canonical_hook':
mm/kasan/report.c:637:20: error: 'KASAN_SHADOW_OFFSET' undeclared (first use in this function)
637 | if (addr < KASAN_SHADOW_OFFSET)
| ^~~~~~~~~~~~~~~~~~~
mm/kasan/report.c:637:20: note: each undeclared identifier is reported only once for each function it appears in
mm/kasan/report.c:640:77: error: expected expression before ';' token
640 | orig_addr = (addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
This was caused by removing the dependency on CONFIG_KASAN_INLINE that
used to prevent this from happening. Use the more specific dependency
on KASAN_SW_TAGS || KASAN_GENERIC to only ignore the function for hwasan
mode.
Haibo Li [Mon, 9 Oct 2023 07:37:48 +0000 (15:37 +0800)]
kasan: print the original fault addr when access invalid shadow
when the checked address is illegal,the corresponding shadow address from
kasan_mem_to_shadow may have no mapping in mmu table. Access such shadow
address causes kernel oops. Here is a sample about oops on arm64(VA
39bit) with KASAN_SW_TAGS and KASAN_OUTLINE on:
Rik van Riel [Fri, 6 Oct 2023 03:59:08 +0000 (23:59 -0400)]
hugetlbfs: close race between MADV_DONTNEED and page fault
Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
CPU 1 CPU 2
MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page
Fix that race by pulling the locking from __unmap_hugepage_final_range
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
have actually been freed.
Rik van Riel [Fri, 6 Oct 2023 03:59:07 +0000 (23:59 -0400)]
hugetlbfs: extend hugetlb_vma_lock to private VMAs
Extend the locking scheme used to protect shared hugetlb mappings from
truncate vs page fault races, in order to protect private hugetlb mappings
(with resv_map) against MADV_DONTNEED.
Add a read-write semaphore to the resv_map data structure, and use that
from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
race between MADV_DONTNEED and page faults.
Rik van Riel [Fri, 6 Oct 2023 03:59:06 +0000 (23:59 -0400)]
hugetlbfs: clear resv_map pointer if mmap fails
Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
CPU 1 CPU 2
MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page
Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_final_range into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of the
MADV_DONTNEED VMA until the huge pages have actually been freed.
This patch (of 3):
Hugetlbfs leaves a dangling pointer in the VMA if mmap fails. This has
not been a problem so far, but other code in this patch series tries to
follow that pointer.
Johannes Weiner [Fri, 6 Oct 2023 16:00:24 +0000 (12:00 -0400)]
mm: zswap: fix pool refcount bug around shrink_worker()
When a zswap store fails due to the limit, it acquires a pool reference
and queues the shrinker. When the shrinker runs, it drops the reference.
However, there can be multiple store attempts before the shrinker wakes up
and runs once. This results in reference leaks and eventual saturation
warnings for the pool refcount.
Fix this by dropping the reference again when the shrinker is already
queued. This ensures one reference per shrinker run.
Stefan Roesch [Tue, 26 Sep 2023 04:09:37 +0000 (21:09 -0700)]
mm/ksm: add pages_skipped metric
This change adds the "pages skipped" metric. To be able to evaluate how
successful smart page scanning is, the pages skipped metric can be
compared to the pages scanned metric.
The pages skipped metric is a cumulative counter. The counter is stored
under /sys/kernel/mm/ksm/pages_skipped.
Stefan Roesch [Tue, 26 Sep 2023 04:09:36 +0000 (21:09 -0700)]
mm/ksm: add "smart" page scanning mode
Patch series "Smart scanning mode for KSM", v3.
This patch series adds "smart scanning" for KSM.
What is smart scanning?
=======================
KSM evaluates all the candidate pages for each scan. It does not use historic
information from previous scans. This has the effect that candidate pages that
couldn't be used for KSM de-duplication continue to be evaluated for each scan.
The idea of "smart scanning" is to keep historic information. With the historic
information we can temporarily skip the candidate page for one or several scans.
Details:
========
"Smart scanning" is to keep two small counters to store if the page has been
used for KSM. One counter stores how often we already tried to use the page for
KSM and the other counter stores how often we skip a page.
How often we skip the candidate page depends how often a page failed KSM
de-duplication. The code skips a maximum of 8 times. During testing this has
shown to be a good compromise for different workloads.
New sysfs knob:
===============
Smart scanning is not enabled by default. With /sys/kernel/mm/ksm/smart_scan
smart scanning can be enabled.
Monitoring:
===========
To monitor how effective smart scanning is a new sysfs knob has been introduced.
/sys/kernel/mm/pages_skipped report how many pages have been skipped by smart
scanning.
Results:
========
- Various workloads have shown a 20% - 25% reduction in page scans
For the instagram workload for instance, the number of pages scanned has been
reduced from over 20M pages per scan to less than 15M pages.
- Less pages scans also resulted in an overall higher de-duplication rate as
some shorter lived pages could be de-duplicated additionally
- Less pages scanned allows to reduce the pages_to_scan parameter
and this resulted in a 25% reduction in terms of CPU.
- The improvements have been observed for workloads that enable KSM with
madvise as well as prctl
This patch (of 4):
This change adds a "smart" page scanning mode for KSM. So far all the
candidate pages are continuously scanned to find candidates for
de-duplication. There are a considerably number of pages that cannot be
de-duplicated. This is costly in terms of CPU. By using smart scanning
considerable CPU savings can be achieved.
This change takes the history of scanning pages into account and skips the
page scanning of certain pages for a while if de-deduplication for this
page has not been successful in the past.
To do this it introduces two new fields in the ksm_rmap_item structure:
age and remaining_skips. age, is the KSM age and remaining_skips
determines how often scanning of this page is skipped. The age field is
incremented each time the page is scanned and the page cannot be de-
duplicated. age updated is capped at U8_MAX.
How often a page is skipped is dependent how often de-duplication has been
tried so far and the number of skips is currently limited to 8. This
value has shown to be effective with different workloads.
The feature is currently disable by default and can be enabled with the
new smart_scan knob.
The feature has shown to be very effective: upt to 25% of the page scans
can be eliminated; the pages_to_scan rate can be reduced by 40 - 50% and a
similar de-duplication rate can be maintained.
dax, kmem: calculate abstract distance with general interface
Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
used for slow memory type in kmem driver. This limits the usage of kmem
driver, for example, it cannot be used for HBM (high bandwidth memory).
So, we use the general abstract distance calculation mechanism in kmem
drivers to get more accurate abstract distance on systems with proper
support. The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as fallback
only.
Now, multiple memory types may be managed by kmem. These memory types are
put into the "kmem_memory_types" list and protected by
kmem_memory_type_lock.
A memory tiering abstract distance calculation algorithm based on ACPI
HMAT is implemented. The basic idea is as follows.
The performance attributes of system default DRAM nodes are recorded as
the base line. Whose abstract distance is MEMTIER_ADISTANCE_DRAM. Then,
the ratio of the abstract distance of a memory node (target) to
MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
attributes of the node to that of the default DRAM nodes.
The functions to record the read/write latency/bandwidth of the default
DRAM nodes and calculate abstract distance according to read/write
latency/bandwidth ratio will be used by CXL CDAT (Coherent Device
Attribute Table) and other memory device drivers. So, they are put in
memory-tiers.c.
Previously, in hmat_register_target_initiators(), the performance
attributes are calculated and the corresponding sysfs links and files are
created too. Which is called during memory onlining.
But now, to calculate the abstract distance of a memory target before
memory onlining, we need to calculate the performance attributes for a
memory target without creating sysfs links and files.
To do that, hmat_register_target_initiators() is refactored to make it
possible to calculate performance attributes separately.
Patch series "memory tiering: calculate abstract distance based on ACPI
HMAT", v4.
We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
Where, same kind of memory devices will be grouped into memory types,
then put into memory tiers. To describe the performance of a memory type,
abstract distance is defined. Which is in direct proportion to the memory
latency and inversely proportional to the memory bandwidth. To keep the
code as simple as possible, fixed abstract distance is used in dax/kmem to
describe slow memory such as Optane DCPMM.
To support more memory types, in this series, we added the abstract
distance calculation algorithm management mechanism, provided a algorithm
implementation based on ACPI HMAT, and used the general abstract distance
calculation interface in dax/kmem driver. So, dax/kmem can support HBM
(high bandwidth memory) in addition to the original Optane DCPMM.
This patch (of 4):
The abstract distance may be calculated by various drivers, such as ACPI
HMAT, CXL CDAT, etc. While it may be used by various code which hot-add
memory node, such as dax/kmem etc. To decouple the algorithm users and
the providers, the abstract distance calculation algorithms management
mechanism is implemented in this patch. It provides interface for the
providers to register the implementation, and interface for the users.
Multiple algorithm implementations can cooperate via calculating abstract
distance for different memory nodes. The preference of algorithm
implementations can be specified via priority (notifier_block.priority).