Peter Xu [Tue, 5 Mar 2024 04:37:46 +0000 (12:37 +0800)]
mm/kasan: use pXd_leaf() in shadow_mapped()
There is an old trick in shadow_mapped() to use pXd_bad() to detect huge
pages. After commit 93fab1b22ef7 ("mm: add generic p?d_leaf() macros") we
have a global API for huge mappings. Use that to replace the trick.
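A minimal sketch of the idea (illustrative only, not the exact shadow_mapped()
diff): test the huge/leaf property with the generic helper instead of relying
on the pXd_bad() trick:

    #include <linux/pgtable.h>

    /* Illustrative only: check whether a PMD entry maps a huge (leaf) page. */
    static bool pmd_is_huge_mapping(pmd_t pmd)
    {
            return pmd_leaf(pmd);   /* generic leaf check, was the pmd_bad() trick */
    }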
Peter Xu [Tue, 5 Mar 2024 04:37:42 +0000 (12:37 +0800)]
mm/powerpc: replace pXd_is_leaf() with pXd_leaf()
They're the same macros underneath. Drop pXd_is_leaf() and instead always
use pXd_leaf().
In the meantime, instead of renaming them, drop the pXd_is_leaf() fallback
definitions directly in arch/powerpc/include/asm/pgtable.h, because
similar fallback macros for pXd_leaf() are already defined in
include/linux/pgtable.h.
Peter Xu [Tue, 5 Mar 2024 04:37:41 +0000 (12:37 +0800)]
mm/powerpc: define pXd_large() with pXd_leaf()
Patch series "mm/treewide: Replace pXd_large() with pXd_leaf()", v3.
These two APIs are almost always the same. It's confusing to have both of
them. Merge them into one. Here I used pXd_leaf() only because
pXd_leaf() is a global API which is always defined, while pXd_large() is
not.
We have yet one more similar API, pXd_huge(), but that's even trickier, so
let's do it step by step.
Special care is needed for ppc and x86; those are handled as separate
cleanups first.
This patch (of 10):
The two definitions are the same. The only difference is that pXd_large()
is only defined with THP selected, and only on book3s 64-bit.
Instead of implementing it twice, make pXd_large() an alias of pXd_leaf().
Define it unconditionally just like pXd_leaf(). This helps prepare for
merging the two APIs.
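A sketch of the shape of the change (illustrative, not the exact powerpc
header diff): the old name simply becomes an alias of the generic helper,
defined unconditionally:

    /* Illustrative only: alias the old names to the generic helpers. */
    #define pmd_large(pmd)      pmd_leaf(pmd)
    #define pud_large(pud)      pud_leaf(pud)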
Chengming Zhou [Tue, 5 Mar 2024 07:53:45 +0000 (07:53 +0000)]
mm/zswap: global lru and shrinker shared by all zswap_pools fix
Commit bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by all
zswap_pools") introduced a new lock to protect zswap_next_shrink, instead
of reusing zswap_pools_lock.
But the problem is that it's initialized only when zswap is enabled, which
causes a bug if zswap_memcg_offline_cleanup() is called without zswap
enabled.
Fix it by using DEFINE_SPINLOCK() to statically initialize them, and define
them as multiple static variables to keep them consistent with the existing
global variables in zswap.
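A sketch of the fix idea (the lock name below is assumed for illustration;
zswap_next_shrink comes from the description above): statically initialize
the lock so it is valid even if zswap is never enabled:

    /* Illustrative only: static initialization instead of runtime init. */
    static struct mem_cgroup *zswap_next_shrink;
    static DEFINE_SPINLOCK(zswap_shrink_lock);      /* assumed name */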
Qi Zheng [Mon, 4 Mar 2024 11:07:20 +0000 (19:07 +0800)]
s390: supplement for ptdesc conversion
After commit 6326c26c1514 ("s390: convert various pgalloc functions to use
ptdescs"), there are still some positions that use page->{lru, index}
instead of ptdesc->{pt_list, pt_index}. In order to make the use of
ptdesc->{pt_list, pt_index} clearer, it would be better to convert them as
well.
Qi Zheng [Mon, 4 Mar 2024 11:07:19 +0000 (19:07 +0800)]
mm: pgtable: add missing pt_index to struct ptdesc
In s390, the page->index field is used for gmap (see gmap_shadow_pgt()),
so add the corresponding pt_index to struct ptdesc and add a comment to
clarify this.
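A simplified sketch of the idea (not the exact upstream struct ptdesc
layout): pt_index overlays page->index the same way pt_list overlays
page->lru, and carries a comment about its s390 gmap user:

    #include <linux/types.h>
    #include <linux/list.h>

    /* Simplified illustration; the real struct ptdesc has more fields. */
    struct ptdesc_sketch {
            unsigned long __page_flags;             /* overlays page->flags */
            union {
                    struct list_head pt_list;       /* overlays page->lru */
            };
            union {
                    pgoff_t pt_index;       /* overlays page->index; s390 gmap */
            };
    };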
Qi Zheng [Mon, 4 Mar 2024 11:07:18 +0000 (19:07 +0800)]
mm: pgtable: correct the wrong comment about ptdesc->__page_flags
Patch series "minor fixes and supplement for ptdesc".
In this series, the [PATCH 1/3] and [PATCH 2/3] are fixes for some issues
discovered during code inspection.
[PATCH 3/3] is a supplement to the ptdesc conversion in s390. I don't know
why this was not done in commit 6326c26c1514 ("s390: convert various
pgalloc functions to use ptdescs"); maybe I missed something. And since I
don't have an s390 environment, I hope the kernel test robot can help
compile and test it, which is why I did not fold [PATCH 2/3] and
[PATCH 3/3] into one patch.
This patch (of 3):
Commit 32cc0b7c9d50 ("powerpc: add pte_free_defer() for pgtables
sharing page") introduced the use of the PageActive flag for page table
fragment tracking, so ptdesc->__page_flags is no longer unused; correct
the wrong comment.
We actually add folios to the pagelist already, but then work with them as
pages. Removes a call to compound_head() in PageKsm() and removes a
reference to page->index.
Barry Song [Tue, 27 Feb 2024 10:42:01 +0000 (23:42 +1300)]
mm: make folio_pte_batch available outside of mm/memory.c
madvise, mprotect and some others might need folio_pte_batch to check if a
range of PTEs are completely mapped to a large folio with contiguous
physical addresses. Let's make it available in mm/internal.h.
While at it, add proper kernel doc and sanity-check more input parameters
using two additional VM_WARN_ON_FOLIO().
Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the
page & folio into a stack variable so we don't hit BUG_ON() if an
allocation is freed under us and what was a folio pointer becomes a
pointer to a tail page.
We have now successfully removed all of the uses of some of the PageFlags
from the kernel, but there's nothing to stop somebody reintroducing them.
By splitting out FOLIO_FLAGS from PAGEFLAGS, we can stop defining the old
flags; and we do that in some of the later patches.
After doing this, I realised that dump_page() was living dangerously; we
could end up calling folio_test_foo() on a pointer which no longer pointed
to a folio (as dump_page() is not necessarily called when the caller has a
reference to the page). So I fixed that up.
And then I realised that this was the key to making dump_page() take a
const argument, which means we can constify the page flags testing, which
means we can remove more cast-away-the-const bad code.
And here's where I ended up.
This patch (of 8):
We've progressed far enough with the folio transition that some flags are
now no longer checked on pages, but only on folios. To prevent new users
appearing, prepare to only define the folio versions of the flag
test/set/clear.
Gang Li [Thu, 22 Feb 2024 14:04:21 +0000 (22:04 +0800)]
hugetlb: parallelize 1G hugetlb initialization
Optimizing the initialization speed of 1G huge pages through
parallelization.
1G hugetlbs are allocated from bootmem, a process that is already very
fast and does not currently require optimization. Therefore, we focus on
parallelizing only the initialization phase in `gather_bootmem_prealloc`.
Here are some test results:
test case             no patch(ms)   patched(ms)   saved
-------------------   ------------   -----------   ------
256c2T(4 node) 1G     4745           2024          57.34%
128c1T(2 node) 1G     3358           1712          49.02%
12T 1G                77000          18300         76.23%
Gang Li [Thu, 22 Feb 2024 14:04:20 +0000 (22:04 +0800)]
hugetlb: parallelize 2M hugetlb allocation and initialization
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster, thereby
improving the boot speed.
Here are some test results:
test case             no patch(ms)   patched(ms)   saved
-------------------   ------------   -----------   ------
256c2T(4 node) 2M     3336           1051          68.52%
128c1T(2 node) 2M     1943           716           63.15%
Gang Li [Thu, 22 Feb 2024 14:04:18 +0000 (22:04 +0800)]
padata: downgrade padata_do_multithreaded to serial execution for non-SMP
hugetlb parallelization depends on PADATA, and PADATA depends on SMP.
PADATA consists of two distinct functionalities: one part is
padata_do_multithreaded, which disregards order and simply divides tasks
into several groups for parallel execution. Hugetlb init parallelization
depends on padata_do_multithreaded.
The other part is composed of a set of APIs that, while handling data in
an out-of-order parallel manner, can eventually return the data in an
ordered sequence. Currently only crypto/pcrypt.c uses them.
All non-SMP users of PADATA currently only use padata_do_multithreaded.
It is easy to implement a serial fallback for it in include/linux/padata.h,
and it is not necessary to implement the other functionality unless
crypto/pcrypt.c, its only user, stops depending on SMP in the future.
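A hedged sketch of that serial fallback (the real helper lives in
include/linux/padata.h; the struct fields are the existing padata_mt_job
members, but the guard and body here are illustrative):

    #ifndef CONFIG_SMP
    /* No parallel workers available: just run the whole range serially. */
    static inline void padata_do_multithreaded(struct padata_mt_job *job)
    {
            job->thread_fn(job->start, job->start + job->size, job->fn_arg);
    }
    #endif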
Gang Li [Thu, 22 Feb 2024 14:04:17 +0000 (22:04 +0800)]
padata: dispatch works on different nodes
When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.
Thus, a numa_aware flag is introduced here, allowing tasks to be distributed
across different nodes to fully utilize the advantage of multi-node
systems.
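An illustrative caller-side sketch of the new knob (only .numa_aware comes
from this patch; the worker function, context and sizing values are made up
for the example):

    #include <linux/padata.h>
    #include <linux/nodemask.h>

    /* Hypothetical per-chunk worker; initializes items in [start, end). */
    static void __init init_chunk_fn(unsigned long start, unsigned long end,
                                     void *arg)
    {
            /* ... work on items [start, end) using arg ... */
    }

    static void __init parallel_init_example(void *ctx, unsigned long nr_items)
    {
            struct padata_mt_job job = {
                    .thread_fn   = init_chunk_fn,
                    .fn_arg      = ctx,
                    .start       = 0,
                    .size        = nr_items,
                    .align       = 1,
                    .min_chunk   = nr_items / num_node_state(N_MEMORY),
                    .max_threads = 2 * num_node_state(N_MEMORY),
                    .numa_aware  = true,    /* spread work across nodes */
            };

            padata_do_multithreaded(&job);
    }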
Gang Li [Thu, 22 Feb 2024 14:04:16 +0000 (22:04 +0800)]
hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
With parallelization of hugetlb allocation across different threads, each
thread works on a different node to allocate pages from, instead of all
allocating from a common node h->next_nid_to_alloc. To address this, it's
necessary to assign a separate next_nid_to_alloc for each thread.
Consequently, the hstate_next_node_to_alloc and
for_each_node_mask_to_alloc have been modified to directly accept a
*next_nid_to_alloc parameter, ensuring thread-specific allocation and
avoiding concurrent access issues.
Gang Li [Thu, 22 Feb 2024 14:04:15 +0000 (22:04 +0800)]
hugetlb: split hugetlb_hstate_alloc_pages
1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic.
Gang Li [Thu, 22 Feb 2024 14:04:14 +0000 (22:04 +0800)]
hugetlb: code clean for hugetlb_hstate_alloc_pages
Patch series "hugetlb: parallelize hugetlb page init on boot", v6.
Introduction
------------
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes
1-2 seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB
Intel host takes more than 1 minute[1]. This is a noteworthy figure.
Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure like
padata_do_multithreaded; this patch uses it to achieve effective results
with minimal modifications.
To fully utilize the CPU, the number of parallel threads needs to be
carefully considered. `max_threads = num_node_state(N_MEMORY)` does not
fully utilize the CPU, so we need to multiply it by a multiplier.
Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements, the
gains are marginal.
Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient
resource management.
Test result
-----------
test case             no patch(ms)   patched(ms)   saved
-------------------   ------------   -----------   ------
256c2T(4 node) 1G     4745           2024          57.34%
128c1T(2 node) 1G     3358           1712          49.02%
12T 1G                77000          18300         76.23%
The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
code, its readability can be improved, facilitating future modifications.
This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.
- hugetlb_hstate_alloc_pages_node_specific(), which iterates through
each online node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report(), which reports errors during
allocation and updates the value of h->max_huge_pages accordingly.
Chengming Zhou [Wed, 28 Feb 2024 02:38:54 +0000 (02:38 +0000)]
mm/zsmalloc: don't need to reserve LSB in handle
We save the allocated tag in the object header to indicate that it's
allocated.
handle |= OBJ_ALLOCATED_TAG;
So the object header needs to reserve LSB for this tag bit.
But the handle itself doesn't need to reserve the LSB to save the tag,
since it's only used to find the position of the object, by (pfn +
obj_idx). So remove the LSB reservation from the handle; one more bit can
be used for obj_idx.
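A simplified sketch of the encoding described above (the constant and
helper name are illustrative, not the exact zsmalloc code): only the object
header keeps the LSB for OBJ_ALLOCATED_TAG, while the handle packs pfn and
obj_idx with no reserved bit:

    #define OBJ_INDEX_BITS_SKETCH   12      /* assumed width for the example */

    /* Illustrative only: pack pfn and obj_idx into a handle value. */
    static unsigned long location_to_handle(unsigned long pfn,
                                            unsigned int obj_idx)
    {
            /* No LSB reserved in the handle; the tag lives in the header. */
            return (pfn << OBJ_INDEX_BITS_SKETCH) | obj_idx;
    }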
John Hubbard [Wed, 28 Feb 2024 03:41:51 +0000 (19:41 -0800)]
mm/memory.c: do_numa_page(): remove a redundant page table read
do_numa_page() is reading from the same page table entry, twice, while
holding the page table lock: once while checking that the pte hasn't
changed, and again in order to modify the pte.
Instead, just read the pte once, and save it in the same old_pte variable
that already exists. This has no effect on behavior, other than to
provide a tiny potential improvement to performance, by avoiding the
redundant memory read (which the compiler cannot elide, due to
READ_ONCE()).
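A hedged sketch of the read-once pattern, shown as a fragment from inside a
fault handler where vmf, vma, old_pte and pte are in scope (not the exact
do_numa_page() hunk):

    spin_lock(vmf->ptl);
    old_pte = ptep_get(vmf->pte);                     /* single read */
    if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
            pte_unmap_unlock(vmf->pte, vmf->ptl);
            return 0;
    }
    pte = pte_modify(old_pte, vma->vm_page_prot);     /* reuse, no re-read */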
alloc_contig_migrate_range has all the information needed to understand
big contiguous allocation latency: for example, how many pages were
migrated, and how many times they needed to be unmapped from page tables.
This patch adds a trace event to collect the allocation statistics. In
the field, it was quite useful for understanding CMA allocation latency.
We already have a folio; use it instead of the head page where reasonable.
Saves a couple of calls to compound_head() and eliminates a few
references to page->mapping.
Huang Shijie [Tue, 27 Feb 2024 01:49:52 +0000 (09:49 +0800)]
crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabled
In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configured, the kernel
will use vmemmap to do __pfn_to_page/page_to_pfn, and will not use the
"classic sparse" implementation for them.
So export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configured. This
lets user applications (crash, etc.) get faster
pfn_to_page/page_to_pfn operations too.
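A hedged sketch of the export (the exact vmcoreinfo macro used upstream may
differ; VMCOREINFO_SYMBOL_ARRAY is how similar symbols such as mem_section
are exported):

    #ifdef CONFIG_SPARSEMEM_VMEMMAP
            VMCOREINFO_SYMBOL_ARRAY(vmemmap);
    #endif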
Changbin Du [Tue, 27 Feb 2024 02:35:46 +0000 (10:35 +0800)]
modules: wait do_free_init correctly
The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.
Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allow us to call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the guarantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this, callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).
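A minimal sketch of that change (placement is illustrative; only the
flush_work(&init_free_wq) call replacing rcu_barrier() comes from the
description above):

    /* Wait for all queued do_free_init() work before the W+X check runs. */
    flush_work(&init_free_wq);      /* was: rcu_barrier() */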
Without this fix, we could still encounter false positive reports in W+X
checking, since the rcu_barrier() here cannot ensure the ordering now.
Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier() introduces ~0.1s delay on a
PREEMPT_RT kernel.
[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process
mm: use a folio in __collapse_huge_page_copy_succeeded()
These pages are all chained together through the lru list, so we know
they're folios. Use the folio APIs to save three hidden calls to
compound_head().
The few folios which can't be moved to the LRU list (because their
refcount dropped to zero) used to be returned to the caller to dispose of.
Make this simpler to call by freeing the folios directly through
free_unref_folios().
Use free_unref_page_batch() to free the folios. This may increase the
number of IPIs from calling try_to_unmap_flush() more often, but that's
going to be very workload-dependent. It may even reduce the number of
IPIs as we now batch-free large folios instead of freeing them one at a
time.
mm: allow non-hugetlb large folios to be batch processed
Hugetlb folios still get special treatment, but normal large folios can
now be freed by free_unref_folios(). This should have a reasonable
performance impact, TBD.
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare()
destroys the ability to call folio_order(), so stash the order in
folio->private for the benefit of the second loop.
Pass a pointer to the lruvec so we can take advantage of the
folio_lruvec_relock_irqsave(). Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.
Iterate over a folio_batch rather than a linked list. This is easier for
the CPU to prefetch and has a batch count naturally built in so we don't
need to track it. Again, this lowers the maximum lock hold time from
32 folios to 15, but I do not expect this to have a significant effect.
Most of its callees are not yet ready to accept a folio, but we know all
of the pages passed in are actually folios because they're linked through
->lru.
mm: make folios_put() the basis of release_pages()
Patch series "Rearrange batched folio freeing", v3.
Other than the obvious "remove calls to compound_head" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing). There's also an
associated belief that since we iterate the batch of folios three times,
we do better when the array is small (ie 15 entries) than we do with a
batch that is hundreds of entries long, which only gives us the
opportunity for the first pages to fall out of cache by the time we get to
the end.
It is possible we should increase the size of folio_batch. Hopefully the
bots let us know if this introduces any performance regressions.
This patch (of 3):
By making release_pages() call folios_put(), we can get rid of the calls
to compound_head() for the callers that already know they have folios. We
can also get rid of the lock_batch tracking as we know the size of the
batch is limited by folio_batch. This does reduce the maximum number of
pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX (32) to
PAGEVEC_SIZE (15). I do not expect this to make a significant difference,
but if it does, we can increase PAGEVEC_SIZE to 31.
Lance Yang [Tue, 27 Feb 2024 03:51:35 +0000 (11:51 +0800)]
mm/khugepaged: keep mm in mm_slot without MMF_DISABLE_THP check
Previously, we removed the mm from mm_slot and dropped mm_count
if the MMF_DISABLE_THP flag was set. However, we didn't re-add
the mm back after clearing the MMF_DISABLE_THP flag. Additionally,
we add a check for the MMF_DISABLE_THP flag in hugepage_vma_revalidate().
mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins()
Patch series "mm: remove total_mapcount()", v2.
Let's remove the remaining user from mm/memfd.c so we can get rid of
total_mapcount().
This patch (of 2):
Both functions are the remaining users of total_mapcount(). Let's get rid
of the calls by converting the code to folios.
As it turns out, the code is unnecessarily complicated, especially:
1) We can query the number of pagecache references for a folio simply via
folio_nr_pages(). This will handle other folio sizes in the future
correctly.
2) The xas_set(xas, page->index + cache_count) call to increment the
iterator for large folios is not required. Remove it.
Further, simplify the XA_CHECK_SCHED check, counting each entry exactly
once.
Memfd pages can be swapped out when using shmem; leave xa_is_value()
checks in place.
Zi Yan [Mon, 26 Feb 2024 20:55:33 +0000 (15:55 -0500)]
mm: thp: split huge page to any lower order pages
To split a THP to any lower order pages, we need to reform THPs on
subpages at a given order and add the page refcount based on the new page
order.
Also we need to reinitialize page_deferred_list after removing the page
from the split_queue, otherwise a subsequent split will see list
corruption when checking the page_deferred_list again.
Note: Anonymous order-1 folio is not supported because _deferred_list,
which is used by partially mapped folios, is stored in subpage 2 and an
order-1 folio only has subpage 0 and 1. File-backed order-1 folios are
fine, since they do not use _deferred_list.
Zi Yan [Mon, 26 Feb 2024 20:55:31 +0000 (15:55 -0500)]
mm: memcg: make memcg huge page split support any order split
It sets memcg information for the pages after the split. A new parameter
new_order is added to tell the order of subpages in the new page, always 0
for now. It prepares for upcoming changes to support split huge page to
any lower order.
Folios of order 1 have no space to store the deferred list. This is not a
problem for the page cache as file-backed folios are never placed on the
deferred list. All we need to do is prevent the core MM from touching the
deferred list for order 1 folios and remove the code which prevented us
from allocating order 1 folios.
Zi Yan [Mon, 26 Feb 2024 20:55:27 +0000 (15:55 -0500)]
mm/huge_memory: only split PMD mapping when necessary in unmap_folio()
Patch series "Split a folio to any lower order folios", v5.
File folios support any order and multi-size THP is upstreamed[1], so both
file and anonymous folios can be >0 order. Currently, split_huge_page()
only splits a huge page to order-0 pages, but splitting to orders higher
than 0 might better utilize large folios, if done properly. In addition,
Large Block Size support in XFS would benefit from it during truncate[2].
This patchset adds support for splitting a large folio to any lower order
folios.
In addition to this implementation of split_huge_page_to_list_to_order(),
a possible optimization could be splitting a large folio to arbitrary
smaller folios instead of a single order. As both Hugh and Ryan pointed
out [3,5], splitting to a single order might not be optimal; an order-9
folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and
2 order-0 folios, depending on subsequent folio operations. Leave this as
future work.
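A hedged usage sketch of the interface named above (the argument meanings
are assumed from the description: the page, an optional list for the
resulting folios, and the target order; the wrapper name is made up):

    /* Illustrative only: split a locked, referenced folio to new_order. */
    static int try_split_to_order(struct folio *folio, unsigned int new_order)
    {
            return split_huge_page_to_list_to_order(&folio->page, NULL,
                                                    new_order);
    }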
As multi-size THP support is added, not all THPs are PMD-mapped, thus
during a huge page split, there is no need to always split PMD mapping in
unmap_folio(). Make it conditional.
Introduce a GFP bits enumeration to let the compiler track the number of
used bits (which depends on the config options) instead of hardcoding
them. That simplifies the __GFP_BITS_SHIFT calculation.
Barry Song [Mon, 26 Feb 2024 00:57:39 +0000 (13:57 +1300)]
mm: madvise: pageout: ignore references rather than clearing young
While doing MADV_PAGEOUT, the current code will clear PTE young so that
vmscan won't read young flags to allow the reclamation of madvised folios
to go ahead. It seems we can do it by directly ignoring references, thus
we can remove tlb flush in madvise and rmap overhead in vmscan.
Regarding the side effect: in the original code, if a parallel thread
accesses the madvised memory side by side with the thread doing madvise,
folios get a chance to be re-activated by vmscan (though the time gap is
actually quite small, since checking PTEs is done immediately after
clearing their young bits). With this patch, they will still be reclaimed.
But this behaviour, doing PAGEOUT and accessing the memory at the same
time, is quite silly, like a DoS, so we probably don't need to care; or
rather, ignoring the new accesses during that quite small time gap is even
better.
For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep
its behaviour as is since a physical address might be mapped by multiple
processes. MADV_PAGEOUT based on virtual address is actually much more
aggressive on reclamation. To leave paddr's DAMOS_PAGEOUT untouched, we
simply pass ignore_references as false to reclaim_pages().
A microbench as below has shown a 6% decrease in the latency of
MADV_PAGEOUT (only the core of the benchmark is shown; p points to the
madvised buffer):

    for (i = 0; i < SIZE / sizeof(long); i += PGSIZE / sizeof(long))
            p[i] = 0x11;
    madvise(p, SIZE, MADV_PAGEOUT);
    w/o patch                       w/ patch
    root@10:~# time ./a.out         root@10:~# time ./a.out
    real    0m49.634s               real    0m46.334s
    user    0m0.637s                user    0m0.648s
    sys     0m47.434s               sys     0m44.265s
The contpte symbols must be exported since some of the public inline
ptep_* APIs are called from modules and these inlines now call the contpte
functions. Originally they were exported as EXPORT_SYMBOL() for fear of
breaking out-of-tree modules. But we subsequently concluded that
EXPORT_SYMBOL_GPL() should be safe since these functions are deeply core
mm routines, and any module operating at this level is not going to be
able to survive on EXPORT_SYMBOL alone.
Barry Song [Sat, 24 Feb 2024 22:47:51 +0000 (11:47 +1300)]
Docs/mm/damon/design: remove the details for pageout as paddr doesn't use MADV_PAGEOUT
The doc needs a fix. Only in the case of virtual addresses do we call
madvise() with MADV_PAGEOUT; in the case of physical addresses, we call
reclaim_pages() directly. MADV_PAGEOUT on virtual addresses is much more
aggressive at reclaiming memory compared to reclaim_pages() on a paddr
region. This patch removes the details so that the description can apply
to both cases, and so that we don't couple the documentation with the
implementation details.
kasan: fix a2 allocation and remove explicit cast in atomic tests
Address the additional feedback since commit 4e76c8cc3378 ("kasan: add
atomic tests") by removing an explicit cast and fixing the size as well as
the check of the allocation of `a2`.
Dan Carpenter [Fri, 23 Feb 2024 14:20:13 +0000 (17:20 +0300)]
lib/stackdepot: off by one in depot_fetch_stack()
The stack_pools[] array has DEPOT_MAX_POOLS. The "pools_num" tracks the
number of pools which are initialized. See depot_init_pool() for more
details.
If pool_index == pools_num_cached, this will read one element beyond what
we want. If not all the pools are initialized, then the pool will be
NULL, triggering a WARN(), and if they are all initialized it will read
one element beyond the end of the array.
Carlos Galo [Fri, 23 Feb 2024 17:32:49 +0000 (17:32 +0000)]
mm: update mark_victim tracepoints fields
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it’s not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
In Android each installed application has a unique UID. Including
the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
Enables identification of the affected process.
- OOM Score
Will allow userspace to get additional insight into the relative kill
priority of the OOM victim. In Android, the oom_score_adj is used to
categorize app state (foreground, background, etc.), which aids in
analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
Amount of memory used by the victim that will, potentially, be freed up
by killing it.
hugetlb: allow faults to be handled under the VMA lock
Hugetlb can now safely handle faults under the VMA lock, so allow it to do
so.
This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb
counters, and expects the counters to remain unchanged on failure to
handle a fault.
In hugetlb_no_page(), vmf_anon_prepare() may bailout with no anon_vma
under the VMA lock after allocating a folio for the hugepage. In
free_huge_folio(), this folio is completely freed on bailout iff there is
a surplus of hugetlb pages. This will remove a folio off the freelist and
decrement the number of hugepages while ltp expects these counters to
remain unchanged on failure.
Originally this could only happen due to OOM failures, but now it may also
occur after we allocate a hugetlb folio without a suitable anon_vma under
the VMA lock. This should only happen for the first freshly allocated
hugepage in this vma.
hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()
hugetlb_no_page() and hugetlb_wp() call anon_vma_prepare(). In
preparation for hugetlb to safely handle faults under the VMA lock, use
vmf_anon_prepare() here instead.
Additionally, passing hugetlb_wp() the vm_fault struct from
hugetlb_fault() works toward cleaning up the hugetlb code and function
stack.
hugetlb: move vm_fault declaration to the top of hugetlb_fault()
hugetlb_fault() currently defines a vm_fault to pass to the generic
handle_userfault() function. We can move this definition to the top of
hugetlb_fault() so that it can be used throughout the rest of the hugetlb
fault path.
This will help cleanup a number of excess variables and function arguments
throughout the stack. Also, since vm_fault already has space to store the
page offset, use that instead and get rid of idx.
mm/memory: change vmf_anon_prepare() to be non-static
Patch series "Handle hugetlb faults under the VMA lock", v2.
It is generally safe to handle hugetlb faults under the VMA lock. The
only time this is unsafe is when no anon_vma has been allocated to this
vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare() to
bailout if necessary. This should only happen for the first hugetlb page
in the vma.
Additionally, this patchset begins to use struct vm_fault within
hugetlb_fault(). This works towards cleaning up hugetlb code, and should
significantly reduce the number of arguments passed to functions.
The last patch in this series may cause ltp hugemmap10 to "fail". This is
because vmf_anon_prepare() may bailout with no anon_vma under the VMA lock
after allocating a folio for the hugepage. In free_huge_folio(), this
folio is completely freed on bailout iff there is a surplus of hugetlb
pages. This will remove a folio off the freelist and decrement the number
of hugepages while ltp expects these counters to remain unchanged on
failure. The rest of the ltp testcases pass.
This patch (of 2):
In order to handle hugetlb faults under the VMA lock, hugetlb can use
vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change it
to be a non-static function so it can be used within hugetlb as well.
Yosry Ahmed [Thu, 22 Feb 2024 19:09:11 +0000 (19:09 +0000)]
x86/mm: always pass NULL as the first argument of switch_mm_irqs_off()
The first argument of switch_mm_irqs_off() is unused by the x86
implementation. Make sure that x86 code never passes a non-NULL value to
make this clear. Update the only non-violating caller, switch_mm().
Yosry Ahmed [Thu, 22 Feb 2024 19:09:10 +0000 (19:09 +0000)]
x86/mm: further clarify switch_mm_irqs_off() documentation
Commit accf6b23d1e5a ("x86/mm: clarify "prev" usage in
switch_mm_irqs_off()") attempted to clarify x86's usage of the arguments
passed by generic code, specifically the "prev" argument that is unused by
x86. However, it could have done a better job with the comment above
switch_mm_irqs_off(). Rewrite this comment according to Dave Hansen's
suggestion.
Matthew Cassell [Thu, 22 Feb 2024 19:46:17 +0000 (19:46 +0000)]
mm/util.c: add byte count to __vm_enough_memory failure warning
Commit 44b414c8715c5dcf53288 ("mm/util.c: add warning if
__vm_enough_memory fails") adds debug information which gives the process
id and executable name should __vm_enough_memory() fail. Adding the
number of pages to the failure message would benefit application
developers and system administrators in debugging overambitious memory
requests by providing a point of reference to the amount of memory causing
__vm_enough_memory() to fail.
1. Set appropriate kernel tunable to reach code path for failure
message:
# echo 2 > /proc/sys/vm/overcommit_memory
2. Test program to generate failure - requests 1 gibibyte per
iteration:
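A hypothetical reproducer along those lines (this is not the commit's
original test program): with vm.overcommit_memory set to 2, each 1 GiB
mapping is charged against the commit limit at mmap() time, so the loop
eventually fails and triggers the __vm_enough_memory warning:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(void)
    {
            const size_t one_gib = 1UL << 30;

            for (;;) {
                    /* With vm.overcommit_memory=2 the charge happens here. */
                    void *p = mmap(NULL, one_gib, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                    if (p == MAP_FAILED) {
                            printf("mmap: %s\n", strerror(errno));
                            return 0;
                    }
            }
    }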
Chengming Zhou [Fri, 16 Feb 2024 08:55:05 +0000 (08:55 +0000)]
mm/zswap: change zswap_pool kref to percpu_ref
All zswap entries take a reference on the zswap_pool in zswap_store(),
and drop it when freed. Changing it to use a percpu_ref is better for
scalability.
Although percpu_ref uses a bit more memory, that should be OK for our use
case, since we almost always have only one zswap_pool in use. The
performance gain is for the zswap_store/load hotpath.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
            mm-unstable     zswap-global-lru
    real    63.20           63.12
    user    1061.75         1062.95
    sys     268.74          264.44
Chengming Zhou [Fri, 16 Feb 2024 08:55:04 +0000 (08:55 +0000)]
mm/zswap: global lru and shrinker shared by all zswap_pools
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3.
Dynamic pool creation has been supported for a long time, though it is
maybe not used so much in practice. But with the per-memcg lru merged, the
current structure of zswap_pool's lru and shrinker becomes less optimal.
In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the latest zswap_pool will be the one currently used.
1. When there is memory pressure, all the zswap_pools' shrinkers will try
to shrink their lru lists; there is no order between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its own lru, which is inefficient.
A more natural way is to have a global zswap lru shared between all
zswap_pools, and so is the shrinker. The code becomes much simpler too.
Another optimization is changing the zswap_pool kref to a percpu_ref,
which is taken as a reference by every zswap entry, so scalability is
better.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
            mm-unstable     zswap-global-lru
    real    63.20           63.12
    user    1061.75         1062.95
    sys     268.74          264.44
This patch (of 3):
Dynamic zswap_pool creation may create/reuse multiple zswap_pools in a
list, and only the first one is currently used.
Each zswap_pool has its own lru and shrinker, which is not necessary and
has its problems:
1. When there is memory pressure, all the zswap_pools' shrinkers will
try to shrink their own lrus; there is no order between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its lru list. The rationale here was to
try and empty the old pool first so that we can completely
drop it. However, since we only support exclusive loads now,
the LRU ordering should be entirely decided by the order of
stores, so the oldest entries on the LRU will naturally be
from the oldest pool.
Anyway, having a global lru and shrinker shared by all zswap_pools is
better and more efficient.
The twist here is that the error value is passed by reference, so that the
iterator can restore it when breaking out of the loop.
Handling of the magic AOP_WRITEPAGE_ACTIVATE value stays outside the
iterator and is just kept in the write_cache_pages legacy wrapper, in
preparation for eventually killing it off.
Heavily based on a for_each* based iterator from Matthew Wilcox.
Add a loop counter inside the folio_batch to let us iterate from 0-nr
instead of decrementing nr and treating the batch as a stack. It would
generate some very weird and suboptimal I/O patterns for page writeback to
iterate over the batch as a stack.
writeback: simplify the loops in write_cache_pages()
Collapse the two nested loops into one. This is needed as a step towards
turning this into an iterator.
Note that this drops the "index <= end" check in the previous outer loop
and just relies on filemap_get_folios_tag() to return 0 entries when index
> end. This actually has a subtle implication when end == -1 because then
the returned index will be -1 as well and thus if there is page present on
index -1, we could be looping indefinitely. But as the comment in
filemap_get_folios_tag documents this as already broken anyway we should
not worry about it here either. The fix for that would probably be a change
to the filemap_get_folios_tag() calling convention.
writeback: factor writeback_get_batch() out of write_cache_pages()
This simple helper will be the basis of the writeback iterator. To make
this work, we need to remember the current index and end positions in
writeback_control.