Patch series "Increase the number of bits available in page_type".
Kent wants more than 16 bits in page_type, so I resurrected this old patch
and expanded it a bit. Testing a single page type becomes a bit more
efficient than with our current scheme (one 4-byte instruction vs three
instructions totalling 13 bytes).
This patch (of 4):
An upcoming patch will convert page type from being a bitfield to a
single byte, so we will not be able to use %pG to print the page type
any more. The printing of the symbolic name will be restored in that
patch.
Pedro Falcato [Sat, 17 Aug 2024 00:18:34 +0000 (01:18 +0100)]
selftests/mm: add more mseal traversal tests
Add more mseal traversal tests across VMAs, where we could possibly screw
up sealing checks. These tests exercise cross-VMA traversal for mprotect,
munmap and madvise. In particular, we test the case where a regular
VMA is followed by a sealed VMA.
Pedro Falcato [Sat, 17 Aug 2024 00:18:32 +0000 (01:18 +0100)]
mseal: replace can_modify_mm_madv with a vma variant
Replace can_modify_mm_madv() with a single vma variant, and associated
checks in madvise.
While we're at it, also invert the order of checks in:
if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma)))
Checking whether we can modify the vma itself (through vm_flags) is certainly
cheaper than is_ro_anon(), because arch_vma_access_permitted() looks at
e.g. pkeys registers (with extra branches) on some architectures.
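A minimal sketch of the reordered check (names taken from the text above;
the error value shown is illustrative, not quoted from the patch):

  if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
          return -EPERM;  /* cheap vm_flags test runs before the pkeys path */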
This patch allows for partial madvise success when finding a sealed VMA,
which historically has been allowed in Linux.
Pedro Falcato [Sat, 17 Aug 2024 00:18:31 +0000 (01:18 +0100)]
mm/mremap: replace can_modify_mm with can_modify_vma
Delegate all can_modify checks to the proper places. Unmap checks are
done in do_unmap (et al). The source VMA check is done purposefully
before unmapping, to keep the original mseal semantics.
Pedro Falcato [Sat, 17 Aug 2024 00:18:28 +0000 (01:18 +0100)]
mm: move can_modify_vma to mm/vma.h
Patch series "mm: Optimize mseal checks", v3.
Optimize mseal checks by removing the separate can_modify_mm() step, and
just doing checks on the individual vmas, when various operations are
themselves iterating through the tree. This provides a nice speedup and
restores performance parity with pre-mseal[3].
Yafang Shao [Tue, 20 Aug 2024 02:26:39 +0000 (10:26 +0800)]
mm: allow read-ahead with IOCB_NOWAIT set
Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9
("mm: allow read-ahead with IOCB_NOWAIT set"). However, this
implementation broke the semantics of IOCB_NOWAIT by potentially causing
it to wait on I/O during memory reclamation. This behavior was later
modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").
To resolve the blocking issue during memory reclamation, we can use
memalloc_noio_{save,restore} to ensure non-blocking behavior. This change
restores the original functionality, allowing preadv2(IOCB_NOWAIT) to
trigger readahead if the file content is not present in the page cache.
While this process may trigger direct memory reclamation, the
__GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't
block.
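As a rough sketch of the scoping pattern referred to above (not the actual
hunk from the patch), any allocation performed between the save/restore
pair behaves as if __GFP_IO were cleared, so readahead issued in that
window cannot block on I/O during reclaim:

  unsigned int noio_flags;

  noio_flags = memalloc_noio_save();
  /* ... issue readahead; page allocations here will not block on I/O ... */
  memalloc_noio_restore(noio_flags);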
A use case for this change is when we want to trigger readahead in the
preadv2(2) syscall if the file cache is absent, but without waiting for
certain filesystem locks, like xfs_ilock. A simple example is as follows:
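The example itself is not reproduced in this digest; a minimal userspace
illustration of the use case might look like the following (file name and
buffer size are made up):

  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/uio.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[4096];
          struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
          int fd = open("data.bin", O_RDONLY);    /* hypothetical file */

          if (fd < 0)
                  return 1;

          /* Do not wait for I/O or filesystem locks; with this change a
           * page cache miss still kicks off readahead in the background. */
          ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
          if (n < 0 && errno == EAGAIN)
                  printf("not cached yet, readahead has been triggered\n");
          else
                  printf("read %zd bytes from the page cache\n", n);

          close(fd);
          return 0;
  }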
Kefeng Wang [Tue, 20 Aug 2024 03:26:30 +0000 (11:26 +0800)]
mm: remove migration for HugePage in isolate_single_pageblock()
The gigantic page size may be larger than the memory block size, so memory
offline always fails in this case after commit b2c9e2fbba32 ("mm: make
alloc_contig_range work at pageblock granularity"):
[ 15.815756] memory offlining [mem 0x3c0000000-0x3c7ffffff] failed due to failure to isolate range
Gigantic PageHuge is bigger than a pageblock, but since it is freed as
order-0 pages, its pageblocks after being freed will get to the right
free list. There is no need to have special handling code for them in
start_isolate_page_range(). For both alloc_contig_range() and memory
offline cases, the migration code after start_isolate_page_range() will
be able to migrate gigantic PageHuge when possible. Let's clean up
start_isolate_page_range() and fix the aforementioned memory offline
failure issue altogether.
Baolin Wang [Tue, 20 Aug 2024 09:49:16 +0000 (17:49 +0800)]
mm: khugepaged: support shmem mTHP collapse
Shmem already supports the allocation of mTHP, but khugepaged does not yet
support collapsing mTHP folios. Now khugepaged is ready to support mTHP,
and this patch enables the collapse of shmem mTHP.
mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
We already force-inline page_fixed_fake_head(), page_is_fake_head() and
PageTail(), however the compiler might decide that _compound_head() is not
worthy to be inlined, because of page_fixed_fake_head().
The result is that, for example, PageAnonExclusive() now might involve a
function call when checking PageHuge(), which performs a
page_folio()->_compound_head() call. This can lead to a slight regression
of the stress-ng.clone benchmark.
This is not super-urgent to fix, but always inlining _compound_head()
seems like the obvious thing to do for this primitive, similar to the
other ones.
This change resolves the slight regression, and a compilation with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y shows no relevant bloat [2]:
Uros Bizjak [Sun, 18 Aug 2024 21:01:52 +0000 (23:01 +0200)]
mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space
Use IS_ERR_PCPU() instead of IS_ERR() for pointers in the percpu address
space. The patch also fixes the following sparse warnings:
kmemleak.c:1063:39: warning: cast removes address space '__percpu' of expression
kmemleak.c:1138:37: warning: cast removes address space '__percpu' of expression
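For illustration only (assuming the IS_ERR_PCPU()/PTR_ERR_PCPU() helpers
introduced earlier in this series), a __percpu pointer can now be checked
without casting away its address space, which is what produced the
warnings above:

  /* hypothetical helper that returns an error encoded in a percpu pointer */
  void __percpu *ptr = create_percpu_object();

  if (IS_ERR_PCPU(ptr))
          return PTR_ERR_PCPU(ptr);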
Jinjiang Tu [Mon, 19 Aug 2024 13:06:09 +0000 (21:06 +0800)]
selftests/mm: remove unnecessary ia64 code and comment
IA64 has gone with commit cf8e8658100d ("arch: Remove Itanium (IA-64)
architecture"), so remove unnecessary ia64 special mm code and comment in
selftests too.
Danilo Krummrich [Mon, 12 Aug 2024 22:34:35 +0000 (00:34 +0200)]
mm: krealloc: clarify valid usage of __GFP_ZERO
Properly document that if __GFP_ZERO logic is requested, callers must
ensure that, starting with the initial memory allocation, every subsequent
call to this API for the same memory allocation is flagged with
__GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully
honored by this API.
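A minimal sketch of the contract being documented (error handling is
abbreviated and the sizes are arbitrary):

  void *buf, *tmp;

  buf = kzalloc(64, GFP_KERNEL);          /* first allocation already zeroed */
  if (!buf)
          return -ENOMEM;

  tmp = krealloc(buf, 128, GFP_KERNEL | __GFP_ZERO);  /* keep passing __GFP_ZERO */
  if (!tmp) {
          kfree(buf);
          return -ENOMEM;
  }
  buf = tmp;                              /* bytes 64..127 are guaranteed zero */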
Danilo Krummrich [Mon, 12 Aug 2024 22:34:34 +0000 (00:34 +0200)]
mm: krealloc: consider spare memory for __GFP_ZERO
As long as krealloc() is called with __GFP_ZERO consistently, starting
with the initial memory allocation, __GFP_ZERO should be fully honored.
However, if for an existing allocation krealloc() is called with a
decreased size, it is not ensured that the spare portion of the allocation
is zeroed. Thus, if krealloc() is subsequently called with a larger size
again, __GFP_ZERO can't be fully honored, since we don't know the previous
size, but only the bucket size.
We have some cases left whereby we operate on small folios and still refer
to page->_mapcount. Let's just use folio->_mapcount instead, which
currently still overlays page->_mapcount, so no change.
This change will make it easier to later spot any remaining users of
page->_mapcount that target tail pages.
Yu Zhao [Wed, 14 Aug 2024 03:54:51 +0000 (21:54 -0600)]
mm/hugetlb: use __GFP_COMP for gigantic folios
Use __GFP_COMP for gigantic folios to greatly reduce not only the amount
of code but also the allocation and free time.
LOC (approximately): +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        Before  After
Alloc   ~13s    ~10s
Free    ~15s    <1s
The above magnitude generally holds for multiple x86 and arm64 CPU models.
Yu Zhao [Wed, 14 Aug 2024 03:54:50 +0000 (21:54 -0600)]
mm/cma: add cma_{alloc,free}_folio()
With alloc_contig_range() and free_contig_range() supporting large folios,
CMA can allocate and free large folios too, by cma_alloc_folio() and
cma_free_folio().
Yu Zhao [Wed, 14 Aug 2024 03:54:49 +0000 (21:54 -0600)]
mm/contig_alloc: support __GFP_COMP
Patch series "mm/hugetlb: alloc/free gigantic folios", v2.
Using __GFP_COMP for gigantic folios can greatly reduce not only the amount
of code but also the allocation and free time.
Approximate LOC to mm/hugetlb.c: +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        Before  After
Alloc   ~13s    ~10s
Free    ~15s    <1s
The above magnitude generally holds for multiple x86 and arm64 CPU
models.
Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
success the function returns a large folio prepared by prep_new_page(),
rather than a range of order-0 pages prepared by split_free_pages() (which
is renamed from split_map_pages()).
alloc_contig_range() can be used to allocate folios larger than
MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
free_one_page() needs to handle that by split_large_buddy().
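A hedged sketch of how a caller can use the new mode as described above
(the wrappers in CMA and hugeTLB differ in detail; start_pfn and nr_pages
stand for the target PFN range, and this only shows the idea):

  int ret;
  struct folio *folio;

  ret = alloc_contig_range(start_pfn, start_pfn + nr_pages,
                           MIGRATE_MOVABLE, GFP_KERNEL | __GFP_COMP);
  if (ret)
          return NULL;

  /* the whole range is now one compound page, i.e. a single large folio */
  folio = page_folio(pfn_to_page(start_pfn));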
Kaiyang Zhao [Wed, 14 Aug 2024 17:42:27 +0000 (17:42 +0000)]
mm,memcg: provide per-cgroup counters for NUMA balancing operations
The ability to observe the demotion and promotion decisions made by the
kernel on a per-cgroup basis is important for monitoring and tuning
containerized workloads on machines equipped with tiered memory.
Different containers in the system may experience drastically different
memory tiering actions that cannot be distinguished from the global
counters alone.
For example, a container running a workload with much hotter memory
accesses will likely see more promotions and fewer demotions, potentially
depriving a colocated container of top tier memory to such an extent that
its performance degrades unacceptably.
For another example, some containers may exhibit longer periods between
data reuse, causing much more numa_hint_faults than numa_pages_migrated.
In this case, tuning hot_threshold_ms may be appropriate, but the signal
can easily be lost if only global counters are available.
In the long term, we hope to introduce per-cgroup control of promotion and
demotion actions to implement memory placement policies in tiering.
This patch set adds seven counters to memory.stat in a cgroup:
numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd,
pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_*
and pgpromote_success are also available in memory.numa_stat.
count_memcg_events_mm() is added to count multiple event occurrences at
once, and get_mem_cgroup_from_folio() is added because we need to get a
reference to the memcg of a folio before it's migrated to track
numa_pages_migrated. The accounting of PGDEMOTE_* is moved to
shrink_inactive_list() before being changed to per-cgroup.
Andrey Konovalov [Tue, 13 Aug 2024 22:40:27 +0000 (00:40 +0200)]
kasan: simplify and clarify Makefile
When KASAN support was being added to the Linux kernel, GCC did not yet
support all of the KASAN-related compiler options. Thus, the KASAN
Makefile had to probe the compiler for supported options.
Nowadays, the Linux kernel GCC version requirement is 5.1+, and thus we
don't need the probing of the -fasan-shadow-offset parameter: it exists in
all 5.1+ GCCs.
Simplify the KASAN Makefile to drop CFLAGS_KASAN_MINIMAL.
Also add a few more comments and unify the indentation.
Baolin Wang [Mon, 12 Aug 2024 07:42:10 +0000 (15:42 +0800)]
mm: shmem: support large folio swap out
Shmem will support large folio allocation [1] [2] for better performance;
however, memory reclaim still splits the precious large folios when trying
to swap out shmem, which may lead to memory fragmentation and cannot take
advantage of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting, hence this patch set supports large folio swap out for shmem.
Note the i915_gem_shmem driver still needs the folio to be split when
swapping, thus add a new flag 'split_large_folio' for writeback_control to
indicate splitting the large folio.
Baolin Wang [Mon, 12 Aug 2024 07:42:09 +0000 (15:42 +0800)]
mm: shmem: split large entry if the swapin folio is not large
Now the swap device can only swap in order-0 folios, even though a large
folio may have been swapped out. This requires us to split the large entry
previously saved in the shmem pagecache to support swapping in small
folios.
Baolin Wang [Mon, 12 Aug 2024 07:42:08 +0000 (15:42 +0800)]
mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from shmem pagecache after adding into
swap cache.
Baolin Wang [Mon, 12 Aug 2024 07:42:07 +0000 (15:42 +0800)]
mm: shmem: support large folio allocation for shmem_replace_folio()
To support large folio swapin for shmem in the following patches, add
large folio allocation for the new replacement folio in
shmem_replace_folio(). Moreover, large folios occupy N consecutive entries
in the swap cache instead of using multi-index entries like the page
cache; therefore we should replace each of the consecutive entries in the
swap cache instead of using shmem_replace_entry().
Also update the statistics and the folio reference count using the number
of pages in the folio.
Baolin Wang [Mon, 12 Aug 2024 07:42:06 +0000 (15:42 +0800)]
mm: shmem: use swap_free_nr() to free shmem swap entries
As a preparation for supporting shmem large folio swapout, use
swap_free_nr() to free the contiguous swap entries of the shmem large
folio when the large folio has been swapped in from the swap cache. In
addition, the index should also be rounded down to the number of pages
when adding the swapin folio into the pagecache.
Baolin Wang [Mon, 12 Aug 2024 07:42:05 +0000 (15:42 +0800)]
mm: filemap: use xa_get_order() to get the swap entry order
In the following patches, shmem will support swapping out large folios,
which means the shmem mappings may contain large-order swap entries, so
use xa_get_order() to get the folio order of the shmem swap entry and
update '*start' correctly.
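A rough sketch of the idea (not the exact hunk): a large swap entry
occupies 1 << order slots in the mapping, so the start index has to skip
all of them (mapping, xas and start are placeholders for the surrounding
filemap context):

  int order = xa_get_order(&mapping->i_pages, xas.xa_index);

  *start = xas.xa_index + (1UL << order);  /* step over the whole entry */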
Daniel Gomez [Mon, 12 Aug 2024 07:42:04 +0000 (15:42 +0800)]
mm: shmem: return number of pages being freed in shmem_free_swap
Both shmem_free_swap callers expect the number of pages being freed. In
the large folios context, this needs to support larger values other than 0
(used as 1 page being freed) and -ENOENT (used as 0 pages being freed).
In preparation for large folios adoption, make the shmem_free_swap routine
return the number of pages being freed. So, returning 0 in this context
means 0 pages being freed.
While we are at it, switch to free_swap_and_cache_nr() to free large-order
swap entries (change by Baolin Wang).
Baolin Wang [Mon, 12 Aug 2024 07:42:03 +0000 (15:42 +0800)]
mm: shmem: extend shmem_partial_swap_usage() to support large folio swap
To support shmem large folio swapout in the following patches, use
xa_get_order() to get the order of the swap entry when calculating the
swap usage of shmem.
Baolin Wang [Mon, 12 Aug 2024 07:42:02 +0000 (15:42 +0800)]
mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
Patch series "support large folio swap-out and swap-in for shmem", v5.
Shmem will support large folio allocation [1] [2] for better performance;
however, memory reclaim still splits the precious large folios when trying
to swap out shmem, which may lead to memory fragmentation and cannot take
advantage of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting, and the large folio swap-in [3] series is queued in the
mm-unstable branch. Hence this patch set also supports large folio
swap-out and swap-in for shmem.
This patch (of 9):
To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem
swap entries.
While we are at it, use folio_nr_pages() to get the number of pages of
the folio as a preparation.
Barry Song [Wed, 7 Aug 2024 21:58:58 +0000 (09:58 +1200)]
mm: rename instances of swap_info_struct to meaningful 'si'
Patch series "mm: batch free swaps for zap_pte_range()", v3.
Batch free swap slots for zap_pte_range(), making munmap three times
faster when the page table entries are filled with swap entries to
be freed. This is likely another advantage of using mTHP.
This patch (of 3):
"p" means "pointer to something", rename it to a more meaningful
identifier - "si". We also have a case with the name "sis", rename it to
"si" as well.
mm: make range-to-target_node lookup facility a part of numa_memblks
The x86 implementation of range-to-target_node lookup (i.e.
phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
numa_memblks.
Since numa_memblks are now part of the generic code, move these functions
from x86 to mm/numa_memblks.c and select CONFIG_NUMA_KEEP_MEMINFO when
CONFIG_NUMA_MEMBLKS=y for dax and cxl.
Until now arch_numa was directly translating firmware NUMA information
to memblock.
Using numa_memblks as an intermediate step has a few advantages:
* alignment with more battle tested x86 implementation
* availability of NUMA emulation
* maintaining node information for not yet populated memory
Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t,
replace the current numa_add_memblk() and __node_distance() functionality
in arch_numa with the implementation based on numa_memblks, and add the
functions required by numa_emulation.
of, numa: return -EINVAL when no numa-node-id is found
Currently of_numa_parse_memory_nodes() returns 0 if no "memory" node in
the device tree contains a "numa-node-id" property. This makes
of_numa_init() return "success" even though no NUMA nodes were actually
parsed and set up. arch_numa works around this by returning an error if
numa_nodes_parsed is empty.
numa_memblks, however, would WARN() in such a case, and since it will be
used by arch_numa shortly, such a warning is not desirable.
Make sure of_numa_init() returns -EINVAL when no NUMA node information was
found in the device tree.
mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo
numa_cleanup_meminfo() moves blocks outside system RAM to
numa_reserved_meminfo and it uses 0 and PFN_PHYS(max_pfn) to determine the
memory boundaries.
Replace the memory range boundaries with more portable
memblock_start_of_DRAM() and memblock_end_of_DRAM().
x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
CPU id cannot be negative.
Making it unsigned also aligns with declarations in
include/asm-generic/numa.h used by arm64 and riscv and allows sharing numa
emulation code with these architectures.
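For reference, this makes the x86 prototypes line up with the generic
declarations (as in include/asm-generic/numa.h mentioned above):

  void numa_add_cpu(unsigned int cpu);
  void numa_remove_cpu(unsigned int cpu);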
By the time numa_emulation() is called, all physical memory is already
mapped in the direct map and there is no need to define limits for
memblock allocation.
Replace memblock_phys_alloc_range() with memblock_alloc().
The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are only
used by the NUMA emulation code, so make them local to
arch/x86/mm/numa_emulation.c.
Allocation of numa_distance uses memblock_phys_alloc_range() to limit
allocation to be below the last mapped page.
But NUMA initialization runs after the direct map is populated and there
is also code in setup_arch() that adjusts the memblock limit to reflect
how much memory is already mapped in the direct map.
Simplify the allocation of numa_distance and use plain memblock_alloc().
arch, mm: pull out allocation of NODE_DATA to generic code
Architectures that support NUMA duplicate the code that allocates
NODE_DATA on the node-local memory with slight variations in reporting of
the addresses where the memory was allocated.
Use x86 version as the basis for the generic alloc_node_data() function
and call this function in architecture specific numa initialization.
Round up node data size to SMP_CACHE_BYTES rather than to PAGE_SIZE like
x86 used to do since the bootmem era when allocation granularity was
PAGE_SIZE anyway.
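A hedged sketch of the shape of the generic helper described above (the
particular memblock call and error handling are assumptions, not quoted
from the patch):

  static void __init alloc_node_data(int nid)
  {
          const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
          void *nd = memblock_alloc_node(nd_size, SMP_CACHE_BYTES, nid);

          if (!nd)
                  panic("Cannot allocate %zu bytes for node %d data\n",
                        nd_size, nid);

          node_data[nid] = nd;
  }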
There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so
arch_alloc_nodedata() and arch_refresh_nodedata() are not needed anymore.
Replace the call to arch_alloc_nodedata() in free_area_init() with a new
helper alloc_offline_node_data(), remove arch_refresh_nodedata() and
clean up include/linux/memory_hotplug.h from the associated ifdefery.
arch, mm: move definition of node_data to generic code
Every architecture that supports NUMA defines node_data in the same way:
struct pglist_data *node_data[MAX_NUMNODES];
No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.
Add definition and declaration of node_data to generic code and drop
architecture-specific versions.
MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION
Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and sgi-ip27")
added HAVE_ARCH_NODEDATA_EXTENSION to loongson64 to silence a compilation
error that happened because loongson64 didn't define array of pg_data_t as
node_data like most other architectures did.
After the rename of __node_data to node_data, arch_alloc_nodedata() and
HAVE_ARCH_NODEDATA_EXTENSION can be dropped from loongson64.
Since it was the only user of the HAVE_ARCH_NODEDATA_EXTENSION config
option, also remove this option from arch/mips/Kconfig.
Make definition of node_data match other architectures. This will allow
pulling declaration of node_data to the generic mm code in the following
commit.
Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and sgi-ip27")
added HAVE_ARCH_NODEDATA_EXTENSION to sgi-ip27 to silence a compilation
error that happened because sgi-ip27 didn't define array of pg_data_t as
node_data like most other architectures did.
After addition of node_data array that matches other architectures and
after ensuring that offline nodes do not appear on node_possible_map, it
is safe to drop arch_alloc_nodedata() and HAVE_ARCH_NODEDATA_EXTENSION
from sgi-ip27.
Following the discussion about handling of CXL fixed memory windows on
arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
the generic code so they will be available on arm64/riscv and maybe on
loongarch sometime later.
While it could be possible to use memblock to describe CXL memory windows,
it currently lacks notion of unpopulated memory ranges and numa_memblks
does implement this.
Another reason to make numa_memblks generic is that both arch_numa (arm64
and riscv) and loongarch use trimmed copy of x86 code although there is no
fundamental reason why the same code cannot be used on all these
platforms. Having numa_memblks in mm/ will make its interaction with
ACPI and FDT more consistent and I believe will reduce maintenance burden.
And with generic numa_memblks it is (almost) straightforward to enable
NUMA emulation on arm64 and riscv.
The first 9 commits in this series are cleanups that are not strictly
related to numa_memblks.
Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks
and NUMA emulation to the generic code.
Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22
do some aftermath cleanups.
Commit 23 updates of_numa_init() to return an error if no NUMA nodes were
found in the device tree.
Commit 24 switches arch_numa to numa_memblks.
Commit 25 enables usage of phys_to_target_node() and
memory_add_physaddr_to_nid() with numa_memblks.
Commit 26 moves the description for numa=fake from x86 to admin-guide.
Kairui Song [Wed, 31 Jul 2024 06:49:21 +0000 (23:49 -0700)]
mm: swap: add an adaptive full cluster cache reclaim
Link all full clusters on one full list, and reclaim from it when the
allocation has run out of all usable clusters.
There are many reasons a folio can end up being in the swap cache while
having no swap count reference. So the best way to search for such slots
is still by iterating the swap clusters.
With the list as an LRU, iterating from the oldest cluster and keeping
them rotating is a very doable and clean way to free up potentially
unused clusters.
On any allocation failure, try to reclaim and rotate only one cluster.
This is adaptive for high order allocations, which can tolerate fallback.
So this avoids latency and gives the full cluster list a fair chance to
get reclaimed. It relieves the pressure on the fallback order-0
allocation and on subsequent high order allocations.
If the swap device is getting very full, reclaim more aggressively to
ensure no OOM will happen. This ensures order-0-heavy workloads won't go
OOM, as order 0 won't fail if any cluster still has any space.
Kairui Song [Wed, 31 Jul 2024 06:49:20 +0000 (23:49 -0700)]
mm: swap: reclaim the cached parts that got scanned
This commit implements reclaim during scan for the cluster allocator.
Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could
result in a low allocation success rate or early OOM.
So to ensure maximum allocation success rate, integrate reclaiming with
scanning. If a range of suitable swap slots is found but is fragmented
due to HAS_CACHE, just try to reclaim the slots.
Kairui Song [Wed, 31 Jul 2024 06:49:19 +0000 (23:49 -0700)]
mm: swap: add a fragment cluster list
Now the swap cluster allocator arranges the clusters in LRU style, so the
"cold" clusters staying at the head of the nonfull lists are the ones that
were used for allocation a long time ago and are still partially occupied.
So if the allocator can't find enough contiguous slots to satisfy a high
order allocation, it's unlikely that slots will be freed on them to
satisfy the allocation, at least in a short period.
As a result, nonfull cluster scanning will waste time repeatedly scanning
the unusable head of the list.
Also, multiple CPUs could contend on the same head cluster of the nonfull
list. Unlike free clusters, which are removed from the list when any CPU
starts using them, a nonfull cluster stays on the head.
So introduce a new list, the frag list: all scanned nonfull clusters will
be moved to this list, both to avoid repeated scanning and to avoid
contention.
The frag list is still used as a fallback for allocations, so if one CPU
failed to allocate one order of slots, it can still steal other CPUs'
clusters. And order 0 will favor the fragmented clusters to better
protect nonfull clusters.
If any slots on a fragment cluster are freed, move the cluster back to
the nonfull list, indicating it is worth another scan. Compared to
scanning upon freeing a slot, this keeps the scanning lazy and saves some
CPU if there are still other clusters to use.
It may seem unnecessary to keep the fragmented clusters on a list at all
if they can't be used for a specific order allocation. But this will
start to make sense once reclaim during scanning is ready.
Kairui Song [Wed, 31 Jul 2024 06:49:18 +0000 (23:49 -0700)]
mm: swap: allow cache reclaim to skip slot cache
Currently we free the reclaimed slots through the slot cache even if the
slot is required to be empty immediately. As a result, the reclaim caller
will see the slot still occupied even after a successful reclaim, and
needs to keep reclaiming until the slot cache gets flushed. This causes
ineffective or over-reclaim when swap is under stress.
So introduce a new flag allowing the slot to be emptied bypassing the slot
cache.
Kairui Song [Wed, 31 Jul 2024 06:49:17 +0000 (23:49 -0700)]
mm: swap: skip slot cache on freeing for mTHP
Currently when we are freeing mTHP folios from the swap cache, we free
them one by one and put each entry into the swap slot cache. The slot
cache is designed to reduce the overhead by batching the freeing, but
mTHP swap entries are already contiguous, so they can be batch freed
without it; going through the slot cache saves little overhead, or even
adds overhead for larger mTHP.
What's more, mTHP entries could stay in the swap cache for a while.
Contiguous swap entries are a rather rare resource, so releasing them
directly can help improve the mTHP allocation success rate when under
pressure.
Chris Li [Wed, 31 Jul 2024 06:49:15 +0000 (23:49 -0700)]
mm: swap: separate SSD allocation from scan_swap_map_slots()
Previously the SSD and HDD code shared the same swap_map scan loop in
scan_swap_map_slots(). This function is complex and its execution flow
is hard to follow.
scan_swap_map_try_ssd_cluster() can already do most of the heavy lifting
to locate the candidate swap range in the cluster. However it needs to go
back to scan_swap_map_slots() to check conflict and then perform the
allocation.
When scan_swap_map_try_ssd_cluster() failed, it still depended on the
scan_swap_map_slots() to do brute force scanning of the swap_map. When
the swapfile is large and almost full, it will take some CPU time to go
through the swap_map array.
Get rid of the cluster allocation dependency on the swap_map scan loop in
scan_swap_map_slots(). Streamline the cluster allocation code path. No
more conflict checks.
For order-0 swap entries, when the free and nonfull lists run out, it will
allocate from the higher order nonfull cluster lists.
Users should see less CPU time spent on searching the free swap slot when
swapfile is almost full.
Chris Li [Wed, 31 Jul 2024 06:49:14 +0000 (23:49 -0700)]
mm: swap: mTHP allocate swap entries from nonfull list
Track the nonfull cluster as well as the empty cluster on lists. Each
order has one nonfull cluster list.
The cluster will remember which order it was used for during new cluster
allocation.
When a cluster has free entries, add it to the nonfull[order] list. When
the free cluster list is empty, also allocate from the nonfull list of
that order.
This improves the mTHP swap allocation success rate.
There are limitations if the distribution of the numbers of different
mTHP orders changes a lot, e.g. a lot of nonfull clusters are assigned to
order A while at a later time there are a lot of order B allocations and
very few order A allocations. Currently a cluster used by order A will
not be reused by order B unless the cluster is 100% empty.
Chris Li [Wed, 31 Jul 2024 06:49:13 +0000 (23:49 -0700)]
mm: swap: swap cluster switch to double link list
Patch series "mm: swap: mTHP swap allocator base on swap cluster order",
v5.
This is the short term solution "swap cluster order" listed in my "Swap
Abstraction" discussion, slide 8, at the recent LSF/MM conference.
When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders" was
introduced, it only allocated the mTHP swap entries from the new empty
cluster list. It has a fragmentation issue reported by Barry.
The reason is that all the empty clusters have been exhausted while there
are plenty of free swap entries in clusters that are not 100% free.
Remember the swap allocation order in the cluster. Keep track of the per
order nonfull cluster list for later allocation.
This series gives the swap SSD allocation a new, separate code path from
the HDD allocation. The new allocator uses the cluster lists only and no
longer does a global scan of swap_map[] without a lock.
This streamlines the swap allocation for SSD. The code matches the
execution flow much better.
User impact: for users that allocate and free mixed-order mTHP swap, it
greatly improves the success rate of mTHP swap allocation after the
initial phase.
It also performs faster when the swapfile is close to full, because the
allocator can get a nonfull cluster from a list rather than scanning a
lot of swap_map entries.
Previously, the swap cluster used a cluster index as a pointer to
construct a custom single linked list type "swap_cluster_list". The next
cluster pointer is shared with cluster->count, which prevents putting a
non-free cluster onto a list.
Change the cluster to use the standard double linked list instead. This
allows tracking the nonfull clusters in a follow up patch. That way, it
is faster to get to the nonfull cluster of that order.
Remove the cluster getters/setters for accessing the cluster struct
members. The list operations are protected by the
swap_info_struct->lock.
Change the cluster code to use "struct swap_cluster_info *" to reference
a cluster rather than an index. That is more consistent with the list
manipulation, avoids repeatedly adding the index to the cluster_info, and
makes the code easier to understand.
Remove the "cluster next pointer is NULL" flag; the double linked list
can handle the empty list pretty well.
The "swap_cluster_info" struct is two pointer bigger, because 512 swap
entries share one swap_cluster_info struct, it has very little impact on
the average memory usage per swap entry. For 1TB swapfile, the swap
cluster data structure increases from 8MB to 24MB.
Other than the list conversion, there is no real function change in this
patch.
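A hedged before/after sketch of the conversion (field names and widths
are illustrative, not the exact struct from the patch):

  /* before: the next-cluster "pointer" was an index packed next to the
   * usage count, so a non-free cluster could not be put on any list */
  struct swap_cluster_info {
          spinlock_t lock;
          unsigned int data:24;   /* next cluster index or usage count */
          unsigned int flags:8;
  };

  /* after: a standard double linked list node, which is what makes the
   * struct "two pointers bigger" as noted above */
  struct swap_cluster_info {
          spinlock_t lock;
          unsigned int count;
          unsigned int order;
          unsigned int flags;
          struct list_head list;  /* free / nonfull[order] / full linkage */
  };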
Shakeel Butt [Wed, 14 Aug 2024 22:00:21 +0000 (15:00 -0700)]
memcg: initiate deprecation of pressure_level
The pressure_level in memcg v1 provides memory pressure notifications to
the user space. At the moment it provides notifications for three levels
of memory pressure i.e. low, medium and critical, which are defined based
on internal memory reclaim implementation details. More specifically the
ratio of scanned and reclaimed pages during a memory reclaim. However
this is not robust as there are workloads with mostly unreclaimable user
memory or kernel memory.
For v2, users can use PSI for the memory pressure status of the system or
the cgroup. Let's start the deprecation process for pressure_level and
add warnings to gather info on how the current users are using this
interface and how they can be moved to PSI.
Shakeel Butt [Wed, 14 Aug 2024 22:00:20 +0000 (15:00 -0700)]
memcg: initiate deprecation of oom_control
The oom_control provides functionality to disable memcg oom-killer,
notifications on oom-kill and reading the stats regarding oom-kills. This
interface was mainly introduced to provide functionality for userspace
oom-killers. However it is not robust enough and only supports OOM
handling in the page fault path.
For v2, the users can use the combination of memory.events notifications,
memory.high and PSI to provide userspace OOM-killing functionality.
Actually LMKD in Android and OOMd in systemd and Meta infrastructure
already use PSI in combination with other stats to implement userspace
OOM-killing.
Let's start the deprecation process for v1 and gather the info on how the
current users are using this interface and work on providing a more robust
functionality in v2.
Shakeel Butt [Wed, 14 Aug 2024 22:00:19 +0000 (15:00 -0700)]
memcg: initiate deprecation of v1 soft limit
Memcg v1 provides soft limit functionality for the best effort memory
sharing between multiple workloads on a system. It is usually triggered
through kswapd and at the moment does not reclaim kernel memory.
Memcg v2 provides more straightforward best effort (memory.low) and hard
protection (memory.min) functionalities. Let's initiate the deprecation
of soft limit from v1 and gather if v2 needs something more to move the
existing v1 users to v2 regarding soft limit.
Shakeel Butt [Wed, 14 Aug 2024 22:00:18 +0000 (15:00 -0700)]
memcg: initiate deprecation of v1 tcp accounting
Patch series "memcg: initiate deprecation of v1 features", v2.
Start the deprecation process of the memcg v1 features which we discussed
during LSFMMBPF 2024 [1]. For now add the warnings to collect the
information on how the current users are using these features. Next we
will work on providing better alternatives in v2 (if needed) and fully
deprecate these features.
Memcg v1 provides an opt-in TCP memory accounting feature. However it is
mostly unused due to its performance impact on network traffic. In v2,
TCP memory is accounted in the regular memory usage and is transparent to
the users, but they can observe the TCP memory usage through memcg stats.
Let's initiate the deprecation process of memcg v1's tcp accounting
functionality and add warnings to see whether there are any users; if
there are, collect how they are using it and plan to provide them a
better alternative in v2.
Shakeel Butt [Thu, 15 Aug 2024 05:04:50 +0000 (22:04 -0700)]
memcg: move v1 events and statistics code to v1 file
Currently the common code paths for charge commit, swapout and batched
uncharge execute v1-only code which is completely useless for v2
deployments where CONFIG_MEMCG_V1 is disabled. In addition, it is mucking
with IRQs which might be slow on some architectures. Let's move all of
this code into the v1-only code and remove it from v2-only deployments.
Shakeel Butt [Thu, 15 Aug 2024 05:04:49 +0000 (22:04 -0700)]
memcg: move mem_cgroup_charge_statistics to v1 code
There are no callers of mem_cgroup_charge_statistics() in the v2 code
base, so move it to the v1 only code and rename it to
memcg1_charge_statistics().
Shakeel Butt [Thu, 15 Aug 2024 05:04:47 +0000 (22:04 -0700)]
memcg: move v1 only percpu stats in separate struct
Patch series "memcg: further decouple v1 code from v2".
Some of the v1 code is still in the v2 code base due to v1 fields in
struct memcg_vmstats_percpu. This series decouples those fields from the
v2 struct and moves all the related code into the v1-only code base.
This patch (of 7):
At the moment struct memcg_vmstats_percpu contains two v1-only fields
which consume memory even when CONFIG_MEMCG_V1 is not enabled. In
addition, there are v1-only functions accessing them which live in the
main memcontrol source file and cannot be moved to the v1-only source
file due to these fields. Let's move these fields into their own struct.
Later patches will move the functions accessing them to the v1 source
file and only allocate these fields when CONFIG_MEMCG_V1 is enabled.
Ryan Roberts [Wed, 14 Aug 2024 02:02:47 +0000 (14:02 +1200)]
mm: override mTHP "enabled" defaults at kernel cmdline
Add thp_anon= cmdline parameter to allow specifying the default enablement
of each supported anon THP size. The parameter accepts the following
format and can be provided multiple times to configure each size:
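The exact format specification is not reproduced in this digest; as a
rough illustration (syntax assumed from the transparent hugepage admin
guide, not quoted here), a boot command line might contain:

  thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never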
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:43 +0000 (12:19 -0400)]
maple_tree: remove unneeded mas_wr_walk() in mas_store_prealloc()
Users of mas_store_prealloc() enter this function with nodes already
preallocated. This means the store type must be already set. We can then
remove the call to mas_wr_store_type() and initialize the write state to
continue the partial walk that was done when determining the store type.
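A hedged sketch of the caller-side pattern being relied on here: nodes are
preallocated up front, so the store itself never allocates (tree, index,
last and entry are placeholders):

  MA_STATE(mas, &tree, index, last);

  if (mas_preallocate(&mas, entry, GFP_KERNEL))
          return -ENOMEM;                 /* nothing was preallocated */

  /* the write cannot fail on memory now; unused preallocations are
   * released internally once the store completes */
  mas_store_prealloc(&mas, entry);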
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:42 +0000 (12:19 -0400)]
maple_tree: remove repeated sanity checks from write helper functions
These sanity checks are now redundant as they are already checked in
mas_wr_store_type(). We can remove them from mas_wr_append() and
mas_wr_node_store().
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:41 +0000 (12:19 -0400)]
maple_tree: remove node allocations from various write helper functions
These write helper functions are all called from store paths which
preallocate enough nodes that will be needed for the write. There is no
more need to allocate within the functions themselves.
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:40 +0000 (12:19 -0400)]
maple_tree: have mas_store() allocate nodes if needed
Not all users of mas_store() enter with nodes already preallocated.
Check for the MA_STATE_PREALLOC flag to decide whether to preallocate nodes
within mas_store() rather than relying on future write helper functions
to perform the allocations. This allows the write helper functions to be
simplified as they do not have to do checks to make sure there are
enough allocated nodes to perform the write.
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:38 +0000 (12:19 -0400)]
maple_tree: simplify mas_commit_b_node()
The only callers of mas_commit_b_node() are those with store type of
wr_rebalance and wr_split_store. Use mas->store_type to dispatch to the
correct helper function. This allows the removal of mas_reuse_node() as
it is no longer used.
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:37 +0000 (12:19 -0400)]
maple_tree: convert mas_insert() to preallocate nodes
By setting the store type in mas_insert(), we no longer need to use
mas_wr_modify() to determine the correct store function to use. Instead,
set the store type and call mas_wr_store_entry(). Also, pass in the
requested gfp flags to mas_insert() so they can be passed to the call to
mas_wr_preallocate().
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:36 +0000 (12:19 -0400)]
maple_tree: use store type in mas_wr_store_entry()
When storing an entry, we can read the store type that was set from a
previous partial walk of the tree. Now that the type of store is known,
select the correct write helper function to use to complete the store.
Also noinline mas_wr_spanning_store() to limit stack frame usage in
mas_wr_store_entry() as it allocates a maple_big_node on the stack.
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:33 +0000 (12:19 -0400)]
maple_tree: preallocate nodes in mas_erase()
Use mas_wr_preallocate() in mas_erase() to preallocate enough nodes to
complete the erase. Add error handling by skipping the store if the
preallocation led to an error other than lack of memory.
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:32 +0000 (12:19 -0400)]
maple_tree: remove mas_destroy() from mas_nomem()
Separate the call to mas_destroy() from mas_nomem() so we can check for
out-of-memory errors without destroying the current maple state in
mas_store_gfp(). We then add calls to mas_destroy() to the callers of
mas_nomem().