Now that the driver core can properly handle constant struct bus_type,
move the memory_tier_subsys variable to be a constant structure as well,
placing it into read-only memory which cannot be modified at runtime.
There is no real difference between the global area, and other
additionally configured CMA areas via CONFIG_CMA_AREAS that always
defaults without user input. This makes MAX_CMA_AREAS same as
CONFIG_CMA_AREAS, also incrementing its default values, thus maintaining
current default for MAX_CMA_AREAS both for UMA and NUMA systems.
All pr_debug() prints in mm/cma.c can be enabled via the standard
Makefile-based method. Besides, cma_debug_show_areas() should always be
called in the cma_alloc() failure path. This seemingly redundant config,
CONFIG_CMA_DEBUG, can be dropped without any problem.
Tiezhu Yang [Mon, 5 Feb 2024 06:09:22 +0000 (14:09 +0800)]
kasan: rename test_kasan_module_init to kasan_test_module_init
After commit f7e01ab828fd ("kasan: move tests to mm/kasan/"), the test
module file is renamed from lib/test_kasan_module.c to
mm/kasan/kasan_test_module.c. To keep the naming consistent, rename
test_kasan_module_init to kasan_test_module_init.
Tiezhu Yang [Mon, 5 Feb 2024 06:09:21 +0000 (14:09 +0800)]
kasan: docs: update descriptions about test file and module
After commit f7e01ab828fd ("kasan: move tests to mm/kasan/"), the test
file is renamed to mm/kasan/kasan_test.c and the test module is renamed to
kasan_test.ko, so update the descriptions in the document.
While at it, update the line numbers and test case numbers shown for the
failing kmalloc_large_oob_right and kmalloc_double_kzfree tests so that
they stay in sync with the current code in mm/kasan/kasan_test.c.
In order to test this patch, I instrumented the kernel with LOCKDEP and
KASAN, and ran the following tests, without any regression:
* The self test that reproduces the problem
* All mm hugetlb selftests
SUMMARY: PASS=9 SKIP=0 FAIL=0
* All libhugetlbfs tests
PASS: 0 86
FAIL: 0 0
This patch (of 2):
Currently there is a bug where a huge page can be stolen, and when the
original owner tries to fault it in, the fault fails with SIGBUS.
You can achieve that by:
1) Creating a single page
echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
* This will mark the page as reserved
3) touch the page, which causes a page fault and allocates the page
* This will move the page out of the free list.
* It will also unreserve the page, since there is no more free
page
4) madvise(MADV_DONTNEED) the page
* This will free the page, but not mark it as reserved.
5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2).
* It should fail, since there is no more available page.
* But, since the page above is not reserved, this mmap() succeeds.
6) Faulting at ptr1 will cause a SIGBUS
* it will try to allocate a huge page, but there is none
available
Fix this by restoring the reserved page if necessary.
These are the conditions for the page restore:
* The system is not using surplus pages. The goal is to reduce the
surplus usage for this case.
* The VMA has the HPAGE_RESV_OWNER flag set, and is PRIVATE. This is
safely checked using __vma_private_lock()
* The page is anonymous
Once this scenario is found, set the `hugetlb_restore_reserve` bit in
the folio. Then check whether the resv reservations need to be adjusted.
That check is done later, after the spinlock is dropped, since the
vma_xxxx_reservation() helpers might touch the file system lock.
"page_counter.h" does not need <linux/kernel.h>. <linux/limits.h> is enough
to get LONG_MAX.
Files that include page_counter.h are limited. They have been compile
tested or checked.
$ git grep page_counter\.h
include/linux/hugetlb_cgroup.h: struct page_counter hugepage[HUGE_MAX_HSTATE];
--> all files that include it have been compile tested
include/linux/memcontrol.h:#include <linux/page_counter.h>
--> <linux/kernel.h> has been added, to be safe
include/net/sock.h:#include <linux/page_counter.h>
--> already include <linux/kernel.h>
T.J. Mercier [Fri, 2 Feb 2024 23:38:54 +0000 (23:38 +0000)]
mm: memcg: use larger batches for proactive reclaim
Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
reclaim") we passed the number of pages for the reclaim request directly
to try_to_free_mem_cgroup_pages, which could lead to significant
overreclaim. After 0388536ac291 the number of pages was limited to a
maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim.
However such a small batch size caused a regression in reclaim performance
due to many more reclaim start/stop cycles inside memory_reclaim. The
restart cost is amortized over more pages with larger batch sizes, and
becomes a significant component of the runtime if the batch size is too
small.
Reclaim tries to balance nr_to_reclaim fidelity with fairness across nodes
and cgroups over which the pages are spread. As such, the bigger the
request, the bigger the absolute overreclaim error. Historic in-kernel
users of reclaim have used fixed, small sized requests to approach an
appropriate reclaim rate over time. When we reclaim a user request of
arbitrary size, use decaying batch sizes to manage error while maintaining
reasonable throughput.
MGLRU enabled - memcg LRU used
root - full reclaim       pages/sec   time (sec)
pre-0388536ac291      :    68047        10.46
post-0388536ac291     :    13742        inf
(reclaim-reclaimed)/4 :    67352        10.51
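The "(reclaim-reclaimed)/4" row refers to asking for a quarter of the still-outstanding amount on each pass. The following is a minimal user-space sketch of that decaying batch loop, not the kernel code. fake_reclaim() is a hypothetical stand-in for try_to_free_mem_cgroup_pages(), and it assumes that reclaim frees exactly what was requested and never works in units smaller than SWAP_CLUSTER_MAX.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL

/*
 * Hypothetical stand-in for try_to_free_mem_cgroup_pages(). Assumption:
 * reclaim frees exactly what was requested, but never less than
 * SWAP_CLUSTER_MAX pages per call.
 */
static unsigned long fake_reclaim(unsigned long nr_pages)
{
	return nr_pages < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : nr_pages;
}

/* Fixed small batches: one reclaim call per SWAP_CLUSTER_MAX pages. */
static unsigned long reclaim_fixed(unsigned long nr_to_reclaim,
				   unsigned long *calls)
{
	unsigned long nr_reclaimed = 0;

	assert(calls);
	*calls = 0;
	while (nr_reclaimed < nr_to_reclaim) {
		nr_reclaimed += fake_reclaim(SWAP_CLUSTER_MAX);
		(*calls)++;
	}
	return nr_reclaimed;
}

/* Decaying batches: each call asks for a quarter of what is still missing. */
static unsigned long reclaim_decaying(unsigned long nr_to_reclaim,
				      unsigned long *calls)
{
	unsigned long nr_reclaimed = 0;

	assert(calls);
	*calls = 0;
	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long batch = (nr_to_reclaim - nr_reclaimed) / 4;

		nr_reclaimed += fake_reclaim(batch);
		(*calls)++;
	}
	return nr_reclaimed;
}
```

In this toy model a 1024-page request takes 32 calls with fixed batches and far fewer with decaying ones, while the overshoot stays below one SWAP_CLUSTER_MAX batch.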
mm/memory: ignore writable bit in folio_pte_batch()
... and conditionally return to the caller if any PTE except the first
one is writable. fork() has to make sure to properly write-protect in
case any PTE is writable. Other users (e.g., page unmapping) are expected
to not care.
mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
Let's always ignore the accessed/young bit: we'll always mark the PTE as
old in our child process during fork, and upcoming users will similarly
not care.
Ignore the dirty bit only if we don't want to duplicate the dirty bit into
the child process during fork. Maybe, we could just set all PTEs in the
child dirty if any PTE is dirty. For now, let's keep the behavior
unchanged, this can be optimized later if required.
Ignore the soft-dirty bit only if the bit doesn't have any meaning in the
src vma, and similarly won't have any in the copied dst vma.
Let's implement PTE batching when consecutive (present) PTEs map
consecutive pages of the same large folio, and all other PTE bits besides
the PFNs are equal.
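The batching rule just described can be sketched in user space as follows. This is an illustration only, not folio_pte_batch() itself: the PTE layout (PFN in bits 63:12, flags in bits 11:0, bit 0 meaning "present") and all toy_* names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy PTE layout (an assumption, not any real architecture):
 * bits 63:12 hold the PFN, bits 11:0 hold flags, bit 0 means "present".
 */
#define TOY_PAGE_SHIFT   12
#define TOY_PTE_PRESENT  UINT64_C(0x1)
#define TOY_FLAGS_MASK   ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

typedef uint64_t toy_pte_t;

static uint64_t toy_pte_pfn(toy_pte_t pte)
{
	return pte >> TOY_PAGE_SHIFT;
}

static uint64_t toy_pte_flags(toy_pte_t pte)
{
	return pte & TOY_FLAGS_MASK;
}

/*
 * Count how many entries, starting at ptes[0], form one batch: every
 * entry must be present, map the PFN right after the previous one, stay
 * inside the folio (folio_end_pfn is the first PFN past it), and carry
 * exactly the same flag bits as the first entry.
 */
static int toy_pte_batch(const toy_pte_t *ptes, int max_nr,
			 uint64_t folio_end_pfn)
{
	uint64_t expected_pfn;
	int nr = 1;

	assert(max_nr >= 1 && (ptes[0] & TOY_PTE_PRESENT));
	expected_pfn = toy_pte_pfn(ptes[0]) + 1;

	while (nr < max_nr && expected_pfn < folio_end_pfn) {
		toy_pte_t pte = ptes[nr];

		if (!(pte & TOY_PTE_PRESENT))
			break;
		if (toy_pte_pfn(pte) != expected_pfn)
			break;
		if (toy_pte_flags(pte) != toy_pte_flags(ptes[0]))
			break;
		expected_pfn++;
		nr++;
	}
	return nr;
}
```

A caller would pass the number of PTEs left in the current page table and VMA as max_nr, so a batch never crosses either boundary.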
We will optimize folio_pte_batch() separately, to ignore selected PTE
bits. This patch is based on work by Ryan Roberts.
Use __always_inline for __copy_present_ptes() and keep the handling for
single PTEs completely separate from the multi-PTE case: we really want
the compiler to optimize for the single-PTE case with small folios, to not
degrade performance.
Note that PTE batching will never exceed a single page table and will
always stay within VMA boundaries.
Further, processing PTE-mapped THP that may be pinned and have
PageAnonExclusive set on at least one subpage should work as expected, but
there is room for improvement: We will repeatedly (1) detect a PTE batch
(2) detect that we have to copy a page (3) fall back and allocate a single
page to copy a single page. For now we won't care as pinned pages are a
corner case, and we should rather look into maintaining only a single
PageAnonExclusive bit for large folios.
mm/pgtable: make pte_next_pfn() independent of set_ptes()
Let's provide pte_next_pfn(), independently of set_ptes(). This allows
for using the generic pte_next_pfn() version in some arch-specific
set_ptes() implementations, and prepares for reusing pte_next_pfn() in
other contexts.
Ryan Roberts [Mon, 29 Jan 2024 12:46:35 +0000 (13:46 +0100)]
arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary
Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3.
Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when
processing PTE-mapped THPs.
This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to use
the new rmap batching functions that simplify the code and prepare for
further rmap accounting changes.
We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.
While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.
Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes for
fork() (shorter is better):
Note that these numbers are even better than the ones from v1 (verified
over multiple reboots), even though there were only minimal code changes.
Well, I removed a pte_mkclean() call for anon folios, maybe that also
plays a role.
But my experience is that fork() is extremely sensitive to code size,
inlining, ... so I suspect we'll see on other architectures rather a
change of -20% instead of -30%, and it will be easy to "lose" some of that
speedup in the future by subtle code changes.
Next up is PTE batching when unmapping. Only tested on x86-64.
Compile-tested on most other architectures.
Since the high bits [51:48] of an OA are not stored contiguously in the
PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
to the pte to get the pte with the next pfn. This works until the pfn
crosses the 48-bit boundary, at which point we overflow into the upper
attributes.
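The overflow can be illustrated with a simplified user-space model. It assumes 64K pages with 52-bit output addresses, where OA[47:16] sits in PTE bits [47:16] and OA[51:48] sits in PTE bits [15:12]; the toy_* helpers are hypothetical names for this sketch and are not the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model (assumption): 64K pages, 52-bit output addresses. */
#define TOY_PAGE_SHIFT       16
#define TOY_PAGE_SIZE        (UINT64_C(1) << TOY_PAGE_SHIFT)
/* OA[47:16] is stored in place, in PTE bits [47:16]. */
#define TOY_ADDR_LOW \
	(((UINT64_C(1) << (48 - TOY_PAGE_SHIFT)) - 1) << TOY_PAGE_SHIFT)
/* OA[51:48] is stored in PTE bits [15:12]. */
#define TOY_ADDR_HIGH        (UINT64_C(0xf) << 12)
#define TOY_ADDR_HIGH_SHIFT  36
#define TOY_ADDR_MASK        (TOY_ADDR_LOW | TOY_ADDR_HIGH)

static uint64_t toy_phys_to_pte(uint64_t phys)
{
	return (phys | (phys >> TOY_ADDR_HIGH_SHIFT)) & TOY_ADDR_MASK;
}

static uint64_t toy_pte_to_phys(uint64_t pte)
{
	return (pte & TOY_ADDR_LOW) |
	       ((pte & TOY_ADDR_HIGH) << TOY_ADDR_HIGH_SHIFT);
}

/* The fragile approach: treat the PTE as if the address were contiguous. */
static uint64_t toy_naive_next(uint64_t pte)
{
	return pte + TOY_PAGE_SIZE;
}

/* The robust approach: decode, advance by one page, re-encode. */
static uint64_t toy_robust_next(uint64_t pte)
{
	uint64_t attrs = pte & ~TOY_ADDR_MASK;

	return toy_phys_to_pte(toy_pte_to_phys(pte) + TOY_PAGE_SIZE) | attrs;
}
```

For the last page below the 48-bit boundary, the naive version carries into bit 48 of the PTE, while the decode and re-encode version lands on the expected address. Everywhere else the two agree.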
Of course one could argue (and Matthew Wilcox has :) that we will never
see a folio cross this boundary because we only allow naturally aligned
power-of-2 allocation, so this would require a half-petabyte folio. So
it's only a theoretical bug. But it's better that the code is robust
regardless.
I've implemented pte_next_pfn() as part of the fix, which is an opt-in
core-mm interface. So that is now available to the core-mm, which will be
needed shortly to support forthcoming fork()-batching optimizations.
Baolin Wang [Tue, 20 Feb 2024 06:16:31 +0000 (14:16 +0800)]
mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages
Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison
to ensure that enough freepages are isolated in isolate_freepages(),
however it just decreases the cc->nr_freepages without updating
cc->nr_migratepages in compaction_alloc(), which will waste more CPU
cycles and cause too many freepages to be isolated.
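A toy model of the two counters may help to see the effect. It is only a sketch under assumptions: struct toy_compact_control and the toy_* helpers are made up for illustration, and the "wanted" calculation simply mirrors the 'cc->nr_freepages >= cc->nr_migratepages' comparison quoted above.

```c
#include <assert.h>

struct toy_compact_control {
	unsigned long nr_freepages;	/* isolated free pages still available */
	unsigned long nr_migratepages;	/* pages still waiting for a target */
};

/* How many more free pages isolation would still go looking for. */
static unsigned long toy_freepages_wanted(const struct toy_compact_control *cc)
{
	if (cc->nr_freepages >= cc->nr_migratepages)
		return 0;
	return cc->nr_migratepages - cc->nr_freepages;
}

/* Accounting that only decrements nr_freepages when a target is taken. */
static void toy_alloc_freepages_only(struct toy_compact_control *cc)
{
	assert(cc->nr_freepages > 0);
	cc->nr_freepages--;
}

/* Accounting that also consumes one pending migration per target. */
static void toy_alloc_both(struct toy_compact_control *cc)
{
	assert(cc->nr_freepages > 0 && cc->nr_migratepages > 0);
	cc->nr_freepages--;
	cc->nr_migratepages--;
}

/* Counterpart: an unused target page is handed back. */
static void toy_free_both(struct toy_compact_control *cc)
{
	cc->nr_freepages++;
	cc->nr_migratepages++;
}
```

With 32 pages to migrate and only 27 free pages isolated so far, consuming all 27 leaves the first variant wanting 32 more free pages, while the second variant wants only the 5 that are really missing.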
So we should also update the cc->nr_migratepages when allocating or
freeing the freepages to avoid isolating excess freepages. And I can see
fewer free pages are scanned and isolated when running thpcompact on my
Arm64 server:
selftests/mm: thuge-gen: conform to TAP format output
Conform the layout, informational and status messages to TAP. No
functional change is intended other than the layout of output messages.
Also remove unneeded logging which isn't enabled. Skip a hugepage size if
it has too few free pages, to avoid unnecessary failures. For example,
some systems may not have a free 1GB hugepage. So skip the 1GB size in
this test instead of failing the entire test.
selftests/mm: mlock2-tests: conform test to TAP format output
Conform the layout, informational and status messages to TAP. No
functional change is intended other than the layout of output messages.
I've done some cleanups as well.
selftests/mm: map_populate: conform test to TAP format output
Conform the layout, informational and status messages to TAP. No
functional change is intended other than the layout of output messages.
Minor cleanups have also been included.
selftests/mm: map_fixed_noreplace: conform test to TAP format output
Patch series "conform tests to TAP format output", v2.
This patch (of 12):
Conform the layout, informational and status messages to TAP. No
functional change is intended other than the layout of output messages.
While at it, convert commenting style from // to /* */.
Current implementation of UFFDIO_MOVE fails to move zeropages and returns
EBUSY when it encounters one. We can handle them by mapping a zeropage at
the destination and clearing the mapping at the source. This is done both
for ordinary and for huge zeropages.
Daniel Gomez [Wed, 31 Jan 2024 22:51:25 +0000 (14:51 -0800)]
XArray: add cmpxchg order test
XArray multi-index entries do not keep track of the order stored once the
entry is being marked as used with cmpxchg (conditionally replaced with
NULL). Add a test to check the order is actually lost. The test also
verifies the order and entries for all the tied indexes before and after
the NULL replacement with xa_cmpxchg.
Add another entry at 1 << order that keeps the node around and the order
information for the NULL-entry after xa_cmpxchg.
Luis Chamberlain [Wed, 31 Jan 2024 22:51:24 +0000 (14:51 -0800)]
test_xarray: add tests for advanced multi-index use
Patch series "test_xarray: advanced API multi-index tests", v2.
This is a respin of the test_xarray multi-index tests [0] which use and
demonstrate the advanced API which is used by the page cache. This should
let folks more easily follow how we use multi-index to support for example
a min order later in the page cache. It also lets us grow the selftests
to mimic more of what we do in the page cache.
This patch (of 2):
The multi index selftests are great but they don't replicate how we deal
with the page cache exactly, which makes it a bit hard to follow as the
page cache uses the advanced API.
Add tests which use the advanced API, mimicking what we do in the page
cache, while at it, extend the example to do what is needed for min order
support.
mm/cma: don't treat bad input arguments for cma_alloc() as its failure
Invalid cma_alloc() input scenarios - including excess allocation request
should neither be counted as CMA_ALLOC_FAIL nor 'cma->nr_pages_failed' be
updated when applicable with CONFIG_CMA_SYSFS. This also drops 'out' jump
label which has become redundant.
Christophe Leroy [Tue, 30 Jan 2024 10:34:34 +0000 (11:34 +0100)]
powerpc,s390: ptdump: define ptdump_check_wx() regardless of CONFIG_DEBUG_WX
Following patch will use ptdump_check_wx() regardless of CONFIG_DEBUG_WX,
so define it at all times on powerpc and s390 just like other
architectures. Though keep the WARN_ON_ONCE() only when CONFIG_DEBUG_WX
is set.
All architectures using the core ptdump functionality also implement
CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
function called debug_checkwx() that is called by mark_rodata_ro(), which
is a substitute for ptdump_check_wx() when CONFIG_DEBUG_WX is set and a
no-op otherwise.
Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call
debug_checkwx() immediately after calling mark_rodata_ro() instead of
calling it at the end of every mark_rodata_ro().
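The shape of that refactor can be sketched in user space as below. The stubs and the toy_mark_readonly() caller are hypothetical stand-ins, and CONFIG_DEBUG_WX is defined by hand here only to select the enabled branch. The point is that the macro is defined once and invoked by the caller, not by each architecture's mark_rodata_ro().

```c
#include <assert.h>

/* Pretend the option is enabled; drop this line to get the no-op variant. */
#define CONFIG_DEBUG_WX 1

static int wx_checks_run;	/* how often the W^X check ran */
static int rodata_marked;	/* set once rodata has been "protected" */

/* Stub for the arch's ptdump_check_wx(); the real one walks page tables. */
static void ptdump_check_wx(void)
{
	wx_checks_run++;
}

/* Central definition, in the spirit of what the text places in linux/ptdump.h. */
#ifdef CONFIG_DEBUG_WX
#define debug_checkwx()	ptdump_check_wx()
#else
#define debug_checkwx()	do { } while (0)
#endif

/* Stub arch hook: it no longer calls debug_checkwx() itself. */
static void mark_rodata_ro(void)
{
	rodata_marked = 1;
}

/* Hypothetical generic caller: the check follows mark_rodata_ro(). */
static void toy_mark_readonly(void)
{
	mark_rodata_ro();
	debug_checkwx();
}
```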
On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX
before calling debug_checkwx(). Now the check is inside the callee
ptdump_walk_pgd_level_checkwx().
On powerpc_64, mark_rodata_ro() bails out early before calling
ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check
is now also done in ptdump_check_wx() as it is called outside
mark_rodata_ro().
Christophe Leroy [Tue, 30 Jan 2024 10:34:32 +0000 (11:34 +0100)]
arm: ptdump: rename CONFIG_DEBUG_WX to CONFIG_ARM_DEBUG_WX
Patch series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages
debugfs attribute", v2.
This series refactors CONFIG_DEBUG_WX for the 5 architectures implementing
CONFIG_GENERIC_PTDUMP.
First rename stuff in ARM which uses similar names while not implementing
CONFIG_GENERIC_PTDUMP.
Then define a generic version of debug_checkwx() that calls
ptdump_check_wx() when CONFIG_DEBUG_WX is set. Call it immediately after
calling mark_rodata_ro() instead of calling it at the end of every
mark_rodata_ro().
Then implement a debugfs attribute that can be used to trigger a W^X test
at any time, regardless of CONFIG_DEBUG_WX.
This patch (of 5):
CONFIG_DEBUG_WX is a core option defined in mm/Kconfig.debug.
To avoid any future conflict, rename the ARM version to CONFIG_ARM_DEBUG_WX.
Gregory Price [Fri, 2 Feb 2024 17:02:38 +0000 (12:02 -0500)]
mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq
In the event of rebind, pol->nodemask can change at the same time as an
allocation occurs. We can detect this with tsk->mems_allowed_seq and
prevent a miscount or an allocation failure from occurring.
The same thing happens in the allocators to detect failure, but this can
prevent spurious failures in a much smaller critical section.
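The retry pattern can be sketched as a single-threaded toy. Everything here is an assumption made for illustration: struct toy_task, the plain counter used as the sequence, and the racer hook that simulates a rebind landing in the middle of a read. The kernel's read_mems_allowed_begin() and read_mems_allowed_retry() are built on a real seqcount, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
	unsigned int mems_allowed_seq;	/* bumped by every rebind */
	unsigned long nodemask;		/* bit n set: node n is allowed */
};

static void toy_rebind(struct toy_task *t, unsigned long new_mask)
{
	t->nodemask = new_mask;
	t->mems_allowed_seq++;
}

/* Lowest allowed node, or -1 if the mask is empty. */
static int toy_first_node(unsigned long mask)
{
	for (int nid = 0; nid < (int)(8 * sizeof(mask)); nid++)
		if (mask & (1UL << nid))
			return nid;
	return -1;
}

/* Example racer: rebinds the task to node 1 only. */
static void toy_racing_rebind(struct toy_task *t)
{
	toy_rebind(t, 1UL << 1);
}

/*
 * Pick a node from the task's nodemask. If the sequence changed while
 * the mask was being used, the result may be stale, so try again.
 * racer may be NULL; it runs once, during the first pass, to simulate
 * a concurrent rebind. *passes reports how many attempts were needed.
 */
static int toy_pick_node(struct toy_task *t,
			 void (*racer)(struct toy_task *), int *passes)
{
	unsigned int seq;
	int nid;

	*passes = 0;
	do {
		unsigned long snapshot;

		seq = t->mems_allowed_seq;	/* like read_mems_allowed_begin() */
		snapshot = t->nodemask;
		if (racer && *passes == 0)
			racer(t);
		nid = toy_first_node(snapshot);
		(*passes)++;
	} while (seq != t->mems_allowed_seq);	/* like read_mems_allowed_retry() */

	return nid;
}
```

The critical section covers only the node selection, which is why a retry here is much cheaper than failing and retrying a whole allocation.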
Gregory Price [Fri, 2 Feb 2024 17:02:37 +0000 (12:02 -0500)]
mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be a wise option.
However, if those NUMA nodes consist of different types of memory such as
socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
interleave policy does not optimally distribute data to make use of their
different bandwidth characteristics.
Instead, interleave is more effective when the allocation policy follows
each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes. Weighted interleave
allows for proportional distribution of memory across multiple NUMA nodes,
preferably apportioned to match the bandwidth of each node.
For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight
distribution is (2:1).
Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/
For now, the default value of all nodes will be `1`, which matches the
behavior of standard 1:1 round-robin interleave. An extension will be
added in the future to allow default values to be registered at kernel and
device bringup time.
The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on node0
for every 1 page allocated on node1.
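The (2,1) example can be written as a small counter-based sketch. The struct and function names are hypothetical, the two-node limit is an assumption, and the il_prev and il_weight fields only imitate the task fields of the same name.

```c
#include <assert.h>

#define TOY_NR_NODES 2

struct toy_il_task {
	int il_prev;		/* node previously allocated from */
	unsigned int il_weight;	/* allocations left on that node this round */
};

/*
 * Return the node for the next page. Once the current node's weight is
 * used up, move to the next node and load that node's weight.
 */
static int toy_weighted_interleave_next(struct toy_il_task *t,
					const unsigned int weights[TOY_NR_NODES])
{
	if (t->il_weight == 0) {
		t->il_prev = (t->il_prev + 1) % TOY_NR_NODES;
		t->il_weight = weights[t->il_prev];
		assert(t->il_weight > 0);
	}
	t->il_weight--;
	return t->il_prev;
}
```

Starting with il_prev set to the last node and il_weight set to 0, weights (2,1) yield the node sequence 0, 0, 1, 0, 0, 1 and so on.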
The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
Some high level notes about the pieces of weighted interleave:
current->il_prev:
Tracks the node previously allocated from.
current->il_weight:
The active weight of the current node (current->il_prev)
When this reaches 0, current->il_prev is set to the next node
and current->il_weight is set to the next weight.
weighted_interleave_nodes:
Counts the number of allocations as they occur, and applies the
weight for the current node. When the weight reaches 0, switch
to the next node. Operates only on task->mempolicy.
weighted_interleave_nid:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the node based on the given index.
Operates on VMA policies.
bulk_array_weighted_interleave:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the number of "interleave rounds" as
well as any delta ("partial round"). Calculates the number of
pages for each node and allocates them.
If a node was scheduled for interleave via interleave_nodes, the
current weight will be allocated first.
Operates only on the task->mempolicy.
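The index-based lookup and the "rounds plus partial round" split described in these notes can be sketched as follows. The function names and the plain weight array are assumptions for illustration. Unlike the bulk allocator described above, this sketch always starts its partial round at the first node and ignores any weight left over from an earlier allocation.

```c
#include <assert.h>

static unsigned long toy_total_weight(const unsigned int *weights, int nr_nodes)
{
	unsigned long total = 0;

	for (int nid = 0; nid < nr_nodes; nid++)
		total += weights[nid];
	return total;
}

/* Map an interleave index onto a node, walking the weights in order. */
static int toy_weighted_interleave_nid(const unsigned int *weights,
				       int nr_nodes, unsigned long ilx)
{
	unsigned long total = toy_total_weight(weights, nr_nodes);
	unsigned long target;
	int nid = 0;

	assert(total > 0);
	target = ilx % total;
	while (target >= weights[nid]) {
		target -= weights[nid];
		nid++;
	}
	return nid;
}

/*
 * Split nr_pages across nodes: every full round gives each node its
 * weight, and the remainder (the partial round) is handed out in node
 * order until it runs out.
 */
static void toy_bulk_split(const unsigned int *weights, int nr_nodes,
			   unsigned long nr_pages, unsigned long *per_node)
{
	unsigned long total = toy_total_weight(weights, nr_nodes);
	unsigned long rounds, delta;

	assert(total > 0);
	rounds = nr_pages / total;
	delta = nr_pages % total;
	for (int nid = 0; nid < nr_nodes; nid++) {
		unsigned long take = delta < weights[nid] ? delta : weights[nid];

		per_node[nid] = rounds * weights[nid] + take;
		delta -= take;
	}
}
```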
One piece of complexity is the interaction between a recent refactor which
split the logic to acquire the "ilx" (interleave index) of an allocation
and the actual application of the interleave. If a call to
alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
policy - violating the description above.
An inspection of all callers of alloc_pages_mpol() shows that all external
callers set ilx to `0`, an index value, or will call get_vma_policy() to
acquire the ilx.
For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the
`weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
requirements (task/vma respectively).
Rakie Kim [Fri, 2 Feb 2024 17:02:35 +0000 (12:02 -0500)]
mm/mempolicy: implement the sysfs-based weighted_interleave interface
Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension", v5.
Weighted interleave is a new interleave policy intended to make use of
heterogeneous memory environments appearing with CXL.
The existing interleave mechanism does an even round-robin distribution of
memory across all nodes in a nodemask, while weighted interleave
distributes memory across nodes according to a provided weight. (Weight =
# of page allocations per round)
Weighted interleave is intended to reduce average latency when bandwidth
is pressured - therefore increasing total throughput.
In other words: It allows greater use of the total available bandwidth in
a heterogeneous hardware environment (different hardware provides
different bandwidth capacity).
As bandwidth is pressured, latency increases - first linearly and then
exponentially. By keeping bandwidth usage distributed according to
available bandwidth, we therefore can reduce the average latency of a
cacheline fetch.
A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
From the article:
```
Constant region:
The latency response is fairly constant for the first 40%
of the sustained bandwidth.
Linear region:
In between 40% to 80% of the sustained bandwidth, the
latency response increases almost linearly with the bandwidth
demand of the system due to contention overhead by numerous
memory requests.
Exponential region:
Between 80% to 100% of the sustained bandwidth, the memory
latency is dominated by the contention latency which can be
as much as twice the idle latency or more.
Maximum sustained bandwidth :
Is 65% to 75% of the theoretical maximum bandwidth.
```
As a general rule of thumb:
* If bandwidth usage is low, latency does not increase. It is
optimal to place data in the nearest (lowest latency) device.
* If bandwidth usage is high, latency increases. It is optimal
to place data such that bandwidth use is optimized per-device.
This is the top line goal: Provide a user a mechanism to target using the
"maximum sustained bandwidth" of each hardware component in a heterogeneous
memory system.
For example, the stream benchmark demonstrates that 1:1 (default)
interleave is actively harmful, while weighted interleave can be
beneficial. Default interleave distributes data such that too much
pressure is placed on devices with lower available bandwidth.
Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependent)
Targeted weights : +2.5% to +4% (consistently better than DRAM)
Global means the task-policy was set (set_mempolicy), while targeted means
VMA policies were set (mbind2). We see weighted interleave is not always
beneficial when applied globally, but is always beneficial when applied to
bandwidth-driving memory regions.
There are 4 patches in this set:
1) Implement system-global interleave weights as sysfs extension
in mm/mempolicy.c. These weights are RCU protected, and a
default weight set is provided (all weights are 1 by default).
In future work, we intend to expose an interface for HMAT/CDAT
code to set reasonable default values based on the memory
configuration of the system discovered at boot/hotplug.
2) A mild refactor of some interleave-logic for re-use in the
new weighted interleave logic.
3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind
4) Protect interleave logic (weighted and normal) with the
mems_allowed seq cookie. If the nodemask changes while
accessing it during a rebind, just retry the access.
Included below are some performance and LTP test information,
and a sample numactl branch which can be used for testing.
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
= version history
v5:
- style fixes
- mems_allowed cookie protection to detect rebind issues,
prevents spurious allocation failures and/or mis-allocations
- sparse warning fixes related to __rcu on local variables
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price
Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependent)
mbind2 weights : +2.5% to +4% (consistently better than DRAM)
Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality but
did not adjust tests for weighting. Basically the weights were set to 1,
which is the default, and it should behave the same as MPOL_INTERLEAVE if
logic is correct.
=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master
command: numactl -w --interleave=0,1 ./eatmem
result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct
result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct
result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct
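#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* One large allocation, touched so that every page is faulted in. */
	char *mem = malloc(1024*1024*256);

	if (!mem)
		return 1;
	memset(mem, 1, 1024*1024*256);

	/* Many page-sized allocations, deliberately never freed. */
	for (int i = 0; i < ((1024*1024*256)/4096); i++) {
		mem = malloc(4096);
		if (!mem)
			return 1;
		mem[0] = 1;
	}
	printf("done\n");
	/* Wait for input so the memory placement can be inspected. */
	getchar();
	return 0;
}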
This patch (of 4):
This patch provides a way to set interleave weight information under sysfs
at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
SeongJae Park [Tue, 30 Jan 2024 01:35:48 +0000 (17:35 -0800)]
Docs/translations/damon/usage: update for monitor_on renaming
Update DAMON debugfs interface sections on the translated usage documents
to reflect the fact that 'monitor_on' file has been renamed to
'monitor_on_DEPRECATED'.
SeongJae Park [Tue, 30 Jan 2024 01:35:46 +0000 (17:35 -0800)]
mm/damon/dbgfs: rename monitor_on file to monitor_on_DEPRECATED
Kernel builders could silently enable CONFIG_DAMON_DBGFS_DEPRECATED.
Users who manually check the files under the DAMON debugfs directory could
notice the deprecation owing to the 'DEPRECATED' DAMON debugfs file, but
there could be users who don't manually check the files.
Make the deprecation impossible to ignore in that case by renaming
'monitor_on' file, which is essential for real use of DAMON on runtime, to
'monitor_on_DEPRECATED'. Still users who control DAMON via only
user-space tool could ignore the deprecation, but that's what the tool
developers should take care of. DAMON user-space tool, damo, has also
made a change[1] for the purpose.
[1] commit 935dae76f2aee ("_damon_args: Rename --damon_interface to
--damon_interface_DEPRECATED") of https://github.com/awslabs/damo
SeongJae Park [Tue, 30 Jan 2024 01:35:45 +0000 (17:35 -0800)]
selftets/damon: prepare for monitor_on file renaming
Following change will rename 'monitor_on' DAMON debugfs file to
'monitor_on_DEPRECATED', to make the deprecation unignorable in runtime.
Since it could make DAMON selftests fail and disturb future bisects,
update DAMON selftests to support the change.
SeongJae Park [Tue, 30 Jan 2024 01:35:43 +0000 (17:35 -0800)]
mm/damon/dbgfs: make debugfs interface deprecation message a macro
DAMON debugfs interface deprecation message is written twice, once for the
warning, and again for DEPRECATED file's read output. De-duplicate those
by defining the message as a macro and reuse.
SeongJae Park [Tue, 30 Jan 2024 01:35:42 +0000 (17:35 -0800)]
mm/damon/dbgfs: implement deprecation notice file
Implement a read-only file for DAMON debugfs interface deprecation notice,
to let users who manually read/write the DAMON debugfs files from their
shell command line easily notice the fact.
SeongJae Park [Tue, 30 Jan 2024 01:35:41 +0000 (17:35 -0800)]
mm/damon: rename CONFIG_DAMON_DBGFS to DAMON_DBGFS_DEPRECATED
DAMON debugfs interface is deprecated. The fact has been documented by
commit 5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs
interface deprecation notice"). Commit 620932cd2852 ("mm/damon/dbgfs:
print DAMON debugfs interface deprecation message") further started
printing a warning message when users still use it. Many people don't
read documentation or kernel log, though.
Make the deprecation harder to ignore using the approach of commit
eb07c4f39c3e ("mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED").
'make oldconfig' with 'CONFIG_DAMON_DBGFS=y' will get a new prompt with
the explicit deprecation notice on the name. 'make olddefconfig' with
'CONFIG_DAMON_DBGFS=y' will result in not building DAMON debugfs
interface. If there is a real user of DAMON debugfs interface, they will
complain about the change to the builder.
SeongJae Park [Tue, 30 Jan 2024 01:35:40 +0000 (17:35 -0800)]
Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints example
Patch series "mm/damon: make DAMON debugfs interface deprecation
unignorable".
DAMON debugfs interface was deprecated in February 2023, by commit
5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs
interface deprecation notice"). Make the fact hard to ignore
by removing an example usage from the document (patch 1), renaming the
config (patch 2), adding a deprecation notice file to the debugfs
directory (patches 3-5), and renaming the debugfs file that essnetial to
be used for real use of DAMON (patches 6-9).
This patch (of 9):
The DAMON tracepoints example in the DAMON usage document uses the DAMON
debugfs interface, which is deprecated. Use its alternative, the DAMON
sysfs interface.
Johannes Weiner [Tue, 30 Jan 2024 01:36:47 +0000 (20:36 -0500)]
mm: zswap: function ordering: pool refcounting
Move pool refcounting functions into the pool section. First the
destroy functions, then the get and put which use them.
__zswap_pool_empty() has an upward reference to the global
zswap_pools, to sanity check it's not the currently active pool that's
being freed. That gets the forward decl for zswap_pool_current().
This puts the get and put function above all callers, so kill the
forward decls as well.
Johannes Weiner [Tue, 30 Jan 2024 01:36:46 +0000 (20:36 -0500)]
mm: zswap: function ordering: pool alloc & free
The function ordering in zswap.c is a little chaotic, which requires
jumping in unexpected directions when following related code. This is
a series of patches that brings the file into the following order:
Johannes Weiner [Tue, 30 Jan 2024 01:36:45 +0000 (20:36 -0500)]
mm: zswap: simplify zswap_invalidate()
The branching is awkward and duplicates code. The comment about
writeback is also misleading: yes, the entry might have been written
back. Or it might have never been stored in zswap to begin with due to
a rejection - zswap_invalidate() is called on all exiting swap entries.
Johannes Weiner [Tue, 30 Jan 2024 01:36:44 +0000 (20:36 -0500)]
mm: zswap: further cleanup zswap_store()
- Remove dupentry, reusing entry works just fine.
- Rename pool to shrink_pool, as this one actually is confusing.
- Remove page, use folio_nid() and kmap_local_folio() directly.
- Set entry->swpentry in a common path.
- Move value and src to local scope of use.
Johannes Weiner [Tue, 30 Jan 2024 01:36:43 +0000 (20:36 -0500)]
mm: zswap: break out zwap_compress()
zswap_store() is long and mixes work at the zswap layer with work at
the backend and compression layer. Move compression & backend work to
zswap_compress(), mirroring zswap_decompress().
The problem is that the entry in the lru list can't protect the tree from
being freed by swapoff, and the entry can also be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead of referencing
the tree, then checking for an invalidate race under the tree lock; only
after that can we safely deref the entry. Note we can't deref the entry or
tree anymore after we unlock the folio, since we depend on the locked
folio to hold off swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry().
We only use the copied swpentry on the stack to allocate the swap cache,
and if it returns with the folio locked we can reference the tree safely.
Then we can check for an invalidate race under the tree lock; the rest is
much the same as zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore; instead we rotate the entry at the beginning.
It will be unlinked and freed when invalidated if writeback succeeds.
Another change is that we no longer update the memcg nr_zswap_protected in
the -ENOMEM and -EEXIST cases. The -EEXIST case means we raced with swapin
or a concurrent shrinker action. Since swapin has already updated the memcg
nr_zswap_protected, we don't need to double count here. For a concurrent
shrinker, the folio will be written back and freed anyway. The -ENOMEM
case is extremely rare and doesn't happen spuriously either, so don't
bother distinguishing this case.
Yosry Ahmed [Fri, 26 Jan 2024 08:06:44 +0000 (08:06 +0000)]
x86/mm: clarify "prev" usage in switch_mm_irqs_off()
In the x86 implementation of switch_mm_irqs_off(), we do not use the
"prev" argument passed in by the caller; we exclusively use "real_prev",
which is cpu_tlbstate.loaded_mm. This is not obvious at first sight.
Furthermore, a comment describes a condition that happens when called with
prev == next, but this should not affect the function in any way since
prev is unused. Apparently, the comment is intended to clarify why we
don't rely on prev == next to decide whether we need to update CR3, but
again, it is not obvious. The comment also references the fact that
leave_mm() calls with prev == NULL and tsk == NULL, but this also
shouldn't matter because prev is unused and tsk is only used in one
function which has a NULL check.
Clarify things by renaming (prev -> unused) and (real_prev -> prev), also
move and rewrite the comment as an explanation for why we don't rely on
"prev" supplied by the caller in x86 code and use our own. Hopefully this
makes reading the code easier.
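The renaming can be pictured with a heavily simplified userspace model. This is an assumption-laden sketch, not the x86 code: example_loaded_mm stands in for cpu_tlbstate.loaded_mm, the struct and function names are hypothetical, and the real function decides about CR3 updates using more state than a pointer comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct example_mm {
	int id;
};

/* Stand-in for cpu_tlbstate.loaded_mm: what this CPU really has loaded. */
static struct example_mm *example_loaded_mm;

/*
 * Returns true when the simulated page-table root has to be switched.
 * The first argument is deliberately named "unused": the decision is
 * based only on the CPU's own record of the loaded mm.
 */
bool example_switch_mm(struct example_mm *unused, struct example_mm *next)
{
	struct example_mm *prev = example_loaded_mm;

	(void)unused;
	assert(next != NULL);
	if (prev == next)
		return false;
	example_loaded_mm = next;
	return true;
}
```

In this model a caller passing a stale or NULL first argument gets the same result as one passing the correct value, which is the property the rename makes visible.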
Huang Ying [Fri, 26 Jan 2024 08:19:44 +0000 (16:19 +0800)]
mm and cache_info: remove unnecessary CPU cache info update
For each CPU hotplug event, we update the per-CPU data slice size and the
corresponding PCP configuration for every online CPU to keep the
implementation simple. But Kyle reported that this takes tens of seconds
during boot on a machine with 34 zones and 3840 CPUs.
So, in this patch, for each CPU hotplug event, we only update the per-CPU
data slice size and the corresponding PCP configuration for the CPUs that
share caches with the hotplugged CPU. With the patch, the system boot time
is reduced by 67 seconds on the machine.
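The difference between the two strategies can be sketched in userspace C. This is only a model under stated assumptions: CPUs are bits in a 32-bit mask, and struct example_cpu and both functions are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_NR_CPUS 8

/* Hypothetical per-CPU record, not a kernel structure. */
struct example_cpu {
	uint32_t shared_cache_mask;	/* bit n set: CPU n shares a cache with this CPU */
	unsigned int refreshes;		/* how often this CPU's settings were recomputed */
};

/* Old behaviour: recompute the settings of every CPU in online_mask. */
unsigned int example_refresh_all(struct example_cpu *cpus, uint32_t online_mask)
{
	unsigned int cpu, updated = 0;

	for (cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++) {
		if (online_mask & (UINT32_C(1) << cpu)) {
			cpus[cpu].refreshes++;
			updated++;
		}
	}
	return updated;
}

/* New behaviour: only CPUs that share a cache with the hotplugged CPU. */
unsigned int example_refresh_sharing(struct example_cpu *cpus,
				     uint32_t online_mask,
				     unsigned int hotplugged)
{
	assert(hotplugged < EXAMPLE_NR_CPUS);
	return example_refresh_all(cpus,
				   online_mask & cpus[hotplugged].shared_cache_mask);
}
```

With N CPUs coming online one by one, the first variant does work proportional to N per event, while the second is bounded by the size of a cache-sharing group.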
T.J. Mercier [Fri, 26 Jan 2024 21:19:25 +0000 (21:19 +0000)]
mm: memcg: don't periodically flush stats when memcg is disabled
The root memcg is onlined even when memcg is disabled. When it's onlined
a 2 second periodic stat flush is started, but no stat flushing is
required when memcg is disabled because there can be no child memcgs.
Most calls to flush memcg stats are avoided when memcg is disabled as a
result of the mem_cgroup_disabled check added in 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing"), but the periodic flushing started in
mem_cgroup_css_online is not. Skip it.
Breno Leitao [Fri, 5 Jan 2024 15:54:19 +0000 (07:54 -0800)]
selftests/mm: new test that steals pages
This test stresses the race between madvise(DONTNEED), a page fault
and a parallel huge page mmap, which should fail due to the lack of an
available page for mapping.
This test case must run on a system with one and only one huge page
available.
During setup, the test allocates the only available page, and starts
three threads:
- thread 1:
* madvise(MADV_DONTNEED) on the allocated huge page
- thread 2:
* Write to the allocated huge page
- thread 3:
* Tries to allocate (steal) an extra huge page (which is not
available)
Thread 3 should never succeed in the allocation, since the only huge
page was never unmapped, and should be reserved.
Touching the old page after thread 3's allocation will raise a SIGBUS.
mm: kmsan: remove runtime checks from kmsan_unpoison_memory()
Similarly to what's been done in commit 85716a80c16d ("kmsan: allow using
__msan_instrument_asm_store() inside runtime"), it should be safe to call
kmsan_unpoison_memory() from within the runtime, as it does not allocate
memory or take locks. Remove the redundant runtime checks.
This should fix false positives seen with CONFIG_DEBUG_LIST=y when
the non-instrumented lib/stackdepot.c failed to unpoison the memory
chunks later checked by the instrumented lib/list_debug.c.
Also replace the implementation of kmsan_unpoison_entry_regs() with
a call to kmsan_unpoison_memory().
Vishal Verma [Wed, 24 Jan 2024 20:03:50 +0000 (12:03 -0800)]
dax: add a sysfs knob to control memmap_on_memory behavior
Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.
The default memmap_on_memory setting for dax devices originating via pmem
or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above defaults
if needed.
In preparation for adding sysfs ABI to toggle memmap_on_memory semantics
for drivers adding memory, export the mhp_supports_memmap_on_memory()
helper. This allows drivers to check if memmap_on_memory support is
available before trying to request it, and display an appropriate
message if it isn't available. As part of this, remove the size argument
to this - with recent updates to allow memmap_on_memory for larger
ranges, and the internal splitting of altmaps into respective memory
blocks, the size argument is meaningless.
Vishal Verma [Wed, 24 Jan 2024 20:03:48 +0000 (12:03 -0800)]
Documentatiion/ABI: add ABI documentation for sys-bus-dax
Add the missing sysfs ABI documentation for the device DAX subsystem.
Various ABI attributes under this have been present since v5.1, and more
have been added over time. In preparation for adding a new attribute,
add this file with the historical details.
Vishal Verma [Wed, 24 Jan 2024 20:03:47 +0000 (12:03 -0800)]
dax/bus.c: replace several sprintf() with sysfs_emit()
There are several places where drivers/dax/bus.c uses sprintf() to print
sysfs data. Since a sysfs_emit() helper is available specifically for
this purpose, replace all the sprintf() usage for sysfs with sysfs_emit()
in this file.
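To show why a dedicated helper is preferable, here is a userspace approximation. In the kernel, sysfs_emit(buf, fmt, ...) formats into the page-sized buffer that sysfs hands to a show() callback and never writes past it, unlike a bare sprintf(). The names example_emit() and example_size_show() and the fixed 4096-byte page size are assumptions of this sketch, not kernel definitions.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define EXAMPLE_PAGE_SIZE 4096

/* Bounded formatter: returns the number of characters actually stored. */
int example_emit(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(buf != NULL && fmt != NULL);
	va_start(args, fmt);
	len = vsnprintf(buf, EXAMPLE_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	if (len >= EXAMPLE_PAGE_SIZE)
		len = EXAMPLE_PAGE_SIZE - 1;	/* output was truncated */
	return len;
}

/* Shape of a show() callback after the conversion. */
int example_size_show(char *buf, unsigned long long size)
{
	return example_emit(buf, "%llu\n", size);
}
```

The caller no longer has to reason about whether the formatted value fits, because the bound lives in the helper.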
Vishal Verma [Wed, 24 Jan 2024 20:03:46 +0000 (12:03 -0800)]
dax/bus.c: replace driver-core lock usage by a local rwsem
Patch series "Add DAX ABI for memmap_on_memory", v7.
This series adds sysfs ABI to control memmap_on_memory behavior for DAX
devices.
Patch 1 replaces incorrect device_lock() usage with a local rwsem - this
was identified during review.
Patch 2 is also a preparatory patch that replaces sprintf() for sysfs
operations with sysfs_emit().
Patch 3 adds the missing documentation for the sysfs ABI for DAX regions
and DAX devices.
Patch 4 exports mhp_supports_memmap_on_memory().
Patch 5 adds the new ABI for toggling memmap_on_memory semantics for dax
devices.
This patch (of 5):
The dax driver incorrectly used driver-core device locks to protect
internal dax region and dax device configuration structures. Replace the
device lock usage with a local rwsem, one each for dax region
configuration and dax device configuration. As a result of this
conversion, no device_lock() usage remains in dax/bus.c.