Git Repo - linux.git/log

]> Git Repo - linux.git/log

Daniel Gomez [Wed, 31 Jan 2024 22:51:25 +0000 (14:51 -0800)]

XArray: add cmpxchg order test

XArray multi-index entries do not keep track of the order stored once the
entry is being marked as used with cmpxchg (conditionally replaced with
NULL). Add a test to check the order is actually lost. The test also
verifies the order and entries for all the tied indexes before and after
the NULL replacement with xa_cmpxchg.

Add another entry at 1 << order that keeps the node around and the order
information for the NULL-entry after xa_cmpxchg.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Gomez <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Luis Chamberlain [Wed, 31 Jan 2024 22:51:24 +0000 (14:51 -0800)]

test_xarray: add tests for advanced multi-index use

Patch series "test_xarray: advanced API multi-index tests", v2.

This is a respin of the test_xarray multi-index tests [0] which use and
demonstrate the advanced API which is used by the page cache. This should
let folks more easily follow how we use multi-index to support for example
a min order later in the page cache. It also lets us grow the selftests
to mimic more of what we do in the page cache.

This patch (of 2):

The multi index selftests are great but they don't replicate how we deal
with the page cache exactly, which makes it a bit hard to follow as the
page cache uses the advanced API.

Add tests which use the advanced API, mimicking what we do in the page
cache, while at it, extend the example to do what is needed for min order
support.

[[email protected]: fix soft lockup for advanced-api tests]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: s/i/loops/, make non-static]
[[email protected]: restore static storage for loop counter]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Tested-by: Daniel Gomez <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Anshuman Khandual [Thu, 1 Feb 2024 02:37:14 +0000 (08:07 +0530)]

mm/cma: don't treat bad input arguments for cma_alloc() as its failure

Invalid cma_alloc() input scenarios - including excess allocation request
should neither be counted as CMA_ALLOC_FAIL nor 'cma->nr_pages_failed' be
updated when applicable with CONFIG_CMA_SYSFS. This also drops 'out' jump
label which has become redundant.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Christophe Leroy [Tue, 30 Jan 2024 10:34:36 +0000 (11:34 +0100)]

mm: ptdump: add check_wx_pages debugfs attribute

Add a readable attribute in debugfs to trigger a W^X pages check at any
time.

To trigger the test, just read /sys/kernel/debug/check_wx_pages It will
report FAILED if the test failed, SUCCESS otherwise.

Detailed result is provided into dmesg.

Link: https://lkml.kernel.org/r/e947fb1a9f3f5466344823e532d343ff194ae03d.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V (IBM)" <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Phong Tran <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Christophe Leroy [Tue, 30 Jan 2024 10:34:35 +0000 (11:34 +0100)]

mm: ptdump: have ptdump_check_wx() return bool

Have ptdump_check_wx() return true when the check is successful or false
otherwise.

[[email protected]: fix a couple of build issues (x86_64 allmodconfig)]
Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V (IBM)" <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Phong Tran <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Christophe Leroy [Tue, 30 Jan 2024 10:34:34 +0000 (11:34 +0100)]

powerpc,s390: ptdump: define ptdump_check_wx() regardless of CONFIG_DEBUG_WX

Following patch will use ptdump_check_wx() regardless of CONFIG_DEBUG_WX,
so define it at all times on powerpc and s390 just like other
architectures. Though keep the WARN_ON_ONCE() only when CONFIG_DEBUG_WX
is set.

Link: https://lkml.kernel.org/r/07bfb04c7fec58e84413e91d2533581be357a696.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V (IBM)" <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Phong Tran <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Christophe Leroy [Tue, 30 Jan 2024 10:34:33 +0000 (11:34 +0100)]

arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WX

All architectures using the core ptdump functionality also implement
CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
function called debug_checkwx() that is called by mark_rodata_ro(), which
is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a
no-op otherwise.

Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call
debug_checkwx() immediately after calling mark_rodata_ro() instead of
calling it at the end of every mark_rodata_ro().

On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX
before calling debug_checkwx(). Now the check is inside the callee
ptdump_walk_pgd_level_checkwx().

On powerpc_64, mark_rodata_ro() bails out early before calling
ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check
is now also done in ptdump_check_wx() as it is called outside
mark_rodata_ro().

Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V (IBM)" <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Phong Tran <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Christophe Leroy [Tue, 30 Jan 2024 10:34:32 +0000 (11:34 +0100)]

arm: ptdump: rename CONFIG_DEBUG_WX to CONFIG_ARM_DEBUG_WX

Patch series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages
debugfs attribute", v2.

This series refactors CONFIG_DEBUG_WX for the 5 architectures implementing
CONFIG_GENERIC_PTDUMP

First rename stuff in ARM which uses similar names while not implementing
CONFIG_GENERIC_PTDUMP.

Then define a generic version of debug_checkwx() that calls
ptdump_check_wx() when CONFIG_DEBUG_WX is set. Call it immediately after
calling mark_rodata_ro() instead of calling it at the end of every
mark_rodata_ro().

Then implement a debugfs attribute that can be used to trigger a W^X test
at anytime and regardless of CONFIG_DEBUG_WX

This patch (of 5):

CONFIG_DEBUG_WX is a core option defined in mm/Kconfig.debug

To avoid any future conflict, rename ARM version into CONFIG_ARM_DEBUG_WX.

Link: https://lore.kernel.org/lkml/20200422152656.GF676@willie-the-truck/T/#m802eaf33efd6f8d575939d157301b35ac0d4a64f
Link: https://github.com/KSPP/linux/issues/35
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/fa297aa90caeb61eee2b70c6c5897a2ab58a9562.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V (IBM)" <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Phong Tran <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Gregory Price [Fri, 2 Feb 2024 17:02:38 +0000 (12:02 -0500)]

mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq

In the event of rebind, pol->nodemask can change at the same time as an
allocation occurs. We can detect this with tsk->mems_allowed_seq and
prevent a miscount or an allocation failure from occurring.

The same thing happens in the allocators to detect failure, but this can
prevent spurious failures in a much smaller critical section.

[[email protected]: weighted interleave checks wrong parameter]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gregory Price <[email protected]>
Suggested-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hasan Al Maruf <[email protected]>
Cc: Honggyu Kim <[email protected]>
Cc: Hyeongtak Ji <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rakie Kim <[email protected]>
Cc: Ravi Jonnalagadda <[email protected]>
Cc: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Gregory Price [Fri, 2 Feb 2024 17:02:37 +0000 (12:02 -0500)]

mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such as
socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
interleave policy does not optimally distribute data to make use of their
different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows
each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes.  Weighted interleave
allows for proportional distribution of memory across multiple numa nodes,
preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight
distribution is (2:1).

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches the
behavior of standard 1:1 round-robin interleave.  An extension will be
added in the future to allow default values to be registered at kernel and
device bringup time.

The policy allocates a number of pages equal to the set weights.  For
example, if the weights are (2,1), then 2 pages will be allocated on node0
for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

Some high level notes about the pieces of weighted interleave:

current->il_prev:
    Tracks the node previously allocated from.

current->il_weight:
    The active weight of the current node (current->il_prev)
    When this reaches 0, current->il_prev is set to the next node
    and current->il_weight is set to the next weight.

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node.  Operates only on task->mempolicy.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Operates on VMA policies.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight will be allocated first.

    Operates only on the task->mempolicy.

One piece of complexity is the interaction between a recent refactor which
split the logic to acquire the "ilx" (interleave index) of an allocation
and the actually application of the interleave.  If a call to
alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external
callers set ilx to `0`, an index value, or will call get_vma_policy() to
acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol.  The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces the
`weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
requirements (task/vma respectively).

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Hasan Al Maruf <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Rakie Kim <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Co-developed-by: Honggyu Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Co-developed-by: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Srinivasulu Thanneeru <[email protected]>
Co-developed-by: Ravi Jonnalagadda <[email protected]>
Signed-off-by: Ravi Jonnalagadda <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Gregory Price [Fri, 2 Feb 2024 17:02:36 +0000 (12:02 -0500)]

mm/mempolicy: refactor a read-once mechanism into a function for re-use

Move the use of barrier() to force policy->nodemask onto the stack into a
function `read_once_policy_nodemask` so that it may be re-used.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gregory Price <[email protected]>
Suggested-by: "Huang, Ying" <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hasan Al Maruf <[email protected]>
Cc: Honggyu Kim <[email protected]>
Cc: Hyeongtak Ji <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rakie Kim <[email protected]>
Cc: Ravi Jonnalagadda <[email protected]>
Cc: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Rakie Kim [Fri, 2 Feb 2024 17:02:35 +0000 (12:02 -0500)]

mm/mempolicy: implement the sysfs-based weighted_interleave interface

Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension", v5.

Weighted interleave is a new interleave policy intended to make use of
heterogeneous memory environments appearing with CXL.

The existing interleave mechanism does an even round-robin distribution of
memory across all nodes in a nodemask, while weighted interleave
distributes memory across nodes according to a provided weight.  (Weight =
# of page allocations per round)

Weighted interleave is intended to reduce average latency when bandwidth
is pressured - therefore increasing total throughput.

In other words: It allows greater use of the total available bandwidth in
a heterogeneous hardware environment (different hardware provides
different bandwidth capacity).

As bandwidth is pressured, latency increases - first linearly and then
exponentially.  By keeping bandwidth usage distributed according to
available bandwidth, we therefore can reduce the average latency of a
cacheline fetch.

A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/

From the article:
```
Constant region:
    The latency response is fairly constant for the first 40%
    of the sustained bandwidth.
Linear region:
    In between 40% to 80% of the sustained bandwidth, the
    latency response increases almost linearly with the bandwidth
    demand of the system due to contention overhead by numerous
    memory requests.
Exponential region:
    Between 80% to 100% of the sustained bandwidth, the memory
    latency is dominated by the contention latency which can be
    as much as twice the idle latency or more.
Maximum sustained bandwidth :
    Is 65% to 75% of the theoretical maximum bandwidth.
```

As a general rule of thumb:
* If bandwidth usage is low, latency does not increase. It is
  optimal to place data in the nearest (lowest latency) device.
* If bandwidth usage is high, latency increases. It is optimal
  to place data such that bandwidth use is optimized per-device.

This is the top line goal: Provide a user a mechanism to target using the
"maximum sustained bandwidth" of each hardware component in a heterogenous
memory system.

For example, the stream benchmark demonstrates that 1:1 (default)
interleave is actively harmful, while weighted interleave can be
beneficial.  Default interleave distributes data such that too much
pressure is placed on devices with lower available bandwidth.

Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependant)
Targeted weights   : +2.5% to +4% (consistently better than DRAM)

Global means the task-policy was set (set_mempolicy), while targeted means
VMA policies were set (mbind2).  We see weighted interleave is not always
beneficial when applied globally, but is always beneficial when applied to
bandwidth-driving memory regions.

There are 4 patches in this set:
1) Implement system-global interleave weights as sysfs extension
   in mm/mempolicy.c.  These weights are RCU protected, and a
   default weight set is provided (all weights are 1 by default).

   In future work, we intend to expose an interface for HMAT/CDAT
   code to set reasonable default values based on the memory
   configuration of the system discovered at boot/hotplug.

2) A mild refactor of some interleave-logic for re-use in the
   new weighted interleave logic.

3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind

4) Protect interleave logic (weighted and normal) with the
   mems_allowed seq cookie.  If the nodemask changes while
   accessing it during a rebind, just retry the access.

Included below are some performance and LTP test information,
and a sample numactl branch which can be used for testing.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass

= version history
v5:
- style fixes
- mems_allowed cookie protection to detect rebind issues,
  prevents spurious allocation failures and/or mis-allocations
- sparse warning fixes related to __rcu on local variables

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <[email protected]>

Hardware: Single-socket, multiple CXL memory expanders.

Workload:                               W2
Data Signature:                         2:1 read:write
DRAM only bandwidth (GBps):             298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only:                    1.38x
Gain over default interleave:           2.64x

Workload:                               W5
Data Signature:                         1:1 read:write
DRAM only bandwidth (GBps):             273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only:                    1.4x
Gain over default interleave:           2.26x

=====================================================================
Performance test - Stream
From - Gregory Price <[email protected]>

Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependant)
mbind2 weights     : +2.5% to +4% (consistently better than DRAM)

dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function     Direction    BestRateMBs     AvgTime      MinTime      MaxTime
Copy:        0->0            200923.2     0.032662     0.031853     0.033301
Scale:       0->0            202123.0     0.032526     0.031664     0.032970
Add:         0->0            208873.2     0.047322     0.045961     0.047884
Triad:       0->0            208523.8     0.047262     0.046038     0.048414

CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             22209.7     0.288661     0.288162     0.289342
Scale:       0->0             22288.2     0.287549     0.287147     0.288291
Add:         0->0             24419.1     0.393372     0.393135     0.393735
Triad:       0->0             24484.6     0.392337     0.392083     0.394331

Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             44666.2     0.143671     0.143285     0.144174
Scale:       0->0             44781.6     0.143256     0.142916     0.143713
Add:         0->0             48600.7     0.197719     0.197528     0.197858
Triad:       0->0             48727.5     0.197204     0.197014     0.197439

global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0            190085.9     0.034289     0.033669     0.034645
Scale:       0->0            207677.4     0.031909     0.030817     0.033061
Add:         0->0            202036.8     0.048737     0.047516     0.053409
Triad:       0->0            217671.5     0.045819     0.044103     0.046755

targted regions w/ global weights (modified stream to mbind2 malloc'd regions))
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy:        0->0            205827.0     0.031445     0.031094     0.031984
Scale:       0->0            208171.8     0.031320     0.030744     0.032505
Add:         0->0            217352.0     0.045087     0.044168     0.046515
Triad:       0->0            216884.8     0.045062     0.044263     0.046982

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <[email protected]>

Hardware: Single socket, Single CXL memory Expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads:     56
Lookups:     170,000,000

Summary: +19% over DRAM. +47% over default interleave.

Performance tests - XSBench
1. dram only
$ numactl -m 0 ./XSBench -s XL –p 5000000
Runtime:     36.235 seconds
Lookups/s:   4,691,618

2. default interleave
$ numactl –i 0,2 ./XSBench –s XL –p 5000000
Runtime:     55.243 seconds
Lookups/s:   3,077,293

3. weighted interleave
numactl –w –i 0,2 ./XSBench –s XL –p 5000000
Runtime:     29.262 seconds
Lookups/s:   5,809,513

=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

= Existing tests
set_mempolicy, get_mempolicy, mbind

MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality but
did not adjust tests for weighting.  Basically the weights were set to 1,
which is the default, and it should behave the same as MPOL_INTERLEAVE if
logic is correct.

== set_mempolicy01 : passed   18, failed   0
== set_mempolicy02 : passed   10, failed   0
== set_mempolicy03 : passed   64, failed   0
== set_mempolicy04 : passed   32, failed   0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed   10, failed   0
   this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed   32, failed   0
   set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed   12, failed   0
   change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed   2, failed   0
== mbind01 : passed   15, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed   4, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed   16, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed   48, failed   0
   added MPOL_WEIGHTED_INTERLEAVE

=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct

result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct

result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
        char* mem = malloc(1024*1024*256);
        memset(mem, 1, 1024*1024*256);
        for (int i = 0; i  < ((1024*1024*256)/4096); i++)
        {
                mem = malloc(4096);
                mem[0] = 1;
        }
        printf("done\n");
        getchar();
        return 0;
}

This patch (of 4):

This patch provides a way to set interleave weight information under sysfs
at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  └── weighted_interleave [2]
      ├── node0 [3]
      └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

If a node value is set to `0`, the system-default value will be used.
As of this patch, the system-default for all nodes is always 1.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Gregory Price <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Hasan Al Maruf <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yajun Deng [Wed, 31 Jan 2024 03:19:13 +0000 (11:19 +0800)]

mm/mmap: use SZ_{8K, 128K} helper macro

Use SZ_{8K, 128K} helper macro instead of the number in init_user_reserve
and reserve_mem_notifier. This is more readable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yajun Deng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:48 +0000 (17:35 -0800)]

Docs/translations/damon/usage: update for monitor_on renaming

Update DAMON debugfs interface sections on the translated usage documents
to reflect the fact that 'monitor_on' file has renamed to
'monitor_on_DEPRECATED'.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:47 +0000 (17:35 -0800)]

Docs/admin-guide/mm/damon/usage: update for monitor_on renaming

Update DAMON debugfs interface sections on the usage document to reflect
the fact that 'monitor_on' file has renamed to 'monitor_on_DEPRECATED'.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:46 +0000 (17:35 -0800)]

mm/damon/dbgfs: rename monitor_on file to monitor_on_DEPRECATED

Kernel builders could silently enable CONFIG_DAMON_DBGFS_DEPRECATED.
Users who manually check the files under the DAMON debugfs directory could
notice the deprecation owing to the 'DEPRECATED' DAMON debugfs file, but
there could be users who doesn't manually check the files.

Make the deprecation cannot be ignored in the case by renaming
'monitor_on' file, which is essential for real use of DAMON on runtime, to
'monitor_on_DEPRECATED'.  Still users who control DAMON via only
user-space tool could ignore the deprecation, but that's what the tool
developers should take care of.  DAMON user-space tool, damo, has also
made a change[1] for the purpose.

[1] commit 935dae76f2aee ("_damon_args: Rename --damon_interface to
    --damon_interface_DEPRECATED") of https://github.com/awslabs/damo

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:45 +0000 (17:35 -0800)]

selftets/damon: prepare for monitor_on file renaming

Following change will rename 'monitor_on' DAMON debugfs file to
'monitor_on_DEPRECATED', to make the deprecation unignorable in runtime.
Since it could make DAMON selftests fail and disturb future bisects,
update DAMON selftests to support the change.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:44 +0000 (17:35 -0800)]

Docs/admin-guide/mm/damon/usage: document 'DEPRECATED' file of DAMON debugfs interface

Document the newly added DAMON debugfs interface deprecation notice file
on the usage document.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:43 +0000 (17:35 -0800)]

mm/damon/dbgfs: make debugfs interface deprecation message a macro

DAMON debugfs interface deprecation message is written twice, once for the
warning, and again for DEPRECATED file's read output. De-duplicate those
by defining the message as a macro and reuse.

[[email protected]: s/comnst/const/]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:42 +0000 (17:35 -0800)]

mm/damon/dbgfs: implement deprecation notice file

Implement a read-only file for DAMON debugfs interface deprecation notice,
to let users who manually read/write the DAMON debugfs files from their
shell command line easily notice the fact.

[[email protected]: fix bogus string length]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:41 +0000 (17:35 -0800)]

mm/damon: rename CONFIG_DAMON_DBGFS to DAMON_DBGFS_DEPRECATED

DAMON debugfs interface is deprecated.  The fact has documented by commit
5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs
interface deprecation notice").  Commit 620932cd2852 ("mm/damon/dbgfs:
print DAMON debugfs interface deprecation message") further started
printing a warning message when users still use it.  Many people don't
read documentation or kernel log, though.

Make the deprecation harder to be ignored using the approach of commit
eb07c4f39c3e ("mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED").
'make oldconfig' with 'CONFIG_DAMON_DBGFS=y' will get a new prompt with
the explicit deprecation notice on the name.  'make olddefconfig' with
'CONFIG_DAMON_DBGFS=y' will result in not building DAMON debugfs
interface.  If there is a real user of DAMON debugfs interface, they will
complain the change to the builder.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

SeongJae Park [Tue, 30 Jan 2024 01:35:40 +0000 (17:35 -0800)]

Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints example

Patch series "mm/damon: make DAMON debugfs interface deprecation
unignorable".

DAMON debugfs interface is deprecated in February 2023, by commit
5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs
interface deprecation notice"). Make the fact unable to be easily ignored
by removing an example usage from the document (patch 1), renaming the
config (patch 2), adding a deprecation notice file to the debugfs
directory (patches 3-5), and renaming the debugfs file that essnetial to
be used for real use of DAMON (patches 6-9).

This patch (of 9):

DAMON tracepoints example on the DAMON usage document is using DAMON
debugfs interface, which is deprecated. Use its alternative, DAMON sysfs
interface.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Hu Haowen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:56 +0000 (20:36 -0500)]

mm: zswap: function ordering: shrink_memcg_cb

shrink_memcg_cb() is called by the shrinker and is based on
zswap_writeback_entry(). Move it in between. Save one fwd decl.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:55 +0000 (20:36 -0500)]

mm: zswap: function ordering: writeback

Shrinking needs writeback. Naturally, move the writeback code above
the shrinking code. Delete the forward decl.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:54 +0000 (20:36 -0500)]

mm: zswap: function ordering: per-cpu compression infra

The per-cpu compression init/exit callbacks are awkwardly in the
middle of the shrinker code. Move them up to the compression section.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:53 +0000 (20:36 -0500)]

mm: zswap: function ordering: compress & decompress functions

Writeback needs to decompress. Move the (de)compression API above what
will be the consolidated shrinking/writeback code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:52 +0000 (20:36 -0500)]

mm: zswap: function ordering: move entry section out of tree section

The higher-level entry operations modify the tree, so move the entry
API after the tree section.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:51 +0000 (20:36 -0500)]

mm: zswap: function ordering: move entry sections out of LRU section

This completes consolidation of the LRU section.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:50 +0000 (20:36 -0500)]

mm: zswap: function ordering: public lru api

The zswap entry section sits awkwardly in the middle of LRU-related
functions. Group the external LRU API functions first.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:49 +0000 (20:36 -0500)]

mm: zswap: function ordering: pool params

Patch series "mm: zswap: cleanups".

Cleanups and maintenance items that accumulated while reviewing zswap
patches.

This patch (of 20):

The parameters primarily control pool attributes. Move those
operations up to the pool section.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:48 +0000 (20:36 -0500)]

mm: zswap: function ordering: zswap_pools

Move the operations against the global zswap_pools list (current pool,
last, find) to the pool section.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:47 +0000 (20:36 -0500)]

mm: zswap: function ordering: pool refcounting

Move pool refcounting functions into the pool section. First the
destroy functions, then the get and put which uses them.

__zswap_pool_empty() has an upward reference to the global
zswap_pools, to sanity check it's not the currently active pool that's
being freed. That gets the forward decl for zswap_pool_current().

This puts the get and put function above all callers, so kill the
forward decls as well.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:46 +0000 (20:36 -0500)]

mm: zswap: function ordering: pool alloc & free

The function ordering in zswap.c is a little chaotic, which requires
jumping in unexpected directions when following related code. This is
a series of patches that brings the file into the following order:

- pool functions
- lru functions
- rbtree functions
- zswap entry functions
- compression/backend functions
- writeback & shrinking functions
- store, load, invalidate, swapon, swapoff
- debugfs
- init

But it has to be split up such the moving still produces halfway
readable diffs.

In this patch, move pool allocation and freeing functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:45 +0000 (20:36 -0500)]

mm: zswap: simplify zswap_invalidate()

The branching is awkward and duplicates code. The comment about
writeback is also misleading: yes, the entry might have been written
back. Or it might have never been stored in zswap to begin with due to
a rejection - zswap_invalidate() is called on all exiting swap entries.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:44 +0000 (20:36 -0500)]

mm: zswap: further cleanup zswap_store()

- Remove dupentry, reusing entry works just fine.
- Rename pool to shrink_pool, as this one actually is confusing.
- Remove page, use folio_nid() and kmap_local_folio() directly.
- Set entry->swpentry in a common path.
- Move value and src to local scope of use.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:43 +0000 (20:36 -0500)]

mm: zswap: break out zwap_compress()

zswap_store() is long and mixes work at the zswap layer with work at
the backend and compression layer. Move compression & backend work to
zswap_compress(), mirroring zswap_decompress().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:42 +0000 (20:36 -0500)]

mm: zswap: rename __zswap_load() to zswap_decompress()

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:41 +0000 (20:36 -0500)]

mm: zswap: clean up zswap_entry_put()

Remove stale comment and unnecessary local variable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:40 +0000 (20:36 -0500)]

mm: zswap: warn when referencing a dead entry

Put a standard sanity check on zswap_entry_get() for UAF scenario.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:39 +0000 (20:36 -0500)]

mm: zswap: move zswap_invalidate_entry() to related functions

Move it up to the other tree and refcounting functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:38 +0000 (20:36 -0500)]

mm: zswap: inline and remove zswap_entry_find_get()

There is only one caller and the function is trivial. Inline it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Johannes Weiner [Tue, 30 Jan 2024 01:36:37 +0000 (20:36 -0500)]

mm: zswap: rename zswap_free_entry to zswap_entry_free

There is a zswap_entry_ namespace with multiple functions already.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Chengming Zhou [Sun, 28 Jan 2024 13:28:51 +0000 (13:28 +0000)]

mm/list_lru: remove list_lru_putback()

Since the only user zswap_lru_putback() has gone, remove
list_lru_putback() too.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Chengming Zhou [Sun, 28 Jan 2024 13:28:50 +0000 (13:28 +0000)]

mm/zswap: fix race between lru writeback and swapoff

LRU writeback has race problem with swapoff, as spotted by Yosry [1]:

CPU1 CPU2
shrink_memcg_cb swap_off
  list_lru_isolate   zswap_invalidate
  zswap_swapoff
    kfree(tree)
  // UAF
  spin_lock(&tree->lock)

The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.

We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry.  Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.

So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely.  Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().

Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.

Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore.  -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here.  For concurrent
shrinker, the folio will be writeback and freed anyway.  -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.

[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yosry Ahmed [Fri, 26 Jan 2024 08:06:44 +0000 (08:06 +0000)]

x86/mm: clarify "prev" usage in switch_mm_irqs_off()

In the x86 implementation of switch_mm_irqs_off(), we do not use the
"prev" argument passed in by the caller, we use exclusively use
"real_prev", which is cpu_tlbstate.loaded_mm.  This is not obvious at the
first sight.

Furthermore, a comment describes a condition that happens when called with
prev == next, but this should not affect the function in any way since
prev is unused.  Apparently, the comment is intended to clarify why we
don't rely on prev == next to decide whether we need to update CR3, but
again, it is not obvious.  The comment also references the fact that
leave_mm() calls with prev == NULL and tsk == NULL, but this also
shouldn't matter because prev is unused and tsk is only used in one
function which has a NULL check.

Clarify things by renaming (prev -> unused) and (real_prev -> prev), also
move and rewrite the comment as an explanation for why we don't rely on
"prev" supplied by the caller in x86 code and use our own.  Hopefully this
makes reading the code easier.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yosry Ahmed [Fri, 26 Jan 2024 08:06:43 +0000 (08:06 +0000)]

x86/mm: delete unused cpu argument to leave_mm()

The argument is unused since commit 3d28ebceaffa ("x86/mm: Rework lazy
TLB to track the actual loaded mm"), delete it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Huang Ying [Fri, 26 Jan 2024 08:19:44 +0000 (16:19 +0800)]

mm and cache_info: remove unnecessary CPU cache info update

For each CPU hotplug event, we will update per-CPU data slice size and
corresponding PCP configuration for every online CPU to make the
implementation simple. But, Kyle reported that this takes tens seconds
during boot on a machine with 34 zones and 3840 CPUs.

So, in this patch, for each CPU hotplug event, we only update per-CPU data
slice size and corresponding PCP configuration for the CPUs that share
caches with the hotplugged CPU. With the patch, the system boot time
reduces 67 seconds on the machine.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages")
Signed-off-by: "Huang, Ying" <[email protected]>
Originally-by: Kyle Meyer <[email protected]>
Reported-and-tested-by: Kyle Meyer <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Levi Yun [Fri, 26 Jan 2024 15:25:54 +0000 (15:25 +0000)]

kswapd: replace try_to_freeze() with kthread_freezable_should_stop()

Instead of using try_to_freeze, use kthread_freezable_should_stop in
kswapd. By this, we can avoid unnecessary freezing when kswapd should
stop.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Levi Yun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

T.J. Mercier [Fri, 26 Jan 2024 21:19:25 +0000 (21:19 +0000)]

mm: memcg: don't periodically flush stats when memcg is disabled

The root memcg is onlined even when memcg is disabled. When it's onlined
a 2 second periodic stat flush is started, but no stat flushing is
required when memcg is disabled because there can be no child memcgs.
Most calls to flush memcg stats are avoided when memcg is disabled as a
result of the mem_cgroup_disabled check added in 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing"), but the periodic flushing started in
mem_cgroup_css_online is not. Skip it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats")
Signed-off-by: T.J. Mercier <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Li <[email protected]>
Reported-by: Minchan Kim <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Michal Koutn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Breno Leitao [Fri, 5 Jan 2024 15:54:19 +0000 (07:54 -0800)]

selftests/mm: new test that steals pages

This test stresses the race between of madvise(DONTNEED), a page fault
and a parallel huge page mmap, which should fail due to lack of
available page available for mapping.

This test case must run on a system with one and only one huge page
available.

# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

During setup, the test allocates the only available page, and starts
three threads:

  - thread 1:
      * madvise(MADV_DONTNEED) on the allocated huge page
  - thread 2:
      * Write to the allocated huge page
  - thread 3:
      * Tries to allocated (steal) an extra huge page (which is not
        available)

thread 3 should never succeed in the allocation, since the only huge
page was never unmapped, and should be reserved.

Touching the old page after thread3 allocation will raise a SIGBUS.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Breno Leitao <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Alexander Potapenko [Wed, 24 Jan 2024 17:31:34 +0000 (18:31 +0100)]

mm: kmsan: remove runtime checks from kmsan_unpoison_memory()

Similarly to what's been done in commit 85716a80c16d ("kmsan: allow using
__msan_instrument_asm_store() inside runtime"), it should be safe to call
kmsan_unpoison_memory() from within the runtime, as it does not allocate
memory or take locks. Remove the redundant runtime checks.

This should fix false positives seen with CONFIG_DEBUG_LIST=y when
the non-instrumented lib/stackdepot.c failed to unpoison the memory
chunks later checked by the instrumented lib/list_debug.c

Also replace the implementation of kmsan_unpoison_entry_regs() with
a call to kmsan_unpoison_memory().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")
Signed-off-by: Alexander Potapenko <[email protected]>
Tested-by: Marco Elver <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Nicholas Miehlbradt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Wed, 24 Jan 2024 18:12:15 +0000 (18:12 +0000)]

highmem: add kernel-doc for memcpy_*_folio()

This was inadvertently skipped when adding the new functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Vishal Verma [Wed, 24 Jan 2024 20:03:50 +0000 (12:03 -0800)]

dax: add a sysfs knob to control memmap_on_memory behavior

Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.

The default memmap_on_memory setting for dax devices originating via pmem
or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above defaults
if needed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Verma <[email protected]>
Tested-by: Li Zhijian <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Huang, Ying <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Vishal Verma [Wed, 24 Jan 2024 20:03:49 +0000 (12:03 -0800)]

mm/memory_hotplug: export mhp_supports_memmap_on_memory()

In preparation for adding sysfs ABI to toggle memmap_on_memory semantics
for drivers adding memory, export the mhp_supports_memmap_on_memory()
helper. This allows drivers to check if memmap_on_memory support is
available before trying to request it, and display an appropriate
message if it isn't available. As part of this, remove the size argument
to this - with recent updates to allow memmap_on_memory for larger
ranges, and the internal splitting of altmaps into respective memory
blocks, the size argument is meaningless.

[[email protected]: fix build]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Verma <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Huang Ying <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Vishal Verma [Wed, 24 Jan 2024 20:03:48 +0000 (12:03 -0800)]

Documentatiion/ABI: add ABI documentation for sys-bus-dax

Add the missing sysfs ABI documentation for the device DAX subsystem.
Various ABI attributes under this have been present since v5.1, and more
have been added over time. In preparation for adding a new attribute,
add this file with the historical details.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Verma <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Vishal Verma [Wed, 24 Jan 2024 20:03:47 +0000 (12:03 -0800)]

dax/bus.c: replace several sprintf() with sysfs_emit()

There were several places where drivers/dax/bus.c uses 'sprintf' to print
sysfs data. Since a sysfs_emit() helper is available specifically for
this purpose, replace all the sprintf() usage for sysfs with sysfs_emit()
in this file.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Verma <[email protected]>
Reported-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Vishal Verma [Wed, 24 Jan 2024 20:03:46 +0000 (12:03 -0800)]

dax/bus.c: replace driver-core lock usage by a local rwsem

Patch series "Add DAX ABI for memmap_on_memory", v7.

This series adds sysfs ABI to control memmap_on_memory behavior for DAX
devices.

Patch 1 replaces incorrect device_lock() usage with a local rwsem - this
was identified during review.

Patch 2 is also a preparatory patch that replaces sprintf() for sysfs
operations with sysfs_emit()

Patch 3 adds the missing documentation for the sysfs ABI for DAX regions
and Dax devices.

Patch 4 exports mhp_supports_memmap_on_memory().

Patch 5 adds the new ABI for toggling memmap_on_memory semantics for dax
devices.

This patch (of 5):

The dax driver incorrectly used driver-core device locks to protect
internal dax region and dax device configuration structures. Replace the
device lock usage with a local rwsem, one each for dax region
configuration and dax device configuration. As a result of this
conversion, no device_lock() usage remains in dax/bus.c.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Verma <[email protected]>
Reported-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yosry Ahmed [Thu, 25 Jan 2024 08:14:23 +0000 (08:14 +0000)]

mm: zswap: remove unused tree argument in zswap_entry_put()

Commit 7310895779624 ("mm: zswap: tighten up entry invalidation") removed
the usage of tree argument, delete it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yajun Deng [Wed, 24 Jan 2024 03:57:19 +0000 (11:57 +0800)]

mm/mmap: introduce vma_set_range()

There is a lot of code needs to set the range of vma in mmap.c, introduce
vma_set_range() to simplify the code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yajun Deng <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yosry Ahmed [Wed, 24 Jan 2024 04:51:12 +0000 (04:51 +0000)]

mm: zswap: remove unnecessary trees cleanups in zswap_swapoff()

During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
called for all swap entries before zswap_swapoff() is called. This means
that all zswap entries should already be removed from the tree. Simplify
zswap_swapoff() by removing the trees cleanup code, and leave an assertion
in its place.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yosry Ahmed [Wed, 24 Jan 2024 04:51:11 +0000 (04:51 +0000)]

mm: swap: enforce updating inuse_pages at the end of swap_range_free()

Patch series "mm: zswap: simplify zswap_swapoff()", v2.

These patches aim to simplify zswap_swapoff() by removing the unnecessary
trees cleanup code.  Patch 1 makes sure that the order of operations
during swapoff is enforced correctly, making sure the simplification in
patch 2 is correct in a future-proof manner.

This patch (of 2):

In swap_range_free(), we update inuse_pages then do some cleanups (arch
invalidation, zswap invalidation, swap cache cleanups, etc).  During
swapoff, try_to_unuse() checks that inuse_pages is 0 to make sure all swap
entries are freed.  Make sure we only update inuse_pages after we are done
with the cleanups in swap_range_free(), and use the proper memory barriers
to enforce it.  This makes sure that code following try_to_unuse() can
safely assume that swap_range_free() ran for all entries in thr swapfile
(e.g.  swap cache cleanup, zswap_swapoff()).

In practice, this currently isn't a problem because swap_range_free() is
called with the swap info lock held, and the swapoff code happens to spin
for that after try_to_unuse().  However, this seems fragile and
unintentional, so make it more relable and future-proof.  This also
facilitates a following simplification of zswap_swapoff().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Chengming Zhou [Fri, 19 Jan 2024 11:22:23 +0000 (11:22 +0000)]

mm/zswap: split zswap rb-tree

Each swapfile has one rb-tree to search the mapping of swp_entry_t to
zswap_entry, that use a spinlock to protect, which can cause heavy lock
contention if multiple tasks zswap_store/load concurrently.

Optimize the scalability problem by splitting the zswap rb-tree into
multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M),
just like we did in the swap cache address_space splitting.

Although this method can't solve the spinlock contention completely, it
can mitigate much of that contention.  Below is the results of kernel
build in tmpfs with zswap shrinker enabled:

     linux-next  zswap-lock-optimize
real 1m9.181s    1m3.820s
user 17m44.036s  17m40.100s
sys  7m37.297s   4m54.622s

So there are clearly improvements.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chris Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Chengming Zhou [Fri, 19 Jan 2024 11:22:22 +0000 (11:22 +0000)]

mm/zswap: make sure each swapfile always have zswap rb-tree

Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2.

When testing the zswap performance by using kernel build -j32 in a tmpfs
directory, I found the scalability of zswap rb-tree is not good, which is
protected by the only spinlock.  That would cause heavy lock contention if
multiple tasks zswap_store/load concurrently.

So a simple solution is to split the only one zswap rb-tree into multiple
rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M).  This idea
is from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks").

Although this method can't solve the spinlock contention completely, it
can mitigate much of that contention.  Below is the results of kernel
build in tmpfs with zswap shrinker enabled:

     linux-next  zswap-lock-optimize
real 1m9.181s    1m3.820s
user 17m44.036s  17m40.100s
sys  7m37.297s   4m54.622s

So there are clearly improvements.  And it's complementary with the
ongoing zswap xarray conversion by Chris.  Anyway, I think we can also
merge this first, it's complementary IMHO.  So I just refresh and resend
this for further discussion.

This patch (of 2):

Not all zswap interfaces can handle the absence of the zswap rb-tree,
actually only zswap_store() has handled it for now.

To make things simple, we make sure each swapfile always have the zswap
rb-tree prepared before being enabled and used.  The preparation is
unlikely to fail in practice, this patch just make it explicit.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chris Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Lukas Bulwahn [Mon, 22 Jan 2024 09:25:04 +0000 (10:25 +0100)]

mempolicy: clean up minor dead code in queue_pages_test_walk()

Commit 2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code")
removes MPOL_MF_LAZY handling in queue_pages_test_walk(), and with that,
there is no effective use of the local variable endvma in that function
remaining.

Remove the local variable endvma and its dead code. No functional change.

This issue was identified with clang-analyzer's dead stores analysis.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lukas Bulwahn <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Lukas Bulwahn [Mon, 22 Jan 2024 10:20:00 +0000 (11:20 +0100)]

maple_tree: avoid duplicate variable init in mast_spanning_rebalance()

The local variables r_tmp and l_tmp in mast_spanning_rebalance() are
already initialized at its declaration; there is no need to assign the
value again.

Remove the duplicate initialization of {r,l}_tmp. No functional change.
Due to common compiler optimizations, also no change to object code.

This issue was identified with clang-analyzer's dead stores analysis.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lukas Bulwahn <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Nico Pache [Wed, 17 Jan 2024 18:00:37 +0000 (11:00 -0700)]

selftests: mm: perform some system cleanup before using hugepages

When running with CATEGORY= (thp | hugetlb) we see a large numbers of
tests failing. These failures are due to not being able to allocate a
hugepage and normally occur on memory contrainted systems or when using
large page sizes.

drop_cache and compact_memory before the tests for a higher chance at a
successful hugepage allocation.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nico Pache <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Lokesh Gidra [Wed, 17 Jan 2024 22:39:21 +0000 (14:39 -0800)]

userfaultfd: fix return error if mmap_changing is non-zero in MOVE ioctl

To be consistent with other uffd ioctl's returning EAGAIN when
mmap_changing is detected, we should change UFFDIO_MOVE to do the same.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Greg Thelen [Thu, 18 Jan 2024 09:50:57 +0000 (01:50 -0800)]

selftests/memfd: delete unused declarations

Commit 32d118ad50a5 ("selftests/memfd: add tests for F_SEAL_EXEC"):
- added several unused 'nbytes' local variables

Commit 6469b66e3f5a ("selftests: improve vm.memfd_noexec sysctl tests"):
- orphaned 'newpid_thread_fn2()' forward declaration
- orphaned 'join_newpid_thread()' forward declaration
- added unused 'pid' local in sysctl_simple_child()
- orphaned 'fd' local in sysctl_simple_child()
- added unused 'fd' in sysctl_nested_child()

Delete the unused locals and forward declarations.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Thelen <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Shakeel Butt [Thu, 18 Jan 2024 18:42:35 +0000 (18:42 +0000)]

mm: writeback: ratelimit stat flush from mem_cgroup_wb_stats

One of our workloads (Postgres 14) has regressed when migrated from 5.10
to 6.1 upstream kernel.  The regression can be reproduced by sysbench's
oltp_write_only benchmark.  It seems like the always on rstat flush in
mem_cgroup_wb_stats() is causing the regression.  So, rate limit that
specific rstat flush.  One potential consequence would be the dirty
throttling might be decided on stale memcg stats.  However from our
benchmarks and production traffic we have not observed any change in the
dirty throttling behavior of the application.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Wed, 17 Jan 2024 10:39:54 +0000 (18:39 +0800)]

mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

The GFP flags from vma_thp_gfp_mask() according to user configuration only
used for large folio allocation but not for memory cgroup charge, and
GFP_KERNEL is used for both order-0 and large order folio when memory
cgroup charge at present.  However, mem_cgroup_charge() uses the GFP flags
in a fairly sophisticated way.  In addition to checking
gfpflags_allow_blocking(), it pays attention to __GFP_NORETRY and
__GFP_RETRY_MAYFAIL to ensure that processes within this memcg do not
exceed their quotas.

So we'd better to move mem_cgroup_charge() into alloc_anon_folio(),

1) it will make us to allocate as much as possible large order folio,
   because we could try the next order if mem_cgroup_charge() fails,
   although the memcg's memory usage is close to its limits.

2) using same GFP flags for allocation and charge is to be consistent
   with PMD THP firstly, in addition, according to GFP flag returned from
   vma_thp_gfp_mask(), GFP_TRANSHUGE_LIGHT could make us skip direct
   reclaim, _GFP_NORETRY will make us skip mem_cgroup_oom() and won't
   trigger memory cgroup oom from large order(order <= COSTLY_ORDER) folio
   charging.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Ryan Roberts [Tue, 16 Jan 2024 14:12:35 +0000 (14:12 +0000)]

tools/mm: add thpmaps script to dump THP usage info

With the proliferation of large folios for file-backed memory, and more
recently the introduction of multi-size THP for anonymous memory, it is
becoming useful to be able to see exactly how large folios are mapped into
processes.  For some architectures (e.g.  arm64), if most memory is mapped
using contpte-sized and -aligned blocks, TLB usage can be optimized so
it's useful to see where these requirements are and are not being met.

thpmaps is a Python utility that reads /proc/<pid>/smaps,
/proc/<pid>/pagemap and /proc/kpageflags to print information about how
transparent huge pages (both file and anon) are mapped to a specified
process or cgroup.  It aims to help users debug and optimize their
workloads.  In future we may wish to introduce stats directly into the
kernel (e.g.  smaps or similar), but for now this provides a short term
solution without the need to introduce any new ABI.

Run with help option for a full listing of the arguments:

    # ./thpmaps --help

--8<--
usage: thpmaps [-h] [--pid pid | --cgroup path] [--rollup]
               [--cont size[KMG]] [--inc-smaps] [--inc-empty]
               [--periodic sleep_ms]

Prints information about how transparent huge pages are mapped, either
system-wide, or for a specified process or cgroup.

When run with --pid, the user explicitly specifies the set of pids to
scan.  e.g.  "--pid 10 [--pid 134 ...]".  When run with --cgroup, the user
passes either a v1 or v2 cgroup and all pids that belong to the cgroup
subtree are scanned.  When run with neither --pid nor --cgroup, the full
set of pids on the system is gathered from /proc and scanned as if the
user had provided "--pid 1 --pid 2 ...".

A default set of statistics is always generated for THP mappings.
However, it is also possible to generate additional statistics for
"contiguous block mappings" where the block size is user-defined.

Statistics are maintained independently for anonymous and file-backed
(pagecache) memory and are shown both in kB and as a percentage of either
total anonymous or total file-backed memory as appropriate.

THP Statistics
--------------

Statistics are always generated for fully- and contiguously-mapped THPs
whose mapping address is aligned to their size, for each <size> supported
by the system.  Separate counters describe THPs mapped by PTE vs those
mapped by PMD.  (Although note a THP can only be mapped by PMD if it is
PMD-sized):

- anon-thp-pte-aligned-<size>kB
- file-thp-pte-aligned-<size>kB
- anon-thp-pmd-aligned-<size>kB
- file-thp-pmd-aligned-<size>kB

Similarly, statistics are always generated for fully- and contiguously-
mapped THPs whose mapping address is *not* aligned to their size, for each
<size> supported by the system.  Due to the unaligned mapping, it is
impossible to map by PMD, so there are only PTE counters for this case:

- anon-thp-pte-unaligned-<size>kB
- file-thp-pte-unaligned-<size>kB

Statistics are also always generated for mapped pages that belong to a THP
but where the is THP is *not* fully- and contiguously- mapped.  These
"partial" mappings are all counted in the same counter regardless of the
size of the THP that is partially mapped:

- anon-thp-pte-partial
- file-thp-pte-partial

Contiguous Block Statistics
---------------------------

An optional, additional set of statistics is generated for every
contiguous block size specified with `--cont <size>`.  These statistics
show how much memory is mapped in contiguous blocks of <size> and also
aligned to <size>.  A given contiguous block must all belong to the same
THP, but there is no requirement for it to be the *whole* THP.  Separate
counters describe contiguous blocks mapped by PTE vs those mapped by PMD:

- anon-cont-pte-aligned-<size>kB
- file-cont-pte-aligned-<size>kB
- anon-cont-pmd-aligned-<size>kB
- file-cont-pmd-aligned-<size>kB

As an example, if monitoring 64K contiguous blocks (--cont 64K), there are
a number of sources that could provide such blocks: a fully- and
contiguously-mapped 64K THP that is aligned to a 64K boundary would
provide 1 block.  A fully- and contiguously-mapped 128K THP that is
aligned to at least a 64K boundary would provide 2 blocks.  Or a 128K THP
that maps its first 100K, but contiguously and starting at a 64K boundary
would provide 1 block.  A fully- and contiguously-mapped 2M THP would
provide 32 blocks.  There are many other possible permutations.

options:
  -h, --help           show this help message and exit
  --pid pid            Process id of the target process. Maybe issued
                       multiple times to scan multiple processes. --pid
                       and --cgroup are mutually exclusive. If neither
                       are provided, all processes are scanned to
                       provide system-wide information.
  --cgroup path        Path to the target cgroup in sysfs. Iterates
                       over every pid in the cgroup and its children.
                       --pid and --cgroup are mutually exclusive. If
                       neither are provided, all processes are scanned
                       to provide system-wide information.
  --rollup             Sum the per-vma statistics to provide a summary
                       over the whole system, process or cgroup.
  --cont size[KMG]     Adds stats for memory that is mapped in
                       contiguous blocks of <size> and also aligned to
                       <size>. May be issued multiple times to track
                       multiple sized blocks. Useful to infer e.g.
                       arm64 contpte and hpa mappings. Size must be a
                       power-of-2 number of pages.
  --inc-smaps          Include all numerical, additive
                       /proc/<pid>/smaps stats in the output.
  --inc-empty          Show all statistics including those whose value
                       is 0.
  --periodic sleep_ms  Run in a loop, polling every sleep_ms
                       milliseconds.

Requires root privilege to access pagemap and kpageflags.
--8<--

Example command to summarise fully and partially mapped THPs and 64K
contiguous blocks over all VMAs in all processes in the system
(--inc-empty forces printing stats that are 0):

    # ./thpmaps --cont 64K --rollup --inc-empty

--8<--
anon-thp-pmd-aligned-2048kB:      139264 kB ( 6%)
file-thp-pmd-aligned-2048kB:           0 kB ( 0%)
anon-thp-pte-aligned-16kB:             0 kB ( 0%)
anon-thp-pte-aligned-32kB:             0 kB ( 0%)
anon-thp-pte-aligned-64kB:         72256 kB ( 3%)
anon-thp-pte-aligned-128kB:            0 kB ( 0%)
anon-thp-pte-aligned-256kB:            0 kB ( 0%)
anon-thp-pte-aligned-512kB:            0 kB ( 0%)
anon-thp-pte-aligned-1024kB:           0 kB ( 0%)
anon-thp-pte-aligned-2048kB:           0 kB ( 0%)
anon-thp-pte-unaligned-16kB:           0 kB ( 0%)
anon-thp-pte-unaligned-32kB:           0 kB ( 0%)
anon-thp-pte-unaligned-64kB:           0 kB ( 0%)
anon-thp-pte-unaligned-128kB:          0 kB ( 0%)
anon-thp-pte-unaligned-256kB:          0 kB ( 0%)
anon-thp-pte-unaligned-512kB:          0 kB ( 0%)
anon-thp-pte-unaligned-1024kB:         0 kB ( 0%)
anon-thp-pte-unaligned-2048kB:         0 kB ( 0%)
anon-thp-pte-partial:              63232 kB ( 3%)
file-thp-pte-aligned-16kB:        809024 kB (47%)
file-thp-pte-aligned-32kB:         43168 kB ( 3%)
file-thp-pte-aligned-64kB:         98496 kB ( 6%)
file-thp-pte-aligned-128kB:        17536 kB ( 1%)
file-thp-pte-aligned-256kB:            0 kB ( 0%)
file-thp-pte-aligned-512kB:            0 kB ( 0%)
file-thp-pte-aligned-1024kB:           0 kB ( 0%)
file-thp-pte-aligned-2048kB:           0 kB ( 0%)
file-thp-pte-unaligned-16kB:       21712 kB ( 1%)
file-thp-pte-unaligned-32kB:         704 kB ( 0%)
file-thp-pte-unaligned-64kB:         896 kB ( 0%)
file-thp-pte-unaligned-128kB:      44928 kB ( 3%)
file-thp-pte-unaligned-256kB:          0 kB ( 0%)
file-thp-pte-unaligned-512kB:          0 kB ( 0%)
file-thp-pte-unaligned-1024kB:         0 kB ( 0%)
file-thp-pte-unaligned-2048kB:         0 kB ( 0%)
file-thp-pte-partial:               9252 kB ( 1%)
anon-cont-pmd-aligned-64kB:       139264 kB ( 6%)
file-cont-pmd-aligned-64kB:            0 kB ( 0%)
anon-cont-pte-aligned-64kB:       100672 kB ( 4%)
file-cont-pte-aligned-64kB:       161856 kB ( 9%)
--8<--

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Barry Song <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Zenghui Yu <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Ronald Monthero [Tue, 16 Jan 2024 13:31:45 +0000 (23:31 +1000)]

mm/zswap: improve with alloc_workqueue() call

The core-api create_workqueue is deprecated, this patch replaces the
create_workqueue with alloc_workqueue.  The previous implementation
workqueue of zswap was a bounded workqueue, this patch uses
alloc_workqueue() to create an unbounded workqueue.  The WQ_UNBOUND
attribute is desirable making the workqueue not localized to a specific
cpu so that the scheduler is free to exercise improvisations in any
demanding scenarios for offloading cpu time slices for workqueues.  For
example if any other workqueues of the same primary cpu had to be served
which are WQ_HIGHPRI and WQ_CPU_INTENSIVE.  Also Unbound workqueue happens
to be more efficient in a system during memory pressure scenarios in
comparison to a bounded workqueue.

shrink_wq = alloc_workqueue("zswap-shrink",
                     WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Overall the change suggested in this patch should be seamless and does not
alter the existing behavior, other than the improvisation to be an
unbounded workqueue.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ronald Monthero <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Pankaj Raghav [Mon, 15 Jan 2024 10:25:22 +0000 (11:25 +0100)]

readahead: use ilog2 instead of a while loop in page_cache_ra_order()

A while loop is used to adjust the new_order to be lower than the
ra->size. ilog2 could be used to do the same instead of using a loop.

ilog2 typically resolves to a bit scan reverse instruction. This is
particularly useful when ra->size is smaller than the 2^new_order as it
resolves in one instruction instead of looping to find the new_order.

No functional changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pankaj Raghav <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Hui Zhu [Thu, 11 Jan 2024 08:45:33 +0000 (08:45 +0000)]

fs/proc/task_mmu.c: add_to_pagemap: remove useless parameter addr

Function parameter addr of add_to_pagemap() is useless. Remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hui Zhu <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Tested-by: Muhammad Usama Anjum <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:29 +0000 (15:24 +0000)]

mm: convert mm_counter_file() to take a folio

Now all callers of mm_counter_file() have a folio, convert
mm_counter_file() to take a folio. Saves a call to compound_head() hidden
inside PageSwapBacked().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:28 +0000 (15:24 +0000)]

mm: convert mm_counter() to take a folio

Now all callers of mm_counter() have a folio, convert mm_counter() to take
a folio. Saves a call to compound_head() hidden inside PageAnon().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:27 +0000 (15:24 +0000)]

mm: convert to should_zap_page() to should_zap_folio()

Make should_zap_page() take a folio and rename it to should_zap_folio() as
preparation for converting mm counter functions to take a folio. Saves a
call to compound_head() hidden inside PageAnon().

[[email protected]: fix used-uninitialized warning]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:26 +0000 (15:24 +0000)]

mm: use pfn_swap_entry_folio() in copy_nonpresent_pte()

Call pfn_swap_entry_folio() as preparation for converting mm counter
functions to take a folio.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:25 +0000 (15:24 +0000)]

mm: use pfn_swap_entry_to_folio() in zap_huge_pmd()

Call pfn_swap_entry_to_folio() in zap_huge_pmd() as preparation for
converting mm counter functions to take a folio. Saves a call to
compound_head() embedded inside PageAnon().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:24 +0000 (15:24 +0000)]

mm: use pfn_swap_entry_folio() in __split_huge_pmd_locked()

Call pfn_swap_entry_folio() in __split_huge_pmd_locked() as preparation
for converting mm counter functions to take a folio.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Thu, 11 Jan 2024 15:24:23 +0000 (15:24 +0000)]

s390: use pfn_swap_entry_folio() in ptep_zap_swap_entry()

Call pfn_swap_entry_folio() in ptep_zap_swap_entry() as preparation for
converting mm counter functions to take a folio.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 15:24:22 +0000 (15:24 +0000)]

mprotect: use pfn_swap_entry_folio

We only want to know whether the folio is anonymous, so use
pfn_swap_entry_folio() and save a call to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 15:24:21 +0000 (15:24 +0000)]

proc: use pfn_swap_entry_folio where obvious

These callers only pass the result to PageAnon(), so we can save the extra
call to compound_head() by using pfn_swap_entry_folio().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 15:24:20 +0000 (15:24 +0000)]

mm: add pfn_swap_entry_folio()

Patch series "mm: convert mm counter to take a folio", v3.

Make sure all mm_counter() and mm_counter_file() callers have a folio,
then convert mm counter functions to take a folio, which saves some
compound_head() calls.

This patch (of 10):

Thanks to the compound_head() hidden inside PageLocked(), this saves a
call to compound_head() over calling page_folio(pfn_swap_entry_to_page())

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 18:12:19 +0000 (18:12 +0000)]

memcg: use a folio in get_mctgt_type_thp

Replace five calls to compound_head() with one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 18:12:18 +0000 (18:12 +0000)]

memcg: use a folio in get_mctgt_type

Replace seven calls to compound_head() with one. We still use the page as
page_mapped() is different from folio_mapped().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 18:12:17 +0000 (18:12 +0000)]

memcg: return the folio in union mc_target

All users of target.page convert it to the folio, so we can just return
the folio directly and save a few calls to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Matthew Wilcox (Oracle) [Thu, 11 Jan 2024 18:12:16 +0000 (18:12 +0000)]

memcg: convert mem_cgroup_move_charge_pte_range() to use a folio

Patch series "Convert memcontrol charge moving to use folios".

No part of these patches should change behaviour; all the called functions
already convert from page to folio, so this ought to simply be a reduction
in the number of calls to compound_head().

This patch (of 4):

Remove many calls to compound_head() by calling page_folio() once at the
start of each stanza which receives a struct page from 'target'. There
should be no change in behaviour here as all the called functions start
out by converting the page to its folio.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yang Shi [Thu, 21 Dec 2023 06:59:42 +0000 (22:59 -0800)]

mm: mmap: no need to call khugepaged_enter_vma() for stack

We avoid allocating THP for temporary stack, even though
khugepaged_enter_vma() is called for stack VMAs, it actualy returns
false. So no need to call it in the first place at all.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: kernel test robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Haifeng Xu [Thu, 28 Dec 2023 06:27:15 +0000 (06:27 +0000)]

mm: list_lru: remove unused macro list_lru_init_key()

list_lru_init_key() isn't used by anyone, remove it to clean up.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Haifeng Xu [Thu, 28 Dec 2023 06:27:14 +0000 (06:27 +0000)]

mm: list_lru: disable memcg_aware when cgroup.memory is set to "nokmem"

Actually, when using a boot time kernel option "cgroup.memory=nokmem", all
lru items are inserted to list_lru_node. But for those users who invoke
list_lru_init_memcg() to initialize list_lru, list_lru_memcg_aware()
returns true. And this brings unneeded operations related to memcg.

To make things more convenient, let's disable memcg_aware when
cgroup.memory is set to "nokmem".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kefeng Wang [Fri, 29 Dec 2023 08:22:07 +0000 (16:22 +0800)]

mm: memory: use nth_page() in clear/copy_subpage()

The clear and copy of huge gigantic page has converted to use nth_page()
to handle the possible discontinuous struct page(SPARSEMEM without
VMEMMAP), but not change for the non-gigantic part, fix it too.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Yajun Deng [Wed, 10 Jan 2024 08:46:22 +0000 (16:46 +0800)]

mm/mmap: simplify vma link and unlink

The file parameter in the __remove_shared_vm_struct is no longer used,
remove it.

These functions vma_link() and mmap_region() have some of the same code,
introduce vma_link_file() helper function to simplify the code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yajun Deng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Kuan-Ying Lee [Wed, 7 Feb 2024 08:58:51 +0000 (16:58 +0800)]

scripts/gdb/vmalloc: fix vmallocinfo error

The patch series "Mitigate a vmap lock contention" removes vmap_area_list,
which will break the gdb vmallocinfo command:

(gdb) lx-vmallocinfo
Python Exception <class 'gdb.error'>: No symbol "vmap_area_list" in current context.
Error occurred in Python: No symbol "vmap_area_list" in current context.

So we can instead use vmap_nodes to iterate all vmallocinfo.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kuan-Ying Lee <[email protected]>
Cc: Casper Li <[email protected]>
Cc: AngeloGioacchino Del Regno <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: Kieran Bingham <[email protected]>
Cc: Matthias Brugger <[email protected]>
Cc: Qun-Wei Lin <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

JP Kobryn [Fri, 5 Jan 2024 20:24:01 +0000 (12:24 -0800)]

selftests/mm/ksm_functional: prevent unmapping undefined address

Replace some goto statements with return statements so that unmap() is not
called on an undefined address. This change is made so that unmap() can
only be reached after mmap() is called (and the address mentioned is
defined). Returning MAP_FAILED seems acceptable since client code checks
for this value.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 42096aa24b82 ("selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if anything got merged")
Signed-off-by: JP Kobryn <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Hongbo Li [Mon, 8 Jan 2024 04:48:15 +0000 (12:48 +0800)]

mm/filemap: avoid type conversion

The return type of function folio_test_hugetlb is bool type, there is no
need to assign it to an integer type.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hongbo Li <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Sumanth Korikkar [Mon, 8 Jan 2024 13:27:47 +0000 (14:27 +0100)]

s390: enable MHP_MEMMAP_ON_MEMORY

Enable MHP_MEMMAP_ON_MEMORY to support "memmap on memory".
memory_hotplug.memmap_on_memory=true kernel parameter should be set in
kernel boot option to enable the feature.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Gerald Schaefer <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Sumanth Korikkar [Mon, 8 Jan 2024 13:27:46 +0000 (14:27 +0100)]

s390/mm: implement MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers

MEM_PREPARE_ONLINE memory notifier makes memory block physical
accessible via sclp assign command. The notifier ensures self-contained
memory maps are accessible and hence enabling the "memmap on memory" on
s390.

MEM_FINISH_OFFLINE memory notifier shifts the memory block to an
inaccessible state via sclp unassign command.

Implementation considerations:
* When MHP_MEMMAP_ON_MEMORY is disabled, the system retains the old
  behavior. This means the memory map is allocated from default memory.
* If MACHINE_HAS_EDAT1 is unavailable, MHP_MEMMAP_ON_MEMORY is
  automatically disabled. This ensures that vmemmap pagetables do not
  consume additional memory from the default memory allocator.
* The MEM_GOING_ONLINE notifier has been modified to perform no
  operation, as MEM_PREPARE_ONLINE already executes the sclp assign
  command.
* The MEM_CANCEL_ONLINE/MEM_OFFLINE notifier now performs no operation, as
  MEM_FINISH_OFFLINE already executes the sclp unassign command.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Gerald Schaefer <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Sumanth Korikkar [Mon, 8 Jan 2024 13:27:45 +0000 (14:27 +0100)]

s390/sclp: remove unhandled memory notifier type

Remove memory notifier types which are unhandled by s390. Unhandled
memory notifier types are covered by default case.

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Alexander Gordeev <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

commit | commitdiff | tree

Sumanth Korikkar [Mon, 8 Jan 2024 13:27:44 +0000 (14:27 +0100)]

s390/mm: allocate vmemmap pages from self-contained memory range

Allocate memory map (struct pages array) from the hotplugged memory
range, rather than using system memory. The change addresses the issue
where standby memory, when configured to be much larger than online
memory, could potentially lead to ipl failure due to memory map
allocation from online memory. For example, 16MB of memory map
allocation is needed for a memory block size of 1GB and when standby
memory is configured much larger than online memory, this could lead to
ipl failure.

To address this issue, the solution involves introducing "memmap on
memory" using the vmem_altmap structure on s390.  Architectures that
want to implement it should pass the altmap to the vmemmap_populate()
function and its associated callchain. This enhancement is discussed in
commit 4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment
vmemmap_populate()")

Provide "memmap on memory" support for s390 by passing the altmap in
vmemmap_populate() and its callchain. The allocation path is described
as follows:
* When altmap is NULL in vmemmap_populate(), memory map allocation
  occurs using the existing vmemmap_alloc_block_buf().
* When altmap is not NULL in vmemmap_populate(), memory map allocation
  still uses vmemmap_alloc_block_buf(), but this function internally
  calls altmap_alloc_block_buf().

For deallocation, the process is outlined as follows:
* When altmap is NULL in vmemmap_free(), memory map deallocation happens
  through free_pages().
* When altmap is not NULL in vmemmap_free(), memory map deallocation
  occurs via vmem_altmap_free().

While memory map allocation is primarily handled through the
self-contained memory map range, there might still be a small amount of
system memory allocation required for vmemmap pagetables. To mitigate
this impact, this feature will be limited to machines with EDAT1
support.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Gerald Schaefer <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Empty description

RSS Atom

This page took 0.13936 seconds and 4 git commands to generate.