Git Repo - linux.git/log

mm/damon/core: add multiple goals per damos_quota and helpers for those

The feedback-driven DAMOS quota auto-tuning feature allows only single
goal to the DAMON kernel API users.  The API users could implement
multiple goals for the end-users on their level, and that's what DAMON
sysfs interface is doing.  More DAMON kernel API users such as
DAMON_RECLAIM would need to do similar work.  To reduce unnecessary future
duplciated efforts, support multiple goals from DAMOS core layer.  To make
the support in minimum non-destructive change, keep the old single goal
setup interface, and add multiple goals setup.  The single goal will
treated as one of the multiple goals, so old API users are not required to
make any change.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/core: split out quota goal related fields to a struct

'struct damos_quota' is not small now. Split out fields for quota goal to
a separate struct for easier reading.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/damon: move comments and fields for damos-quota-prioritization to the end

The comments and definition of 'struct damos_quota' lists a few fields for
effective quota generation first, fields for regions prioritization under
the quota, and then remaining fields for effective quota generation.
Readers' should unnecesssarily switch their context in the middle. List
all the fields for the effective quota first, and then fields for the
prioritization for making it easier to read.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/admin-guide/mm/damon/usage: document effective_bytes file

Update DAMON usage document for the effective quota file of the DAMON
sysfs interface.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/ABI/damon: document effective_bytes sysfs file

Update the DAMON ABI doc for the effective_bytes sysfs file and the
kdamond state file input command for updating the content of the file.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/sysfs: implement a kdamond command for updating schemes' effective quotas

Implement yet another kdamond 'state' file input command, namely
'update_schemes_effective_quotas'. If it is written, the
'effective_bytes' files of the kdamond will be updated to provide the
current effective size quota of each scheme in bytes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/sysfs-schemes: implement quota effective_bytes file

DAMON sysfs interface allows users to set two types of quotas, namely time
quota and size quota.  DAMOS converts time quota to a size quota and use
smaller one among the resulting two size quotas.  The resulting effective
size quota can be helpful for debugging and analysis, but not exposed to
the user.  The recently added feedback-driven quota auto-tuning is making
it even more mysterious.

Implement a DAMON sysfs interface read-only empty file, namely
'effective_bytes', under the quota goal DAMON sysfs directory.  It will be
extended to expose the effective quota to the end user.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/core: set damos_quota->esz as public field and document

Patch series "mm/damon: let DAMOS feeds and tame/auto-tune itself".

The Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning
patchset[1] which has merged since commit 9294a037c015 ("mm/damon/core:
implement goal-oriented feedback-driven quota auto-tuning") made the
mechanism and the policy separated.  That is, users can set a part of
DAMOS control policies without a deep understanding of the mechanism but
just their demands such as SLA.

However, users are still required to do some additional work of manually
collecting their target metric and feeding it to DAMOS.  In the case of
end-users who use DAMON sysfs interface, the context switches between
user-space and kernel-space could also make it inefficient.  The overhead
is supposed to be only trivial in common cases, though.  Meanwhile, in
simple use cases, the target metric could be common system metrics that
the kernel can efficiently self-retrieve, such as memory pressure stall
time (PSI).

Extend DAMOS quota auto-tuning to support multiple types of metrics
including the DAMOS self-retrievable ones, and add support for memory
pressure stall time metric.  Different types of metrics can be supported
in future.  The auto-tuning capability is currently supported for only
users of DAMOS kernel API and DAMON sysfs interface.  Extend the support
to DAMON_RECLAIM.

Patches Sequence
================

First five patches are for helping debugging and fine-tuning existing
quota control features.  The first one (patch 1) exposes the effective
quota that is made with given user inputs to DAMOS kernel API users and
kernel-doc documents.  Following four patches implement (patches 1, 2 and
3) and document (patches 4 and 5) a new DAMON sysfs file that exposes the
value.

Following six patches cleanup and simplify the existing DAMOS quota
auto-tuning code by improving layout of comments and data structures
(patches 6 and 7), supporting common use cases, namely multiple goals
(patches 8, 9 and 10), and simplifying the interface (patch 11).

Then six patches for the main purpose of this patchset follow.  The first
three changes extend the core logic for various target metrics (patch 12),
implement memory pressure stall time-based target metric support (patch
13), and update DAMON sysfs interface to support the new target metric
(patch 14).  Then, documentation updates for the features on design (patch
15), ABI (patch 16), and usage (patch 17) follow.

Last three patches add auto-tuning support on DAMON_RECLAIM.  The patches
implement DAMON_RECLAIM parameters for user-feedback driven quota
auto-tuning (patch 18), memory pressure stall time-driven quota
self-tuning (patch 19), and finally update the DAMON_RECLAIM usage
document for the new parameters (patch 20).

[1] https://lore.kernel.org/all/20231130023652 [email protected]/

This patch (of 20):

DAMOS allow users to specify the quota as they want in multiple ways
including time quota, size quota, and feedback-based auto-tuning.  DAMOS
makes one effective quota out of the inputs and use it at the end.
Knowing the current effective quota helps understanding DAMOS' internal
mechanism and fine-tuning quotas.  DAMON kernel API users can get the
information from ->esz field of damos_quota struct, but the field is
marked as private purpose, and not kernel-doc documented.  Make it public
and document.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check

khugepaged scans the entire address space in the background for each
given mm, looking for opportunities to merge sequences of basic pages
into huge pages.  However, when an mm is inserted to the mm_slots list,
and the MMF_DISABLE_THP flag is set later, this scanning process
becomes unnecessary for that mm and can be skipped to avoid redundant
operations, especially in scenarios with a large address space.

On an Intel Core i5 CPU, the time taken by khugepaged to scan the
address space of the process, which has been set with the
MMF_DISABLE_THP flag after being added to the mm_slots list, is as
follows (shorter is better):

VMA Count |   Old   |   New   |  Change
---------------------------------------
    50    |   23us  |    9us  |  -60.9%
   100    |   32us  |    9us  |  -71.9%
   200    |   44us  |    9us  |  -79.5%
   400    |   75us  |    9us  |  -88.0%
   800    |   98us  |    9us  |  -90.8%

Once the count of VMAs for the process exceeds page_to_scan, khugepaged
needs to wait for scan_sleep_millisecs ms before scanning the next
process.  IMO, unnecessary scans could actually be skipped with a very
inexpensive mm->flags check in this case.

This commit introduces a check before each scanning process to test the
MMF_DISABLE_THP flag for the given mm; if the flag is set, the scanning
process is bypassed, thereby improving the efficiency of khugepaged.

This optimization is not a correctness issue but rather an enhancement
to save expensive checks on each VMA when userspace cannot prctl itself
before spawning into the new process.

On some servers within our company, we deploy a daemon responsible for
monitoring and updating local applications.  Some applications prefer
not to use THP, so the daemon calls prctl to disable THP before
fork/exec.  Conversely, for other applications, the daemon calls prctl
to enable THP before fork/exec.

Ideally, the daemon should invoke prctl after the fork, but its current
implementation follows the described approach.  In the Go standard
library, there is no direct encapsulation of the fork system call;
instead, fork and execve are combined into one through
syscall.ForkExec.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lance Yang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

MAINTAINERS: update mm and memcg entries

Add F: lines for memory management and memory cgroup include files.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.c

Nathan reported below building error:

=====
$ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all
..
arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function `arch_crash_save_vmcoreinfo':
machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str'
====

On architecutres, like arm, s390, ppc, sh, function
arch_crash_save_vmcoreinfo() is located in machine_kexec.c and it can
only be compiled in when CONFIG_KEXEC_CORE=y.

That's not right because arch_crash_save_vmcoreinfo() is used to export
arch specific vmcoreinfo. CONFIG_VMCORE_INFO is supposed to control its
compiling in. However, CONFIG_VMVCORE_INFO could be independent of
CONFIG_KEXEC_CORE, e.g CONFIG_PROC_KCORE=y will select CONFIG_VMVCORE_INFO.
Or CONFIG_KEXEC/CONFIG_KEXEC_FILE is set while CONFIG_CRASH_DUMP is
not set, it will report linking error.

So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo out to
a new file vmcore_info.c. Let CONFIG_VMCORE_INFO decide if compiling in
arch_crash_save_vmcoreinfo().

[[email protected]: remove stray newlines at eof]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Reported-by: Nathan Chancellor <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/T/#u
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

loongarch, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on loongarch with some adjustments.

Here use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling
in the crashkernel reservation code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on arm with some adjustments.

Here use CONFIG_CRASH_RESERVE ifdef to replace CONFIG_KEXEC ifdef.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

riscv, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on risc-v with some adjustments.

Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery, and
use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling
in the crashkernel reservation code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mips, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on mips with some adjustments.

Here use IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling
in the crashkernel reservation code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

sh, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on SuperH with some adjustments.

Wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery, and use
IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide if compiling in the
crashkernel reservation code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

s390, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on s390 with some adjustments.

Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

ppc, crash: enforce KEXEC and KEXEC_FILE to select CRASH_DUMP

In PowerPC, the crash dumping and kexec reboot share code in
arch_kexec_locate_mem_hole(), in which struct crash_mem is used.

Here enfoce enforce KEXEC and KEXEC_FILE to select CRASH_DUMP for now.

[[email protected]: fix allnoconfig on ppc]
Link: https://lkml.kernel.org/r/ZbJwMyCpz4HDySoo@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Acked-by: Hari Bathini <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on arm64 with some adjustments.

Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery.

[[email protected]: fix building error in generic codes]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

x86, crash: wrap crash dumping code into crash related ifdefs

Now crash codes under kernel/ folder has been split out from kexec
code, crash dumping can be separated from kexec reboot in config
items on x86 with some adjustments.

Here, also change some ifdefs or IS_ENABLED() check to more appropriate
ones, e,g
- #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP
- (!IS_ENABLED(CONFIG_KEXEC_CORE)) - > (!IS_ENABLED(CONFIG_CRASH_RESERVE))

[[email protected]: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope]
Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

crash: clean up kdump related config items

By splitting CRASH_RESERVE and VMCORE_INFO out from CRASH_CORE, cleaning
up the dependency of FA_DMUMP on CRASH_DUMP, and moving crash codes from
kexec_core.c to crash_core.c, now we can rearrange CRASH_DUMP to
depend on KEXEC_CORE, and make CRASH_DUMP select CRASH_RESERVE and
VMCORE_INFO.

KEXEC_CORE won't select CRASH_RESERVE and VMCORE_INFO any more because
KEXEC_CORE enables codes which allocate control pages, copy
kexec/kdump segments, and prepare for switching. These codes are shared
by both kexec reboot and crash dumping.

Doing this makes codes and the corresponding config items more
logical (the right item depends on or is selected by the left item).

PROC_KCORE -----------> VMCORE_INFO

           |----------> VMCORE_INFO
FA_DUMP----|
           |----------> CRASH_RESERVE

                                                ---->VMCORE_INFO
                                               /
                                               |---->CRASH_RESERVE
KEXEC      --|                                /|
             |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
KEXEC_FILE --|                               \ |
                                               \---->CRASH_HOTPLUG

KEXEC      --|
             |--> KEXEC_CORE--> kexec reboot
KEXEC_FILE --|

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

crash: split crash dumping code out from kexec_core.c

Currently, KEXEC_CORE select CRASH_CORE automatically because crash codes
need be built in to avoid compiling error when building kexec code even
though the crash dumping functionality is not enabled. E.g
--------------------
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
---------------------

After splitting out crashkernel reservation code and vmcoreinfo exporting
code, there's only crash related code left in kernel/crash_core.c. Now
move crash related codes from kexec_core.c to crash_core.c and only build it
in when CONFIG_CRASH_DUMP=y.

And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope,
or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP
ifdef in generic kernel files.

With these changes, crash_core codes are abstracted from kexec codes and
can be disabled at all if only kexec reboot feature is wanted.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

crash: remove dependency of FA_DUMP on CRASH_DUMP

In kdump kernel, /proc/vmcore is an elf file mapping the crashed kernel's
old memory content. Its elf header is constructed in 1st kernel and passed
to kdump kernel via elfcorehdr_addr. Config CRASH_DUMP enables the code
of 1st kernel's old memory accessing in different architectures.

Currently, config FA_DUMP has dependency on CRASH_DUMP because fadump
needs access global variable 'elfcorehdr_addr' to judge if it's in
kdump kernel within function is_kdump_kernel(). In the current
kernel/crash_dump.c, variable 'elfcorehdr_addr' is defined, and function
setup_elfcorehdr() used to parse kernel parameter to fetch the passed
value of elfcorehdr_addr. Only for accessing elfcorehdr_addr, FA_DUMP
really doesn't have to depends on CRASH_DUMP.

To remove the dependency of FA_DUMP on CRASH_DUMP to avoid confusion,
rename kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when
CONFIG_VMCORE_INFO is ebabled. With this, FA_DUMP doesn't need to depend
on CRASH_DUMP.

[[email protected]: power/fadump: make FA_DUMP select CRASH_DUMP]
Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Acked-by: Hari Bathini <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

crash: split vmcoreinfo exporting code out from crash_core.c

Now move the relevant codes into separate files:
kernel/crash_reserve.c, include/linux/crash_reserve.h.

And add config item CRASH_RESERVE to control its enabling.

And also update the old ifdeffery of CONFIG_CRASH_CORE, including of
<linux/crash_core.h> and config item dependency on CRASH_CORE
accordingly.

And also do renaming as follows:
- arch/xxx/kernel/{crash_core.c => vmcore_info.c}
because they are only related to vmcoreinfo exporting on x86, arm64,
riscv.

And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to
decide if build in crash_core.c.

[[email protected]: remove duplicated include in vmcore_info.c]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Acked-by: Hari Bathini <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

kexec: split crashkernel reservation code out from crash_core.c

Patch series "Split crash out from kexec and clean up related config
items", v3.

Motivation:
=============
Previously, LKP reported a building error. When investigating, it can't
be resolved reasonablly with the present messy kdump config items.

https://lore.kernel.org/oe-kbuild-all/202312182200 [email protected]/

The kdump (crash dumping) related config items could causes confusions:

Firstly,

CRASH_CORE enables codes including
- crashkernel reservation;
- elfcorehdr updating;
- vmcoreinfo exporting;
- crash hotplug handling;

Now fadump of powerpc, kcore dynamic debugging and kdump all selects
CRASH_CORE, while fadump
- fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing
   global variable 'elfcorehdr_addr';
- kcore only needs vmcoreinfo exporting;
- kdump needs all of the current kernel/crash_core.c.

So only enabling PROC_CORE or FA_DUMP will enable CRASH_CORE, this
mislead people that we enable crash dumping, actual it's not.

Secondly,

It's not reasonable to allow KEXEC_CORE select CRASH_CORE.

Because KEXEC_CORE enables codes which allocate control pages, copy
kexec/kdump segments, and prepare for switching. These codes are
shared by both kexec reboot and kdump. We could want kexec reboot,
but disable kdump. In that case, CRASH_CORE should not be selected.

--------------------
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
---------------------

Thirdly,

It's not reasonable to allow CRASH_DUMP select KEXEC_CORE.

That could make KEXEC_CORE, CRASH_DUMP are enabled independently from
KEXEC or KEXEC_FILE. However, w/o KEXEC or KEXEC_FILE, the KEXEC_CORE
code built in doesn't make any sense because no kernel loading or
switching will happen to utilize the KEXEC_CORE code.
---------------------
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
---------------------

In this case, what is worse, on arch sh and arm, KEXEC relies on MMU,
while CRASH_DUMP can still be enabled when !MMU, then compiling error is
seen as the lkp test robot reported in above link.

------arch/sh/Kconfig------
config ARCH_SUPPORTS_KEXEC
         def_bool MMU

config ARCH_SUPPORTS_CRASH_DUMP
         def_bool BROKEN_ON_SMP
---------------------------

Changes:
===========
1, split out crash_reserve.c from crash_core.c;
2, split out vmcore_infoc. from crash_core.c;
3, move crash related codes in kexec_core.c into crash_core.c;
4, remove dependency of FA_DUMP on CRASH_DUMP;
5, clean up kdump related config items;
6, wrap up crash codes in crash related ifdefs on all 8 arch-es
   which support crash dumping, except of ppc;

Achievement:
===========
With above changes, I can rearrange the config item logic as below (the right
item depends on or is selected by the left item):

    PROC_KCORE -----------> VMCORE_INFO

               |----------> VMCORE_INFO
    FA_DUMP----|
               |----------> CRASH_RESERVE

                                                    ---->VMCORE_INFO
                                                   /
                                                   |---->CRASH_RESERVE
    KEXEC      --|                                /|
                 |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
    KEXEC_FILE --|                               \ |
                                                   \---->CRASH_HOTPLUG

    KEXEC      --|
                 |--> KEXEC_CORE (for kexec reboot only)
    KEXEC_FILE --|

Test
========
On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips,
riscv, loongarch, I did below three cases of config item setting and
building all passed. Take configs on x86_64 as exampmle here:

(1) Both CONFIG_KEXEC and KEXEC_FILE is unset, then all kexec/kdump
items are unset automatically:
# Kexec and crash features
# CONFIG_KEXEC is not set
# CONFIG_KEXEC_FILE is not set
# end of Kexec and crash features

(2) set CONFIG_KEXEC_FILE and 'make olddefconfig':
---------------
# Kexec and crash features
CONFIG_CRASH_RESERVE=y
CONFIG_VMCORE_INFO=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC_FILE=y
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
CONFIG_CRASH_MAX_MEMORY_RANGES=8192
# end of Kexec and crash features
---------------

(3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig':
------------------------
# Kexec and crash features
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC_FILE=y
# end of Kexec and crash features
------------------------

Note:
For ppc, it needs investigation to make clear how to split out crash
code in arch folder. Hope Hari and Pingfan can help have a look, see if
it's doable. Now, I make it either have both kexec and crash enabled, or
disable both of them altogether.

This patch (of 14):

Both kdump and fa_dump of ppc rely on crashkernel reservation.  Move the
relevant codes into separate files: crash_reserve.c,
include/linux/crash_reserve.h.

And also add config item CRASH_RESERVE to control its enabling of the
codes.  And update config items which has relationship with crashkernel
reservation.

And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE
when those scopes are only crashkernel reservation related.

And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on
arm64, x86 and risc-v because those architectures' crash_core.h is only
related to crashkernel reservation.

[[email protected]: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Acked-by: Hari Bathini <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Klara Modin <[email protected]>
Cc: Michael Kelley <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Yang Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: refactor vmalloc_dump_obj() function

This patch tends to simplify the function in question, by removing an
extra stack "objp" variable, returning back to an early exit approach if
spin_trylock() fails or VA was not found.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: improve description of vmap node layer

This patch adds extra explanation of recently added vmap node layer based
on community feedback. No functional change.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: add a shrinker to drain vmap pools

The added shrinker is used to return back current cached VAs into a global
vmap space, when a system enters into a low memory mode.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: set nr_nodes based on CPUs in a system

A number of nodes which are used in the alloc/free paths is set based on
num_possible_cpus() in a system.  Please note a high limit threshold
though is fixed and corresponds to 128 nodes.

For 32-bit or single core systems an access to a global vmap heap is not
balanced.  Such small systems do not suffer from lock contentions due to
low number of CPUs.  In such case the nr_nodes is equal to 1.

Test on AMD Ryzen Threadripper 3970X 32-Core Processor: sudo
./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
94.41%     0.89%  [kernel]        [k] _raw_spin_lock
93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
72.96%     0.81%  [kernel]        [k] alloc_vmap_area
56.94%     0.00%  [kernel]        [k] __get_vm_area_node
41.95%     0.00%  [kernel]        [k] vmalloc
37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
35.17%     0.00%  [kernel]        [k] ret_from_fork
35.17%     0.00%  [kernel]        [k] kthread
35.08%     0.00%  [test_vmalloc]  [k] test_func
34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
23.53%     0.25%  [kernel]        [k] vfree.part.0
21.72%     0.00%  [kernel]        [k] remove_vm_area
20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
63.36%     0.02%  [kernel]        [k] vmalloc
63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
30.42%     4.46%  [kernel]        [k] vfree.part.0
28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
27.28%     0.19%  [kernel]        [k] __get_vm_area_node
26.13%     1.50%  [kernel]        [k] alloc_vmap_area
21.72%    21.67%  [kernel]        [k] clear_page_rep
19.51%     2.43%  [kernel]        [k] _raw_spin_lock
16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
13.40%     2.07%  [kernel]        [k] free_unref_page
10.62%     0.01%  [kernel]        [k] remove_vm_area
  9.02%     8.73%  [kernel]        [k] insert_vmap_area
  8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%     0.00%  [kernel]        [k] ret_from_fork
  8.94%     0.00%  [kernel]        [k] kthread
  8.29%     0.00%  [test_vmalloc]  [k] test_func
  7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%     4.73%  [kernel]        [k] purge_vmap_node
  4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

confirms that a native_queued_spin_lock_slowpath goes down to
16.51% percent from 93.07%.

The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: support multiple nodes in vmallocinfo

Allocated areas are spread among nodes, it implies that the scanning has
to be performed individually of each node in order to dump all existing
VAs.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: support multiple nodes in vread_iter

Extend the vread_iter() to be able to perform a sequential reading of VAs
which are spread among multiple nodes. So a data read over the /dev/kmem
correctly reflects a vmalloc memory layout.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: add a scan area of VA only once

Invoke a kmemleak_scan_area() function only for newly allocated objects to
add a scan area within that object. There is no reason to add a same scan
area(pointer to beginning or inside the object) several times. If a VA is
obtained from the cache its scan area has already been associated.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock")
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: offload free_vmap_area_lock lock

Concurrent access to a global vmap space is a bottle-neck.  We can
simulate a high contention by running a vmalloc test suite.

To address it, introduce an effective vmap node logic.  Each node behaves
as independent entity.  When a node is accessed it serves a request
directly(if possible) from its pool.

This model has a size based pool for requests, i.e.  pools are serialized
and populated based on object size and real demand.  A maximum object size
that pool can handle is set to 256 pages.

This technique reduces a pressure on the global vmap lock.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: remove global purge_vmap_area_root rb-tree

Similar to busy VA, lazily-freed area is stored to a node it belongs to.
Such approach does not require any global locking primitive, instead an
access becomes scalable what mitigates a contention.

This patch removes a global purge-lock, global purge-tree and global purge
list.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/vmalloc: remove vmap_area_list

Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile get
the base address of vmalloc area. Now, vmap_area_list is empty, so export
VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.

[[email protected]: fix a warning in the crash_save_vmcoreinfo_init()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: remove global vmap_area_root rb-tree

Store allocated objects in a separate nodes.  A va->va_start address is
converted into a correct node where it should be placed and resided.  An
addr_to_node() function is used to do a proper address conversion to
determine a node that contains a VA.

Such approach balances VAs across nodes as a result an access becomes
scalable.  Number of nodes in a system depends on number of CPUs.

Please note:

1. As of now allocated VAs are bound to a node-0. It means the
   patch does not give any difference comparing with a current
   behavior;

2. The global vmap_area_lock, vmap_area_root are removed as there
   is no need in it anymore. The vmap_area_list is still kept and
   is _empty_. It is exported for a kexec only;

3. The vmallocinfo and vread() have to be reworked to be able to
   handle multiple nodes.

[[email protected]: mark vmap_init_free_space() with __init tag]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix a wrong value passed to __find_vmap_area()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: move vmap_init_free_space() down in vmalloc.c

A vmap_init_free_space() is a function that setups a vmap space and is
considered as part of initialization phase. Since a main entry which is
vmalloc_init(), has been moved down in vmalloc.c it makes sense to follow
the pattern.

There is no a functional change as a result of this patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: rename adjust_va_to_fit_type() function

This patch renames the adjust_va_to_fit_type() function to va_clip() which
is shorter and more expressive.

There is no a functional change as a result of this patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: add va_alloc() helper

Patch series "Mitigate a vmap lock contention", v3.

1. Motivation

- Offload global vmap locks making it scaled to number of CPUS;

- If possible and there is an agreement, we can remove the "Per cpu kva
  allocator" to make the vmap code to be more simple;

- There were complaints from XFS folk that a vmalloc might be contented
  on their workloads.

2. Design(high level overview)

We introduce an effective vmap node logic.  A node behaves as independent
entity to serve an allocation request directly(if possible) from its pool.
That way it bypasses a global vmap space that is protected by its own
lock.

An access to pools are serialized by CPUs.  Number of nodes are equal to
number of CPUs in a system.  Please note the high threshold is bound to
128 nodes.

Pools are size segregated and populated based on system demand.  The
maximum alloc request that can be stored into a segregated storage is 256
pages.  The lazily drain path decays a pool by 25% as a first step and as
second populates it by fresh freed VAs for reuse instead of returning them
into a global space.

When a VA is obtained(alloc path), it is stored in separate nodes.  A
va->va_start address is converted into a correct node where it should be
placed and resided.  Doing so we balance VAs across the nodes as a result
an access becomes scalable.  The addr_to_node() function does a proper
address conversion to a correct node.

A vmap space is divided on segments with fixed size, it is 16 pages.  That
way any address can be associated with a segment number.  Number of
segments are equal to num_possible_cpus() but not grater then 128.  The
numeration starts from 0.  See below how it is converted:

static inline unsigned int
addr_to_node_id(unsigned long addr)
{
return (addr / zone_size) % nr_nodes;
}

On a free path, a VA can be easily found by converting its "va_start"
address to a certain node it resides.  It is moved from "busy" data to
"lazy" data structure.  Later on, as noted earlier, the lazy kworker
decays each node pool and populates it by fresh incoming VAs.  Please
note, a VA is returned to a node that did an alloc request.

3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor

sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
94.41%     0.89%  [kernel]        [k] _raw_spin_lock
93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
72.96%     0.81%  [kernel]        [k] alloc_vmap_area
56.94%     0.00%  [kernel]        [k] __get_vm_area_node
41.95%     0.00%  [kernel]        [k] vmalloc
37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
35.17%     0.00%  [kernel]        [k] ret_from_fork
35.17%     0.00%  [kernel]        [k] kthread
35.08%     0.00%  [test_vmalloc]  [k] test_func
34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
23.53%     0.25%  [kernel]        [k] vfree.part.0
21.72%     0.00%  [kernel]        [k] remove_vm_area
20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
63.36%     0.02%  [kernel]        [k] vmalloc
63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
30.42%     4.46%  [kernel]        [k] vfree.part.0
28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
27.28%     0.19%  [kernel]        [k] __get_vm_area_node
26.13%     1.50%  [kernel]        [k] alloc_vmap_area
21.72%    21.67%  [kernel]        [k] clear_page_rep
19.51%     2.43%  [kernel]        [k] _raw_spin_lock
16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
13.40%     2.07%  [kernel]        [k] free_unref_page
10.62%     0.01%  [kernel]        [k] remove_vm_area
  9.02%     8.73%  [kernel]        [k] insert_vmap_area
  8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%     0.00%  [kernel]        [k] ret_from_fork
  8.94%     0.00%  [kernel]        [k] kthread
  8.29%     0.00%  [test_vmalloc]  [k] test_func
  7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%     4.73%  [kernel]        [k] purge_vmap_node
  4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

confirms that a native_queued_spin_lock_slowpath goes down to
16.51% percent from 93.07%.

The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

This patch (of 11):

Currently __alloc_vmap_area() function contains an open codded logic that
finds and adjusts a VA based on allocation request.

Introduce a va_alloc() helper that adjusts found VA only.  There is no a
functional change as a result of this patch.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: update Documentation regarding page_owner_stacks

Update page_owner documentation including the new page_owner_stacks
feature to show how it can be used.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: filter out stacks by a threshold

We want to be able to filter out the stacks based on a threshold we can
can tune. By writing to 'count_threshold' file, we can adjust the
threshold value.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: display all stacks and their count

This patch adds a new directory called 'page_owner_stacks' under
/sys/kernel/debug/, with a file called 'show_stacks' in it.  Reading from
that file will show all stacks that were added by page_owner followed by
their counting, giving us a clear overview of stack <-> count
relationship.

E.g:

  prep_new_page+0xa9/0x120
  get_page_from_freelist+0x801/0x2210
  __alloc_pages+0x18b/0x350
  alloc_pages_mpol+0x91/0x1f0
  folio_alloc+0x14/0x50
  filemap_alloc_folio+0xb2/0x100
  __filemap_get_folio+0x14a/0x490
  ext4_write_begin+0xbd/0x4b0 [ext4]
  generic_perform_write+0xc1/0x1e0
  ext4_buffered_write_iter+0x68/0xe0 [ext4]
  ext4_file_write_iter+0x70/0x740 [ext4]
  vfs_write+0x33d/0x420
  ksys_write+0xa5/0xe0
  do_syscall_64+0x80/0x160
  entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 4578

The seq stack_{start,next} functions will iterate through the list
stack_list in order to print all stacks.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Marco Elver <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: implement the tracking of the stacks count

Implement {inc,dec}_stack_record_count() which increments or decrements on
respective allocation and free operations, via __reset_page_owner() (free
operation) and __set_page_owner() (alloc operation).

Newly allocated stack_record structs will be added to the list stack_list
via add_stack_record_to_list(). Modifications on the list are protected
via a spinlock with irqs disabled, since this code can also be reached
from IRQ context.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: maintain own list of stack_records structs

page_owner needs to increment a stack_record refcount when a new
allocation occurs, and decrement it on a free operation. In order to do
that, we need to have a way to get a stack_record from a handle.
Implement __stack_depot_get_stack_record() which just does that, and make
it public so page_owner can use it.

Also, traversing all stackdepot buckets comes with its own complexity,
plus we would have to implement a way to mark only those stack_records
that were originated from page_owner, as those are the ones we are
interested in. For that reason, page_owner maintains its own list of
stack_records, because traversing that list is faster than traversing all
buckets while keeping at the same time a low complexity.

For now, add to stack_list only the stack_records of dummy_handle and
failure_handle, and set their refcount of 1.

Further patches will add code to increment or decrement stack_records
count on allocation and free operation.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib/stackdepot: move stack_record struct definition into the header

In order to move the heavy lifting into page_owner code, this one needs to
have access to the stack_record structure, which right now sits in
lib/stackdepot.c. Move it to the stackdepot.h header so page_owner can
access stack_record's struct fields.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib/stackdepot: fix first entry having a 0-handle

Patch series "page_owner: print stacks and their outstanding allocations",
v10.

page_owner is a great debug functionality tool that lets us know about all
pages that have been allocated/freed and their specific stacktrace.  This
comes very handy when debugging memory leaks, since with some scripting we
can see the outstanding allocations, which might point to a memory leak.

In my experience, that is one of the most useful cases, but it can get
really tedious to screen through all pages and try to reconstruct the
stack <-> allocated/freed relationship, becoming most of the time a
daunting and slow process when we have tons of allocation/free operations.

This patchset aims to ease that by adding a new functionality into
page_owner.  This functionality creates a new directory called
'page_owner_stacks' under 'sys/kernel//debug' with a read-only file called
'show_stacks', which prints out all the stacks followed by their
outstanding number of allocations (being that the times the stacktrace has
allocated but not freed yet).  This gives us a clear and a quick overview
of stacks <-> allocated/free.

We take advantage of the new refcount_f field that stack_record struct
gained, and increment/decrement the stack refcount on every
__set_page_owner() (alloc operation) and __reset_page_owner (free
operation) call.

Unfortunately, we cannot use the new stackdepot api STACK_DEPOT_FLAG_GET
because it does not fulfill page_owner needs, meaning we would have to
special case things, at which point makes more sense for page_owner to do
its own {dec,inc}rementing of the stacks.  E.g: Using
STACK_DEPOT_FLAG_PUT, once the refcount reaches 0, such stack gets
evicted, so page_owner would lose information.

This patchset also creates a new file called 'set_threshold' within
'page_owner_stacks' directory, and by writing a value to it, the stacks
which refcount is below such value will be filtered out.

A PoC can be found below:

# cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks.txt
# head -40 page_owner_full_stacks.txt
  prep_new_page+0xa9/0x120
  get_page_from_freelist+0x801/0x2210
  __alloc_pages+0x18b/0x350
  alloc_pages_mpol+0x91/0x1f0
  folio_alloc+0x14/0x50
  filemap_alloc_folio+0xb2/0x100
  page_cache_ra_unbounded+0x96/0x180
  filemap_get_pages+0xfd/0x590
  filemap_read+0xcc/0x330
  blkdev_read_iter+0xb8/0x150
  vfs_read+0x285/0x320
  ksys_read+0xa5/0xe0
  do_syscall_64+0x80/0x160
  entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 521

  prep_new_page+0xa9/0x120
  get_page_from_freelist+0x801/0x2210
  __alloc_pages+0x18b/0x350
  alloc_pages_mpol+0x91/0x1f0
  folio_alloc+0x14/0x50
  filemap_alloc_folio+0xb2/0x100
  __filemap_get_folio+0x14a/0x490
  ext4_write_begin+0xbd/0x4b0 [ext4]
  generic_perform_write+0xc1/0x1e0
  ext4_buffered_write_iter+0x68/0xe0 [ext4]
  ext4_file_write_iter+0x70/0x740 [ext4]
  vfs_write+0x33d/0x420
  ksys_write+0xa5/0xe0
  do_syscall_64+0x80/0x160
  entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 4609
...
...

# echo 5000 > /sys/kernel/debug/page_owner_stacks/set_threshold
# cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks_5000.txt
# head -40 page_owner_full_stacks_5000.txt
  prep_new_page+0xa9/0x120
  get_page_from_freelist+0x801/0x2210
  __alloc_pages+0x18b/0x350
  alloc_pages_mpol+0x91/0x1f0
  folio_alloc+0x14/0x50
  filemap_alloc_folio+0xb2/0x100
  __filemap_get_folio+0x14a/0x490
  ext4_write_begin+0xbd/0x4b0 [ext4]
  generic_perform_write+0xc1/0x1e0
  ext4_buffered_write_iter+0x68/0xe0 [ext4]
  ext4_file_write_iter+0x70/0x740 [ext4]
  vfs_write+0x33d/0x420
  ksys_pwrite64+0x75/0x90
  do_syscall_64+0x80/0x160
  entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 6781

  prep_new_page+0xa9/0x120
  get_page_from_freelist+0x801/0x2210
  __alloc_pages+0x18b/0x350
  pcpu_populate_chunk+0xec/0x350
  pcpu_balance_workfn+0x2d1/0x4a0
  process_scheduled_works+0x84/0x380
  worker_thread+0x12a/0x2a0
  kthread+0xe3/0x110
  ret_from_fork+0x30/0x50
  ret_from_fork_asm+0x1b/0x30
stack_count: 8641

This patch (of 7):

The very first entry of stack_record gets a handle of 0, but this is wrong
because stackdepot treats a 0-handle as a non-valid one.  E.g: See the
check in stack_depot_fetch()

Fix this by adding and offset of 1.

This bug has been lurking since the very beginning of stackdepot, but no
one really cared as it seems.  Because of that I am not adding a Fixes
tag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Marco Elver <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

merge mm-hotfixes-stable into mm-nonmm-stable to pick up stackdepot changes

mm/debug_vm_pgtable: fix BUG_ON with pud advanced test

Architectures like powerpc add debug checks to ensure we find only devmap
PUD pte entries. These debug checks are only done with CONFIG_DEBUG_VM.
This patch marks the ptes used for PUD advanced test devmap pte entries so
that we don't hit on debug checks on architecture like ppc64 as below.

WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138
....
NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138
LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
Call Trace:
[c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable)
[c000000004a2f980] [000d34c100000000] 0xd34c100000000
[c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334
[c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
[c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388

Also

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
....

NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174
LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
Call Trace:
[c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable)
[c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
[c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
[c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: cachestat: fix folio read-after-free in cache walk

In cachestat, we access the folio from the page cache's xarray to compute
its page offset, and check for its dirty and writeback flags. However, we
do not hold a reference to the folio before performing these actions,
which means the folio can concurrently be released and reused as another
folio/page/slab.

Get around this altogether by just using xarray's existing machinery for
the folio page offsets and dirty/writeback states.

This changes behavior for tmpfs files to now always report zeroes in their
dirty and writeback counters. This is okay as tmpfs doesn't follow
conventional writeback cache behavior: its pages get "cleaned" during
swapout, after which they're no longer resident etc.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cf264e1329fb ("cachestat: implement cachestat syscall")
Reported-by: Jann Horn <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Tested-by: Jann Horn <[email protected]>
Cc: <[email protected]> [6.4+]
Signed-off-by: Andrew Morton <[email protected]>

MAINTAINERS: add memory mapping entry with reviewers

Recently there have been a number of patches which have affected various
aspects of the memory mapping logic as implemented in mm/mmap.c where it
would have been useful for regular contributors to have been notified.

Add an entry for this part of mm in particular with regular contributors
tagged as reviewers.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index

With numa balancing on, when a numa system is running where a numa node
doesn't have its local memory so it has no managed zones, the following
oops has been observed.  It's because wakeup_kswapd() is called with a
wrong zone index, -1.  Fixed it by checking the index before calling
wakeup_kswapd().

> BUG: unable to handle page fault for address: 00000000000033f3
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
> ? __die
> ? page_fault_oops
> ? __pte_offset_map_lock
> ? exc_page_fault
> ? asm_exc_page_fault
> ? wakeup_kswapd
> migrate_misplaced_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> RIP: 0033:0x55b897ba0808
> Code: (omitted)
> RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
>  </TASK>

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Byungchul Park <[email protected]>
Reported-by: Hyeongtak Ji <[email protected]>
Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

kasan: revert eviction of stack traces in generic mode

This partially reverts commits cc478e0b6bdf, 63b85ac56a64, 08d7c94d9635,
a414d4286f34, and 773688a6cb24 to make use of variable-sized stack depot
records, since eviction of stack entries from stack depot forces fixed-
sized stack records.  Care was taken to retain the code cleanups by the
above commits.

Eviction was added to generic KASAN as a response to alleviating the
additional memory usage from fixed-sized stack records, but this still
uses more memory than previously.

With the re-introduction of variable-sized records for stack depot, we can
just switch back to non-evictable stack records again, and return back to
the previous performance and memory usage baseline.

Before (observed after a KASAN kernel boot):

  pools: 597
  refcounted_allocations: 17547
  refcounted_frees: 6477
  refcounted_in_use: 11070
  freelist_size: 3497
  persistent_count: 12163
  persistent_bytes: 1717008

After:

  pools: 319
  refcounted_allocations: 0
  refcounted_frees: 0
  refcounted_in_use: 0
  freelist_size: 0
  persistent_count: 29397
  persistent_bytes: 5183536

As can be seen from the counters, with a generic KASAN config, refcounted
allocations and evictions are no longer used.  Due to using variable-sized
records, I observe a reduction of 278 stack depot pools (saving 4448 KiB)
with my test setup.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cc478e0b6bdf ("kasan: avoid resetting aux_lock")
Fixes: 63b85ac56a64 ("kasan: stop leaking stack trace handles")
Fixes: 08d7c94d9635 ("kasan: memset free track in qlink_free")
Fixes: a414d4286f34 ("kasan: handle concurrent kasan_record_aux_stack calls")
Fixes: 773688a6cb24 ("kasan: use stack_depot_put for Generic mode")
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

stackdepot: use variable size records for non-evictable entries

With the introduction of stack depot evictions, each stack record is now
fixed size, so that future reuse after an eviction can safely store
differently sized stack traces.  In all cases that do not make use of
evictions, this wastes lots of space.

Fix it by re-introducing variable size stack records (up to the max
allowed size) for entries that will never be evicted.  We know if an entry
will never be evicted if the flag STACK_DEPOT_FLAG_GET is not provided,
since a later stack_depot_put() attempt is undefined behavior.

With my current kernel config that enables KASAN and also SLUB owner
tracking, I observe (after a kernel boot) a whopping reduction of 296
stack depot pools, which translates into 4736 KiB saved.  The savings here
are from SLUB owner tracking only, because KASAN generic mode still uses
refcounting.

Before:

  pools: 893
  allocations: 29841
  frees: 6524
  in_use: 23317
  freelist_size: 3454

After:

  pools: 597
  refcounted_allocations: 17547
  refcounted_frees: 6477
  refcounted_in_use: 11070
  freelist_size: 3497
  persistent_count: 12163
  persistent_bytes: 1717008

[[email protected]: fix -Wstringop-overflow warning]
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/admin-guide/mm/damon/usage: fix wrong quotas diabling condition

After the introduction of DAMOS quotas, DAMOS quotas is not disabled if
both size and time quotas are zero but the quota goal is set. The new
rule is also applied to DAMON sysfs interface, but the usage doc is not
updated. Update it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/mm/damon: move monitoring target regions setup detail from the usage to the design document

Design doc is aimed to have all concept level details, while the usage doc
is focused on only how the features can be used. Some details about
monitoring target regions construction is on the usage doc. Move the
details about the monitoring target regions construction differences for
DAMON operations set from the usage to the design doc.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/mm/damon: move DAMON operation sets list from the usage to the design document

The list of DAMON operation sets and their explanation, which may better
to be on design document, is written on the usage document. Move the
detail to design document and make the usage document only reference the
design document.

[[email protected]: fix a typo on a reference link]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/mm/damon: move the list of DAMOS actions to design doc

DAMOS operation actions are explained nearly twice on the DAMON usage
document, once for the sysfs interface, and then again for the debugfs
interface. Duplication is bad. Also it would better to keep this kind of
concept level details in design document and keep the usage document small
and focus on only the usage. Move the list to design document and update
usage document to reference it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Docs/mm/damon/maintainer-profile: fix reference links for mm-[un]stable tree

Patch series "Docs/mm/damon: misc readability improvements".

Fix trivial mistakes and improve layout of information on different
documents for DAMON.

This patch (of 5):

A couple of sentences on maintainer-profile.rst are having reference links
for mm-unstable and mm-stable trees with wrong rst markup. Fix those.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

userfaultfd: use per-vma locks in userfaultfd operations

All userfaultfd operations, except write-protect, opportunistically use
per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
critical section.

Write-protect operation requires mmap_lock as it iterates over multiple
vmas.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK

vma_assert_locked() is needed to replace mmap_assert_locked() once we
start using per-vma locks in userfaultfd operations.

In !CONFIG_PER_VMA_LOCK case when mm is locked, it implies that the given
VMA is locked.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx

Increments and loads to mmap_changing are always in mmap_lock critical
section. This ensures that if userspace requests event notification for
non-cooperative operations (e.g. mremap), userfaultfd operations don't
occur concurrently.

This can be achieved by using a separate read-write semaphore in
userfaultfd_ctx such that increments are done in write-mode and loads in
read-mode, thereby eliminating the dependency on mmap_lock for this
purpose.

This is a preparatory step before we replace mmap_lock usage with per-vma
locks in fill/move ioctls.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

userfaultfd: move userfaultfd_ctx struct to header file

Patch series "per-vma locks in userfaultfd", v7.

Performing userfaultfd operations (like copy/move etc.) in critical
section of mmap_lock (read-mode) causes significant contention on the lock
when operations requiring the lock in write-mode are taking place
concurrently.  We can use per-vma locks instead to significantly reduce
the contention issue.

Android runtime's Garbage Collector uses userfaultfd for concurrent
compaction.  mmap-lock contention during compaction potentially causes
jittery experience for the user.  During one such reproducible scenario,
we observed the following improvements with this patch-set:

- Wall clock time of compaction phase came down from ~3s to <500ms
- Uninterruptible sleep time (across all threads in the process) was
  ~10ms (none in mmap_lock) during compaction, instead of >20s

This patch (of 4):

Move the struct to userfaultfd_k.h to be accessible from mm/userfaultfd.c.
There are no other changes in the struct.

This is required to prepare for using per-vma locks in userfaultfd
operations.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

kasan: increase the number of bits to shift when recording extra timestamps

In 5d4c6ac94694 ("kasan: record and report more information") I thought
that printk only displays a maximum of 99999 seconds, but actually printk
can display a larger number of seconds.

So increase the number of bits to shift when recording the extra timestamp
(44 bits), without affecting the precision, shift it right by 9 bits,
discarding all bits that do not affect the microsecond part (nanoseconds
will not be shown).

Currently the maximum time that can be displayed is 9007199.254740s,
because

11111111111111111111111111111111111111111111 (44 bits) << 9
= 11111111111111111111111111111111111111111111000000000
= 9007199.254740

Link: https://lkml.kernel.org/r/AM6PR03MB58481629F2F28CE007412139994D2@AM6PR03MB5848.eurprd03.prod.outlook.com
Fixes: 5d4c6ac94694 ("kasan: record and report more information")
Signed-off-by: Juntong Deng <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

rmap: replace two calls to compound_order with folio_order

Removes two unnecessary conversions from folio to page. Should be no
difference in behaviour.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

dax: fix incorrect list of data cache aliasing architectures

commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased data cache.

This is a regression introduced in the v4.0 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

        if (ops && (IS_ENABLED(CONFIG_ARM) ||
            IS_ENABLED(CONFIG_MIPS) ||
            IS_ENABLED(CONFIG_SPARC)))
                return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Introduce cpu_dcache_is_aliasing() across all architectures

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios dependending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

dax: check for data cache aliasing at runtime

Replace the following fs/Kconfig:FS_DAX dependency:

depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

dcssblk: handle alloc_dax() -EOPNOTSUPP failure

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
never returns NULL.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4e4ced93794a ("dax: Move mandatory ->zero_page_range() check in alloc_dax()")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michael Sclafani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

dax: add empty static inline for CONFIG_DAX=n

Patch series "Introduce cpu_dcache_is_aliasing() to fix DAX regression",
v6.

This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")

Even though it used to work fine before.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

This patch (of 9):

When building a kernel with CONFIG_DAX=n, all uses of set_dax_nocache()
and set_dax_nomc() need to be either within regions of code or compile
units which are explicitly not compiled, or they need to rely on compiler
optimizations to eliminate calls to those undefined symbols.

It appears that at least the openrisc and loongarch architectures don't
end up eliminating those undefined symbols even if they are provably
within code which is eliminated due to conditional branches depending on
constants.

Implement empty static inline functions for set_dax_nocache() and
set_dax_nomc() in CONFIG_DAX=n to ensure those undefined references are
removed.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Fixes: 7ac5360cd4d0 ("dax: remove the copy_from_iter and copy_to_iter methods")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Michael Sclafani <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

nvdimm/pmem: fix leak on dax_add_host() failure

Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Fan Ni <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: automatically fold contpte mappings

There are situations where a change to a single PTE could cause the
contpte block in which it resides to become foldable (i.e.  could be
repainted with the contiguous bit).  Such situations arise, for example,
when user space temporarily changes protections, via mprotect, for
individual pages, such can be the case for certain garbage collectors.

We would like to detect when such a PTE change occurs.  However this can
be expensive due to the amount of checking required.  Therefore only
perform the checks when an indiviual PTE is modified via mprotect
(ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
when we are setting the final PTE in a contpte-aligned block.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: __always_inline to improve fork() perf

As set_ptes() and wrprotect_ptes() become a bit more complex, the compiler
may choose not to inline them.  But this is critical for fork()
performance.  So mark the functions, along with contpte_try_unfold() which
is called by them, as __always_inline.  This is worth ~1% on the fork()
microbenchmark with order-0 folios (the common case).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: implement pte_batch_hint()

When core code iterates over a range of ptes and calls ptep_get() for each
of them, if the range happens to cover contpte mappings, the number of pte
reads becomes amplified by a factor of the number of PTEs in a contpte
block.  This is because for each call to ptep_get(), the implementation
must read all of the ptes in the contpte block to which it belongs to
gather the access and dirty bits.

This causes a hotspot for fork(), as well as operations that unmap memory
such as munmap(), exit and madvise(MADV_DONTNEED).  Fortunately we can fix
this by implementing pte_batch_hint() which allows their iterators to skip
getting the contpte tail ptes when gathering the batch of ptes to operate
on.  This results in the number of PTE reads returning to 1 per pte.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: add pte_batch_hint() to reduce scanning in folio_pte_batch()

Some architectures (e.g. arm64) can tell from looking at a pte, if some
follow-on ptes also map contiguous physical memory with the same pgprot.
(for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that it
can skip these ptes when scanning to create a batch. By default, if an
arch does not opt-in, folio_pte_batch() returns a compile-time 1, so the
changes are optimized out and the behaviour is as before.

arm64 will opt-in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: John Hubbard <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: implement new [get_and_]clear_full_ptes() batch APIs

Optimize the contpte implementation to fix some of the
exit/munmap/dontneed performance regression introduced by the initial
contpte commit.  Subsequent patches will solve it entirely.

During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
cleared.  Previously this was done 1 PTE at a time.  But the core-mm
supports batched clear via the new [get_and_]clear_full_ptes() APIs.  So
let's implement those APIs and for fully covered contpte mappings, we no
longer need to unfold the contpte.  This significantly reduces unfolding
operations, reducing the number of tlbis that must be issued.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: implement new wrprotect_ptes() batch API

Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit.  Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected.
Previously this was done 1 PTE at a time.  But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API.  So let's implement
that API and for fully covered contpte mappings, we no longer need to
unfold the contpte.  This has 2 benefits:

  - reduced unfolding, reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
    continues to benefit from the more efficient use of the TLB after
    the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM in respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'.  See
section D21194 at https://developer.arm.com/documentation/102105/ja-07/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: wire up PTE_CONT for user mappings

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for user
mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit.  Any subsequent modification
of individual PTEs will cause an "unfold" operation to repaint the contpte
block as individual PTEs before performing the requested operation.
While, a modification of a single PTE could cause the block of PTEs to
which it belongs to become eligible for "folding" into a contpte entry,
"folding" is not performed in this initial implementation due to the costs
of checking the requirements are met.  Due to this, contpte mappings will
degrade back to normal pte mappings over time if/when protections are
changed.  This will be solved in a future patch.

Since a contpte block only has a single access and dirty bit, the semantic
here changes slightly; when getting a pte (e.g.  ptep_get()) that is part
of a contpte mapping, the access and dirty information are pulled from the
block (so all ptes in the block return the same access/dirty info).  When
changing the access/dirty info on a pte (e.g.  ptep_set_access_flags())
that is part of a contpte mapping, this change will affect the whole
contpte block.  This is works fine in practice since we guarantee that
only a single folio is mapped by a contpte block, and the core-mm tracks
access/dirty information per folio.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols that
are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time.  It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled.  The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: dplit __flush_tlb_range() to elide trailing DSB

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement.  This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB.  Forthcoming "contpte" code will take advantage of this when
clearing the young bit from a contiguous range of ptes.

Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs() has
changed, but now aligns with the ordering of __flush_tlb_page().  It has
been discussed that __flush_tlb_page() may be wrong though.  Regardless,
both will be resolved separately if needed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: new ptep layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs.  For now, The
existing API is prefixed with double underscore to become the arch-private
API and the public API is just a simple wrapper that calls the private
API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate.  But since there are
already some contig-aware users (e.g.  hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

The following APIs are treated this way:

- ptep_get
- set_pte
- set_ptes
- pte_clear
- ptep_get_and_clear
- ptep_test_and_clear_young
- ptep_clear_flush_young
- ptep_set_wrprotect
- ptep_set_access_flags

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: convert ptep_clear() to ptep_get_and_clear()

ptep_clear() is a generic wrapper around the arch-implemented
ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into a
public version and private version (__ptep_get_and_clear()) to support the
transparent contpte work. We won't have a private version of ptep_clear()
so let's convert it to directly call ptep_get_and_clear().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: convert set_pte_at() to set_ptes(..., 1)

Since set_ptes() was introduced, set_pte_at() has been implemented as a
generic macro around set_ptes(..., 1).  So this change should continue to
generate the same code.  However, making this change prepares us for the
transparent contpte support.  It means we can reroute set_ptes() to
__set_ptes().  Since set_pte_at() is a generic macro, there will be no
equivalent __set_pte_at() to reroute to.

Note that a couple of calls to set_pte_at() remain in the arch code.  This
is intentional, since those call sites are acting on behalf of core-mm and
should continue to call into the public set_ptes() rather than the
arch-private __set_ptes().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep)

There are a number of places in the arch code that read a pte by using the
READ_ONCE() macro.  Refactor these call sites to instead use the
ptep_get() helper, which itself is a READ_ONCE().  Generated code should
be the same.

This will benefit us when we shortly introduce the transparent contpte
support.  In this case, ptep_get() will become more complex so we now have
all the code abstracted through it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: John Hubbard <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: tidy up pte_next_pfn() definition

Now that the all architecture overrides of pte_next_pfn() have been
replaced with pte_advance_pfn(), we can simplify the definition of the
generic pte_next_pfn() macro so that it is unconditionally defined.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

x86/mm: convert pte_next_pfn() to pte_advance_pfn()

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arm64/mm: convert pte_next_pfn() to pte_advance_pfn()

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: introduce pte_advance_pfn() and use for pte_next_pfn()

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param. Define the default
implementation here and allow for architectures to override.
pte_next_pfn() becomes a wrapper around pte_advance_pfn().

Follow up commits will convert each overriding architecture's
pte_next_pfn() to pte_advance_pfn().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: thp: batch-collapse PMD with set_ptes()

Refactor __split_huge_pmd_locked() so that a present PMD can be collapsed
to PTEs in a single batch using set_ptes().

This should improve performance a little bit, but the real motivation is
to remove the need for the arm64 backend to have to fold the contpte
entries. Instead, since the ptes are set as a batch, the contpte blocks
can be initially set up pre-folded (once the arm64 contpte support is
added in the next few patches). This leads to noticeable performance
improvement during split.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: clarify the spec for set_ptes()

Patch series "Transparent Contiguous PTEs for User Mappings", v6.

This is a series to opportunistically and transparently use contpte
mappings (set the contiguous bit in ptes) for user memory when those
mappings meet the requirements.  The change benefits arm64, but there is
some (very) minor refactoring for x86 to enable its integration with
core-mm.

It is part of a wider effort to improve performance by allocating and
mapping variable-sized blocks of memory (folios).  One aim is for the 4K
kernel to approach the performance of the 16K kernel, but without breaking
compatibility and without the associated increase in memory.  Another aim
is to benefit the 16K and 64K kernels by enabling 2M THP, since this is
the contpte size for those kernels.  We have good performance data that
demonstrates both aims are being met (see below).

Of course this is only one half of the change.  We require the mapped
physical memory to be the correct size and alignment for this to actually
be useful (i.e.  64K for 4K pages, or 2M for 16K/64K pages).  Fortunately
folios are solving this problem for us.  Filesystems that support it (XFS,
AFS, EROFS, tmpfs, ...) will allocate large folios up to the PMD size
today, and more filesystems are coming.  And for anonymous memory,
"multi-size THP" is now upstream.

Patch Layout
============

In this version, I've split the patches to better show each optimization:

  - 1-2:    mm prep: misc code and docs cleanups
  - 3-6:    mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
            generic wrapper around it
  - 7-11:   arm64 prep: Refactor ptep helpers into new layer
  - 12:     functional contpte implementation
  - 23-18:  various optimizations on top of the contpte implementation

Testing
=======

I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
  - mm selftests (inc new tests written for multi-size THP); no regressions
  - Speedometer Java script benchmark in Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues

Performance
===========

High Level Use Cases
~~~~~~~~~~~~~~~~~~~~

First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).

baseline:                  mm-unstable (mTHP switched off)
mTHP:                      + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:            + this series
mTHP + contpte + exefolio: + patch at [6], which series supports

Kernel Compilation with -j8 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -5.0% |    -39.1% |     -0.7% |
| mTHP + contpte            |     -6.0% |    -41.4% |     -1.5% |
| mTHP + contpte + exefolio |     -7.8% |    -43.1% |     -3.4% |

Kernel Compilation with -j80 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -5.0% |    -36.6% |     -0.6% |
| mTHP + contpte            |     -6.1% |    -38.2% |     -1.6% |
| mTHP + contpte + exefolio |     -7.4% |    -39.2% |     -3.2% |

Speedometer (positive is faster):

| kernel                    | runs_per_min |
|:--------------------------|--------------|
| baseline                  |         0.0% |
| mTHP                      |         1.5% |
| mTHP + contpte            |         3.2% |
| mTHP + contpte + exefolio |         4.5% |

Micro Benchmarks
~~~~~~~~~~~~~~~~

The following microbenchmarks are intended to demonstrate the performance of
fork() and munmap() do not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.

baseline:                  mm-unstable + batch zap [7] series
contpte-basic:             + patches 0-19; functional contpte implementation
contpte-batch:             + patches 20-23; implement new batched APIs
contpte-inline:            + patch 24; __always_inline to help compiler
contpte-fold:              + patch 25; fold contpte mapping when sensible

Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
(on top of MacOS) for reference, although experience suggests this might not be
the most reliable for performance numbers of this sort:

| FORK           |         order-0        |         order-9        |
| Ampere Altra   |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      2.7% |       0.0% |      0.2% |
| contpte-basic  |       6.3% |      1.4% |    1948.7% |      0.2% |
| contpte-batch  |       7.6% |      2.0% |      -1.9% |      0.4% |
| contpte-inline |       3.6% |      1.5% |      -1.0% |      0.2% |
| contpte-fold   |       4.6% |      2.1% |      -1.8% |      0.2% |

| MUNMAP         |         order-0        |         order-9        |
| Ampere Altra   |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      0.5% |       0.0% |      0.3% |
| contpte-basic  |       1.8% |      0.3% |    1104.8% |      0.1% |
| contpte-batch  |      -0.3% |      0.4% |       2.7% |      0.1% |
| contpte-inline |      -0.1% |      0.6% |       0.9% |      0.1% |
| contpte-fold   |       0.1% |      0.6% |       0.8% |      0.1% |

| FORK           |         order-0        |         order-9        |
| Apple M2 VM    |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      1.4% |       0.0% |      0.8% |
| contpte-basic  |       6.8% |      1.2% |     469.4% |      1.4% |
| contpte-batch  |      -7.7% |      2.0% |      -8.9% |      0.7% |
| contpte-inline |      -6.0% |      2.1% |      -6.0% |      2.0% |
| contpte-fold   |       5.9% |      1.4% |      -6.4% |      1.4% |

| MUNMAP         |         order-0        |         order-9        |
| Apple M2 VM    |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      0.6% |       0.0% |      0.4% |
| contpte-basic  |       1.6% |      0.6% |     233.6% |      0.7% |
| contpte-batch  |       1.9% |      0.3% |      -3.9% |      0.4% |
| contpte-inline |       2.2% |      0.8% |      -1.6% |      0.9% |
| contpte-fold   |       1.5% |      0.7% |      -1.7% |      0.7% |

Misc
~~~~

John Hubbard at Nvidia has indicated dramatic 10x performance improvements
for some workloads at [8], when using 64K base page kernel.

[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299 [email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287 [email protected]/
[3] https://lore.kernel.org/linux-arm-kernel/20231204105440 [email protected]/
[4] https://lore.kernel.org/lkml/20231218105100 [email protected]/
[5] https://lore.kernel.org/linux-mm/633af0a7-0823-424f-b6ef-374d99483f05@arm.com/
[6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@arm.com/
[7] https://lore.kernel.org/linux-mm/20240214204435 [email protected]/
[8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
[9] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v6

This patch (of 18):

set_ptes() spec implies that it can only be used to set a present pte
because it interprets the PFN field to increment it.  However,
set_pte_at() has been implemented on top of set_ptes() since set_ptes()
was introduced, and set_pte_at() allows setting a pte to a not-present
state.  So clarify the spec to state that when nr==1, new state of pte may
be present or not present.  When nr>1, new state of all ptes must be
present.

While we are at it, tighten the spec to set requirements around the
initial state of ptes; when nr==1 it may be either present or not-present.
But when nr>1 all ptes must initially be not-present.  All set_ptes()
callsites already conform to this requirement.  Stating it explicitly is
useful because it allows for a simplification to the upcoming arm64
contpte implementation.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/memory: optimize unmap/zap with PTE-mapped THP

Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there.  We only have to add get_and_clear_full_ptes() and
clear_full_ptes().  Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process.  If there is a real
problem hiding somewhere, we can trigger it simply by using small folios,
or when we zap single pages of a large folio.  Ideally, we had that check
in rmap code (including for delayed rmap), but then we cannot print the
PTE.  Let's keep it simple for now.  If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.

To keep small folios as fast as possible force inlining of a specialized
variant using __always_inline with nr=1.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
up to 256 folio fragments that span more than one page, before we
conditionally reschedule.

It's a pain that we have to handle cond_resched() in
tlb_batch_pages_flush() manually and cannot simply handle it in
release_pages() -- release_pages() can be called from atomic context.
Well, in a perfect world we wouldn't have to make our code more
complicated at all.

With page poisoning and init_on_free, we might now run into soft lockups
when we free a lot of rather large folio fragments, because page freeing
time then depends on the actual memory size we are freeing instead of on
the number of folios that are involved.

In the absolute (unlikely) worst case, on arm64 with 64k we will be able
to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
GiB does sound like it might take a while. But instead of ignoring this
unlikely case, let's just handle it.

So, let's teach tlb_batch_pages_flush() that there are some configurations
where page freeing is horribly slow, and let's reschedule more frequently
-- similarly like we did for now before we had large folio fragments in
there. Avoid yet another loop over all encoded pages in the common case
by handling that separately.

Note that with page poisoning/zeroing, we might now end up freeing only a
single folio fragment at a time that might exceed the old 512 pages limit:
but if we cannot even free a single MAX_ORDER page on a system without
running into soft lockups, something else is already completely bogus.
Freeing a PMD-mapped THP would similarly cause trouble.

In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
effectively having to zero out 8703 pages on arm64 with 64k, translating
to ~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups,
544 MiB is unlikely to result in soft lockups, so we won't care about that
for the time being.

In the future, we might want to detect if handling cond_resched() is
required at all, and just not do any of that with full preemption enabled.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmu_gather: add __tlb_remove_folio_pages()

Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single page.
We'll be using this function when optimizing unmapping/zapping of large
folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next enoced page in an array contains actually shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios as
we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to
the same large folio.

Note that we don't pass the folio pointer, because it is not required for
now.  Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straight forward.

Another, more invasive and likely more expensive, approach would be to use
folio+range or a PFN range instead of page+nr_pages.  But, we should do
that consistently for the whole mmu_gather.  For now, let's keep it simple
and add "nr_pages" only.

Note that it is now possible to gather significantly more pages: In the
past, we were able to gather ~10000 pages, now we can also gather ~5000
folio fragments that span multiple pages.  A folio fragment on x86-64 can
span up to 512 pages (2 MiB THP) and on arm64 with 64k in theory 8192
pages (512 MiB THP).  Gathering more memory is not considered something we
should worry about, especially because these are already corner cases.

While we can gather more total memory, we won't free more folio fragments.
As long as page freeing time primarily only depends on the number of
involved folios, there is no effective change for !preempt configurations.
However, we'll adjust tlb_batch_pages_flush() separately to handle corner
cases where page freeing time grows proportionally with the actual memory
size.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmu_gather: add tlb_remove_tlb_entries()

Let's add a helper that lets us batch-process multiple consecutive PTEs.

Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries()
a macro).

[[email protected]: change __tlb_remove_tlb_entry() to an inline function]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

Nowadays, encoded pages are only used in mmu_gather handling.  Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP.  While
at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.

If encoded page pointers would ever be used in other context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP).  For now, let's keep it simple.

This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

We have two bits available in the encoded page pointer to store additional
information. Currently, we use one bit to request delay of the rmap
removal until after a TLB flush.

We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap
flag instead of an encoded page, to handle the encoding internally.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/memory: factor out zapping folio pte into zap_present_folio_pte()

Let's prepare for further changes by factoring it out into a separate
function.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

We don't need up-to-date accessed-dirty information for anon folios and
can simply work with the ptent we already have. Also, we know the RSS
counter we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>