Git Repo - linux.git/log

mm: change inlined allocation helpers to account at the call site

Main goal of memory allocation profiling patchset is to provide accounting
that is cheap enough to run in production.  To achieve that we inject
counters using codetags at the allocation call sites to account every time
allocation is made.  This injection allows us to perform accounting
efficiently because injected counters are immediately available as opposed
to the alternative methods, such as using _RET_IP_, which would require
counter lookup and appropriate locking that makes accounting much more
expensive.  This method requires all allocation functions to inject
separate counters at their call sites so that their callers can be
individually accounted.  Counter injection is implemented by allocation
hooks which should wrap all allocation functions.

Inlined functions which perform allocations but do not use allocation
hooks are directly charged for the allocations they perform.  In most
cases these functions are just specialized allocation wrappers used from
multiple places to allocate objects of a specific type.  It would be more
useful to do the accounting at their call sites instead.  Instrument these
helpers to do accounting at the call site.  Simple inlined allocation
wrappers are converted directly into macros.  More complex allocators or
allocators with documentation are converted into _noprof versions and
allocation hooks are added.  This allows memory allocation profiling
mechanism to charge allocations to the callers of these functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Jan Kara <[email protected]> [jbd2]
Cc: Anna Schumaker <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Tissoires <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Jakub Sitnicki <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

memprofiling: documentation

Provide documentation for memory allocation profiling.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

MAINTAINERS: add entries for code tagging and memory allocation profiling

The new code & libraries added are being maintained - mark them as such.

[[email protected]: MAINTAINERS: improve entries]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Lukas Bulwahn <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations

If slabobj_ext vector allocation for a slab object fails and later on it
succeeds for another object in the same slab, the slabobj_ext for the
original object will be NULL and will be flagged in case when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.

Mark failed slabobj_ext vector allocations using a new objext_flags flag
stored in the lower bits of slab->obj_exts. When new allocation succeeds
it marks all tag references in the same slabobj_ext vector as empty to
avoid warnings implemented by CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

codetag: debug: mark codetags for reserved pages as empty

To avoid debug warnings while freeing reserved pages which were not
allocated with usual allocators, mark their codetags as empty before
freeing.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

codetag: debug: skip objext checking when it's for objext itself

objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have
no corresponding objext themselves (otherwise we would get an infinite
recursion). When freeing these objects their codetag will be empty and
when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false
warnings. Introduce CODETAG_EMPTY special codetag value to mark
allocations which intentionally lack codetag to avoid these warnings.
Set objext codetags to CODETAG_EMPTY before freeing to indicate that
the codetag is expected to be empty.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: add memory allocations report in show_mem()

Include allocations in show_mem reports.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

rhashtable: plumb through alloc tag

This gives better memory allocation profiling results; rhashtable
allocations will be accounted to the code that initialized the rhashtable.

[[email protected]: undo _noprof additions in the documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: vmalloc: enable memory allocation profiling

This wrapps all external vmalloc allocation functions with the
alloc_hooks() wrapper, and switches internal allocations to _noprof
variants where appropriate, for the new memory allocation profiling
feature.

[[email protected]: arch/um: fix forward declaration for vmalloc]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: undo _noprof additions in the documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: percpu: enable per-cpu allocation tagging

Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu
to record allocations and deallocations done by these functions.

[[email protected]: undo _noprof additions in the documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: percpu: add codetag reference into pcpuobj_ext

To store codetag for every per-cpu allocation, a codetag reference is
embedded into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Hooks to use
the newly introduced codetag are added.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: percpu: introduce pcpuobj_ext

Upcoming alloc tagging patches require a place to stash per-allocation
metadata.

We already do this when memcg is enabled, so this patch generalizes the
obj_cgroup * vector in struct pcpu_chunk by creating a pcpu_obj_ext type,
which we will be adding to in an upcoming patch - similarly to the
previous slabobj_ext patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mempool: hook up to memory allocation profiling

This adds hooks to mempools for correctly annotating mempool-backed
allocations at the correct source line, so they show up correctly in
/sys/kernel/debug/allocations.

Various inline functions are converted to wrappers so that we can invoke
alloc_hooks() in fewer places.

[[email protected]: undo _noprof additions in the documentation]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: add missing mempool_create_node documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/slab: enable slab allocation tagging for kmalloc and friends

Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
and deallocations done by these functions.

[[email protected]: undo _noprof additions in the documentation]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix kcalloc() kernel-doc warnings]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Randy Dunlap <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

rust: add a rust helper for krealloc()

Memory allocation profiling is turning krealloc() into a nontrivial macro
- so for now, we need a helper for it.

Until we have proper support on the rust side for memory allocation
profiling this does mean that all Rust allocations will be accounted to
the helper.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Alice Ryhl <[email protected]>
Acked-by: Miguel Ojeda <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/slab: add allocation accounting into slab allocation and free paths

Account slab allocations using codetag reference embedded into slabobj_ext.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: add codetag reference into slabobj_ext

To store code tag for every slab object, a codetag reference is embedded
into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/page_ext: enable early_page_ext when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y

For all page allocations to be tagged, page_ext has to be initialized
before the first page allocation.  Early tasks allocate their stacks using
page allocator before alloc_node_page_ext() initializes page_ext area,
unless early_page_ext is enabled.  Therefore these allocations will
generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to
ensure page_ext initialization prior to any page allocation.  This will
have all the negative effects associated with early_page_ext, such as
possible longer boot time, therefore we enable it only when debugging with
CONFIG_MEM_ALLOC_PROFILING_DEBUG enabled and not universally for
CONFIG_MEM_ALLOC_PROFILING.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: fix non-compound multi-order memory accounting in __free_pages

When a non-compound multi-order page is freed, it is possible that a
speculative reference keeps the page pinned.  In this case we free all
pages except for the first page, which will be freed later by the last
put_page().  However the page passed to put_page() is indistinguishable
from an order-0 page, so it cannot do the accounting, just as it cannot
free the subsequent pages.  Do the accounting here, where we free the
pages.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Vlastimil Babka <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: create new codetag references during page splitting

When a high-order page is split into smaller ones, each newly split page
should get its codetag. After the split each split page will be
referencing the original codetag. The codetag's "bytes" counter remains
the same because the amount of allocated memory has not changed, however
the "calls" counter gets increased to keep the counter correct when these
individual pages get freed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: enable page allocation tagging

Redefine page allocators to record allocation tags upon their invocation.
Instrument post_alloc_hook and free_pages_prepare to modify current
allocation tag.

[[email protected]: undo _noprof additions in the documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

change alloc_pages name in dma_map_ops to avoid name conflicts

After redefining alloc_pages, all uses of that name are being replaced.
Change the conflicting names to prevent preprocessor from replacing them
when it's not intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: percpu: increase PERCPU_MODULE_RESERVE to accommodate allocation tags

As each allocation tag generates a per-cpu variable, more space is
required to store them. Increase PERCPU_MODULE_RESERVE to provide enough
area. A better long-term solution would be to allocate this memory
dynamically.

[[email protected]: increase PERCPU_MODULE_RESERVE to accommodate allocation tags]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: introduce early boot parameter to avoid page_ext memory overhead

The highest memory overhead from memory allocation profiling comes from
page_ext objects.  This overhead exists even if the feature is disabled
but compiled-in.  To avoid it, introduce an early boot parameter that
prevents page_ext object creation.  The new boot parameter is a tri-state
with possible values of 0|1|never.  When it is set to "never" the memory
allocation profiling support is disabled, and overhead is minimized
(currently no page_ext objects are allocated, in the future more overhead
might be eliminated).  As a result we also lose ability to enable memory
allocation profiling at runtime (because there is no space to store
alloctag references).  Runtime sysctrl becomes read-only if the early boot
parameter was set to "never".  Note that the default value of this boot
parameter depends on the CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
configuration.  When CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n the
boot parameter is set to "never", therefore eliminating any overhead.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y results in boot parameter
being set to 1 (enabled).  This allows distributions to avoid any overhead
by setting CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n config and with
no changes to the kernel command line.

We reuse sysctl.vm.mem_profiling boot parameter name in order to avoid
introducing yet another control.  This change turns it into a tri-state
early boot parameter.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: introduce support for page allocation tagging

Introduce helper functions to easily instrument page allocators by storing
a pointer to the allocation tag associated with the code that allocated
the page in a page_ext field.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: add allocation tagging support for memory allocation profiling

Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
instrument memory allocators. It registers an "alloc_tags" codetag type
with /proc/allocinfo interface to output allocation tag information when
the feature is enabled.

CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.

Memory allocation profiling can be enabled or disabled at runtime using
/proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
profiling by default.

[[email protected]: Documentation/filesystems/proc.rst: fix allocinfo title]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: explicitly include irqflags.h in alloc_tag.h]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Klara Modin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: prevent module unloading if memory is not freed

Skip freeing module's data section if there are non-zero allocation tags
because otherwise, once these allocations are freed, the access to their
code tag would cause UAF.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: code tagging module support

Add support for code tagging from dynamically loaded modules.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib: code tagging framework

Add basic infrastructure to support code tagging which stores tag common
information consisting of the module name, function, file name and line
number. Provide functions to register a new code tag type and navigate
between code tags.

Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

slab: objext: introduce objext_flags as extension to page_memcg_data_flags

Introduce objext_flags to store additional objext flags unrelated to memcg.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation

Slab extension objects can't be allocated before slab infrastructure is
initialized.  Some caches, like kmem_cache and kmem_cache_node, are
created before slab infrastructure is initialized.  Objects from these
caches can't have extension objects.  Introduce SLAB_NO_OBJ_EXT slab flag
to mark these caches and avoid creating extensions for objects allocated
from these slabs.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creation

Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
when allocating slabobj_ext on a slab.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: introduce slabobj_ext to support slab object extensions

Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce slabobj_ext structure to allow more data to
be stored for each slab object. Wrap obj_cgroup into slabobj_ext to
support current functionality while allowing to extend slabobj_ext in the
future.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

fs: convert alloc_inode_sb() to a macro

We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be tracked by its caller, which is a bit more useful.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Alexander Viro <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

scripts/kallysms: always include __start and __stop symbols

These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/slub: mark slab_free_freelist_hook() __always_inline

It seems we need to be more forceful with the compiler on this one. This
is done for performance reasons only.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

asm-generic/io.h: kill vmalloc.h dependency

Needed to avoid a new circular dependency with the memory allocation
profiling series.

Naturally, a whole bunch of files needed to include vmalloc.h that were
previously getting it implicitly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

fix missing vmalloc.h includes

Patch series "Memory allocation profiling", v6.

Overview:
Low overhead [1] per-callsite memory allocation profiling. Not just for
debug kernels, overhead low enough to be deployed in production.

Example output:
  root@moria-kvm:~# sort -rn /proc/allocinfo
   127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
    56373248     4737 mm/slub.c:2259 func:alloc_slab_page
    14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
    14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
    13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
    11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
     9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
     4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
     4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
     3940352      962 mm/memory.c:4214 func:alloc_anon_folio
     2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
     ...

Usage:
kconfig options:
- CONFIG_MEM_ALLOC_PROFILING
- CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
- CONFIG_MEM_ALLOC_PROFILING_DEBUG
   adds warnings for allocations that weren't accounted because of a
   missing annotation

sysctl:
  /proc/sys/vm/mem_profiling

Runtime info:
  /proc/allocinfo

Notes:

[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)  && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:

                        kmalloc                 pgalloc
(1 baseline)            6.764s                  16.902s
(2 default disabled)    6.793s  (+0.43%)        17.007s (+0.62%)
(3 default enabled)     7.197s  (+6.40%)        23.666s (+40.02%)
(4 runtime enabled)     7.405s  (+9.48%)        23.901s (+41.41%)
(5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
(6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
(7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)

Memory overhead:
Kernel size:

   text           data        bss         dec         diff
(1) 26515311       18890222    17018880    62424413
(2) 26524728       19423818    16740352    62688898    264485
(3) 26524724       19423818    16740352    62688894    264481
(4) 26524728       19423818    16740352    62688898    264485
(5) 26541782       18964374    16957440    62463596    39183

Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags:           192 kB
PageExts:         262144 kB (256MB)
SlabExts:           9876 kB (9.6MB)
PcpuExts:            512 kB (0.5MB)

Total overhead is 0.2% of total memory.

Benchmarks:

Hackbench tests run 100 times:
hackbench -s 512 -l 200 -g 15 -f 25 -P
      baseline       disabled profiling           enabled profiling
avg   0.3543         0.3559 (+0.0016)             0.3566 (+0.0023)
stdev 0.0137         0.0188                       0.0077

hackbench -l 10000
      baseline       disabled profiling           enabled profiling
avg   6.4218         6.4306 (+0.0088)             6.5077 (+0.0859)
stdev 0.0933         0.0286                       0.0489

stress-ng tests:
stress-ng --class memory --seq 4 -t 60
stress-ng --class cpu --seq 4 -t 60
Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/

[2] https://lore.kernel.org/all/20240306182440.2003814 [email protected]/

This patch (of 37):

The next patch drops vmalloc.h from a system header in order to fix a
circular dependency; this adds it to all the files that were pulling it in
implicitly.

[[email protected]: fix arch/alpha/lib/memcpy.c]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix arch/x86/mm/numa_32.c]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: a few places were depending on sizes.h]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix mm/kasan/hw_tags.c]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix arc build]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

scripts/kernel-doc: drop "_noprof" on function prototypes

Memory profiling introduces macros as hooks for function-level allocation
profiling[1].  Memory allocation functions that are profiled are named
like xyz_alloc() for API access to the function.  xyz_alloc() then calls
xyz_alloc_noprof() to do the allocation work.

The kernel-doc comments for the memory allocation functions are introduced
with the xyz_alloc() function names but the function implementations are
the xyz_alloc_noprof() names.  This causes kernel-doc warnings for
mismatched documentation and function prototype names.  By dropping the
"_noprof" part of the function name, the kernel-doc function name matches
the function prototype name, so the warnings are resolved.

[1] https://lore.kernel.org/all/20240321163705.3067592 [email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Tested-by: Suren Baghdasaryan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

percpu: clean up all mappings when pcpu_map_pages() fails

In pcpu_map_pages(), if __pcpu_map_pages() fails on a CPU, we call
__pcpu_unmap_pages() to clean up mappings on all CPUs where mappings were
created, but not on the CPU where __pcpu_map_pages() fails.

__pcpu_map_pages() and __pcpu_unmap_pages() are wrappers around
vmap_pages_range_noflush() and vunmap_range_noflush(). All other callers
of vmap_pages_range_noflush() call vunmap_range_noflush() when mapping
fails, except pcpu_map_pages(). The reason could be that partial mappings
may be left behind from a failed mapping attempt.

Call __pcpu_unmap_pages() for the failed CPU as well in pcpu_map_pages().

This was found by code inspection, no failures or bugs were observed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy

commit bda420b98505 ("numa balancing: migrate on fault among multiple
bound nodes") added support for migrate on protnone reference with
MPOL_BIND memory policy.  This allowed numa fault migration when the
executing node is part of the policy mask for MPOL_BIND.  This patch
extends migration support to MPOL_PREFERRED_MANY policy.

Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
MPOL_F_NUMA_BALANCING.  This causes issues when we want to use
NUMA_BALANCING_MEMORY_TIERING.  To effectively use the slow memory tier,
the kernel should not allocate pages from the slower memory tier via
allocation control zonelist fallback.  Instead, we should move cold pages
from the faster memory node via memory demotion.  For a page allocation,
kswapd is only woken up after we try to allocate pages from all nodes in
the allocation zone list.  This implies that, without using memory
policies, we will end up allocating hot pages in the slower memory tier.

MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
allocation control when we have memory tiers in the system.  With
MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
of faster memory nodes.  When we fail to allocate pages from the faster
memory node, kswapd would be woken up, allowing demotion of cold pages to
slower memory nodes.

With the current kernel, such usage of memory policies implies we can't do
page promotion from a slower memory tier to a faster memory tier using
numa fault.  This patch fixes this issue.

For MPOL_PREFERRED_MANY, if the executing node is in the policy node mask,
we allow numa migration to the executing nodes.  If the executing node is
not in the policy node mask, we do not allow numa migration.

Example:
On a 2-sockets system, NUMA node N0, N1 and N2 are in socket 0,
N3 in socket 1. N0, N1 and N3 have fast memory and CPU, while
N2 has slow memory and no CPU.  For a workload, we may use
MPOL_PREFERRED_MANY with nodemask N0 and N1 set because the workload
runs on CPUs of socket 0 at most times. Then, even if the workload
runs on CPUs of N3 occasionally, we will not try to migrate the workload
pages from N2 to N3 because users may want to avoid cross-socket access
as much as possible in the long term.

In below table, Process is the Process executing node and
Curr Loc Pgs is the numa node where page present(folio node)
===========================================================
Process  Policy  Curr Loc Pgs     Observation
-----------------------------------------------------------
N0       N0 N1      N1         Pages Migrated from N1 to N0
N0       N0 N1      N2         Pages Migrated from N2 to N0
N0       N0 N1      N3        Pages Migrated from N3 to N0

N3       N0 N1      N0         Pages NOT Migrated  to N3
N3       N0 N1      N1         Pages NOT Migrated  to N3
N3       N0 N1      N2        Pages NOT Migrated  to N3
------------------------------------------------------------

Link: https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]>
Signed-off-by: Donet Tom <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/mempolicy: use numa_node_id() instead of cpu_to_node()

Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
policy:, v4.

This patchset is to optimize the cross-socket memory access with
MPOL_PREFERRED_MANY policy.

To test this patch we ran the following test on a 3 node system.
Node 0 - 2GB   - Tier 1
Node 1 - 11GB  - Tier 1
Node 6 - 10GB  - Tier 2

Below changes are made to memcached to set the memory policy,
It select Node0 and Node1 as preferred nodes.

   #include <numaif.h>
   #include <numa.h>

    unsigned long nodemask;
    int ret;

    nodemask = 0x03;
    ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
                                               &nodemask, 10);
    /* If MPOL_F_NUMA_BALANCING isn't supported,
     * fall back to MPOL_PREFERRED_MANY */
    if (ret < 0 && errno == EINVAL){
       printf("set mem policy normal\n");
        ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
    }
    if (ret < 0) {
       perror("Failed to call set_mempolicy");
       exit(-1);
    }

Test Procedure:
===============
1. Make sure memory tiring and demotion are enabled.
2. Start memcached.

   # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
       -d -s "/tmp/memcached.sock"

3. Run memtier_benchmark to store 3200000 keys.

  #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
    --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
    --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024

4. Start a memory eater on node 0 and 1. This will demote all memcached
   pages to node 6.
5. Make sure all the memcached pages got demoted to lower tier by reading
   /proc/<memcaced PID>/numa_maps.

    # cat /proc/2771/numa_maps
     ---
    default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
    default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
     ---

6. Kill memory eater.
7. Read the pgpromote_success counter.
8. Start reading the keys by running memtier_benchmark.

  #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
   --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
   --key-minimum=1 --key-maximum=3200000 -n allkeys
   --threads=64 -c 1 -R -x 6

9. Read the pgpromote_success counter.

Test Results:
=============
Without Patch
------------------
1. pgpromote_success  before test
Node 0:  pgpromote_success 11
Node 1:  pgpromote_success 140974

pgpromote_success  after test
Node 0:  pgpromote_success 11
Node 1:  pgpromote_success 140974

2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
==================================================================
Type    Ops/sec   Hits/sec   Misses/sec  Avg. Latency  p50 Latency
------------------------------------------------------------------
Sets     0.00       ---         ---        ---          ---
Gets    305792.03  305791.93   0.10       0.18949       0.16700
Waits    0.00       ---         ---        ---          ---
Totals  305792.03  305791.93   0.10       0.18949       0.16700

======================================
p99 Latency  p99.9 Latency  KB/sec
-------------------------------------
---          ---            0.00
0.44700     1.71100        11542.69
---           ---            ---
0.44700     1.71100        11542.69

With Patch
---------------
1. pgpromote_success  before test
Node 0:  pgpromote_success 5
Node 1:  pgpromote_success 89386

pgpromote_success  after test
Node 0:  pgpromote_success 57895
Node 1:  pgpromote_success 141463

2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
====================================================================
Type    Ops/sec    Hits/sec  Misses/sec  Avg. Latency  p50 Latency
--------------------------------------------------------------------
Sets     0.00        ---       ---        ---           ---
Gets    521942.24  521942.07  0.17       0.11459        0.10300
Waits    0.00        ---       ---         ---          ---
Totals  521942.24  521942.07  0.17       0.11459        0.10300

=======================================
p99 Latency  p99.9 Latency  KB/sec
---------------------------------------
---          ---            0.00
0.23100      0.31900        19701.68
---          ---             ---
0.23100      0.31900        19701.68

Test Result Analysis:
=====================
1. With patch we could observe pages are getting promoted.
2. Memtier-benchmark results shows that, with the patch,
   performance has increased more than 50%.

Ops/sec without fix -  305792.03
Ops/sec with fix    -  521942.24

This patch (of 2):

Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is
quicker.  smp_processor_id is guaranteed to be stable in the
'mpol_misplaced()' function because it is called with ptl held.
lockdep_assert_held was added to ensure that.

No functional change in this patch.

[[email protected]: add "* @vmf: structure describing the fault" comment]
Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]>
Signed-off-by: Donet Tom <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zswap: remove unnecessary check in zswap_find_zpool()

zswap_find_zpool() checks if ZSWAP_NR_ZPOOLS > 1, which is always true.
This is a remnant from a patch version that had ZSWAP_NR_ZPOOLS as a
config option and never made it upstream. Remove the unnecessary check.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib/test_hmm.c: handle src_pfns and dst_pfns allocation failure

The kcalloc() in dmirror_device_evict_chunk() will return null if the
physical memory has run out.  As a result, if src_pfns or dst_pfns is
dereferenced, the null pointer dereference bug will happen.

Moreover, the device is going away.  If the kcalloc() fails, the pages
mapping a chunk could not be evicted.  So add a __GFP_NOFAIL flag in
kcalloc().

Finally, as there is no need to have physically contiguous memory, Switch
kcalloc() to kvcalloc() in order to avoid failing allocations.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM")
Signed-off-by: Duoming Zhou <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zpool: return pool size in pages

All zswap backends track their pool sizes in pages. Currently they
multiply by PAGE_SIZE for zswap, only for zswap to divide again in order
to do limit math. Report pages directly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zswap: optimize zswap pool size tracking

Profiling the munmap() of a zswapped memory region shows 60% of the total
cycles currently going into updating the zswap_pool_total_size.

There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle

Instead of aggregating everytime an entry enters or exits the zswap
pool, aggregate the value from the zpools on-demand:

- Stores aggregate the counter anyway upon success. Aggregating to
  check the limit instead is the same amount of work.

- Meminfo & debugfs might benefit somewhat from a pre-aggregated
  counter, but aren't exactly hotpaths.

- Shrinking can aggregate once for every cycle instead of doing it for
  every freed entry. As the shrinker might work on tens or hundreds of
  objects per scan cycle, this is a large reduction in aggregations.

The paths that benefit dramatically are swapin, swapoff, and unmaps.
There could be millions of pages being processed until somebody asks for
the pool size again.  This eliminates the pool size updates from those
paths entirely.

Top profile entries for a 24G range munmap(), before:

    38.54%  zswap-unmap  [kernel.kallsyms]  [k] zs_zpool_total_size
    12.51%  zswap-unmap  [kernel.kallsyms]  [k] zpool_get_total_size
     9.10%  zswap-unmap  [kernel.kallsyms]  [k] zswap_update_total_size
     2.95%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
     2.88%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
     2.86%  zswap-unmap  [kernel.kallsyms]  [k] xas_store

and after:

     7.70%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
     7.16%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
     6.74%  zswap-unmap  [kernel.kallsyms]  [k] xas_store

It was also briefly considered to move to a single atomic in zswap
that is updated by the backends, since zswap only cares about the sum
of all pools anyway. However, zram directly needs per-pool information
out of zsmalloc. To keep the backend from having to update two atomics
every time, I opted for the lazy aggregation instead for now.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: document pXd_leaf() API

There's one small section already, but since we're going to remove
pXd_huge(), that comment may start to obsolete.

Rewrite that section with more information, hopefully with that the API is
crystal clear on what it implies.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/arm: remove pmd_thp_or_huge()

ARM/ARM64 used to define pmd_thp_or_huge(). Now this macro is completely
redundant. Remove it and use pmd_leaf().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/treewide: remove pXd_huge()

This API is not used anymore, drop it for the whole tree.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/treewide: replace pXd_huge() with pXd_leaf()

Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
reuse it. Luckily, pXd_huge() isn't widely used.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/gup: merge pXd huge mapping checks

Huge mapping checks in GUP are slightly redundant and can be simplified.

pXd_huge() now is the same as pXd_leaf(). pmd_trans_huge() and
pXd_devmap() should both imply pXd_leaf(). Time to merge them into one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/powerpc: redefine pXd_huge() with pXd_leaf()

PowerPC book3s 4K mostly has the same definition on both, except
pXd_huge() constantly returns 0 for hash MMUs.  As Michael Ellerman
pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit
will never be set so it will keep returning false.

As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
pgtable walker __find_linux_pte() already used pXd_leaf() to check leaf
hugetlb mappings.

The goal should be that we will have one API pXd_leaf() to detect all
kinds of huge mappings (hugepd is still special in this case, though).
AFAICT we need to use the pXd_leaf() impl (rather than pXd_huge()'s) to
make sure ie.  THPs on hash MMU will also return true.

This helps to simplify a follow up patch to drop pXd_huge() treewide.

NOTE: *_leaf() definition need to be moved before the inclusion of
asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it.

[1] https://lore.kernel.org/r/[email protected]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/arm64: merge pXd_huge() and pXd_leaf() definitions

Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly
differently. Redefine the pXd_huge() with pXd_leaf().

There used to be two traps for old aarch64 definitions over these APIs that
I found when reading the code around, they're:

(1) 4797ec2dc83a ("arm64: fix pud_huge() for 2-level pagetables")
(2) 23bc8f69f0ec ("arm64: mm: fix p?d_leaf()")

Define pXd_huge() with the current pXd_leaf() will make sure (2) isn't a
problem (on PROT_NONE checks). To make sure it also works for (1), we
move over the __PAGETABLE_PMD_FOLDED check to pud_leaf(), allowing it to
constantly returning "false" for 2-level pgtables, which looks even safer
to cover both now.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/arm: redefine pmd_huge() with pmd_leaf()

Most of the archs already define these two APIs the same way.  ARM is more
complicated in two aspects:

  - For pXd_huge() it's always checking against !PXD_TABLE_BIT, while for
    pXd_leaf() it's always checking against PXD_TYPE_SECT.

  - SECT/TABLE bits are defined differently on 2-level v.s. 3-level ARM
    pgtables, which makes the whole thing even harder to follow.

Luckily, the second complexity should be hidden by the pmd_leaf()
implementation against 2-level v.s. 3-level headers.  Invoke pmd_leaf()
directly for pmd_huge(), to remove the first part of complexity.  This
prepares to drop pXd_huge() API globally.

When at it, drop the obsolete comments - it's outdated.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/arm: use macros to define pmd/pud helpers

It's already confusing that ARM 2-level v.s. 3-level defines SECT bit
differently on pmd/puds. Always use a macro which is much clearer.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/sparc: change pXd_huge() behavior to exclude swap entries

Please refer to the previous patch on the reasoning for x86. Now sparc is
the only architecture that will allow swap entries to be reported as
pXd_huge(). After this patch, all architectures should forbid swap
entries in pXd_huge().

[[email protected]: s/;;/;/, per Muchun]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/x86: change pXd_huge() behavior to exclude swap entries

This patch partly reverts below commits:

3a194f3f8ad0 ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry")
cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage")

Right now, pXd_huge() definition across kernel is unclear. We have two
groups that think differently on swap entries:

  - x86/sparc:     Allow pXd_huge() to accept swap entries
  - all the rest:  Doesn't allow pXd_huge() to accept swap entries

This is so confusing.  Since the sparc helpers seem to be added in 2016,
which is after x86's (2015), so sparc could have followed a trend.  x86
proposed such swap handling in 2015 to resolve hugetlb swap entries hit in
GUP, but now GUP guards swap entries with !pXd_present() in all layers so
we should be safe.

We should define this API properly, one way or another, rather than keep
them defined differently across archs.

Gut feeling tells me that pXd_huge() shouldn't include swap entries, and it
turns out that I am not the only one thinking so, the question was raised
when the current pmd_huge() for x86 was proposed by Ville Syrjälä:

https://lore.kernel.org/all/[email protected]/

  I might also be missing something obvious, but why is it even necessary
  to treat PRESENT==0+PSE==0 as a huge entry?

It is also questioned when Jason Gunthorpe reviewed the other patchset on
swap entry handlings:

https://lore.kernel.org/all/20240221125753 [email protected]/

Revert its meaning back to original.  It shouldn't have any functional
change as we should be ready with guards on !pXd_present() explicitly
everywhere.

Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2", it was there
probably because it was breaking things when 3a194f3f8ad0 was proposed,
according to the report here:

https://lore.kernel.org/all/[email protected]/

Now we shouldn't need that.

Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/gup: check p4d presence before going on

Currently there should have no p4d swap entries so it may not matter much,
however this may help us to rule out swap entries in pXd_huge() API, which
will include p4d_huge(). The p4d_present() checks make it 100% clear that
we won't rely on p4d_huge() for swap entries.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/gup: cache p4d in follow_p4d_mask()

Add a variable to cache p4d in follow_p4d_mask(). It's a good practise to
make sure all the following checks will have a consistent view of the
entry.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/hmm: process pud swap entry without pud_huge()

Swap pud entries do not always return true for pud_huge() for all archs.
x86 and sparc (so far) allow it, but all the rest do not accept a swap
entry to be reported as pud_huge().  So it's not safe to check swap
entries within pud_huge().  Check swap entries before pud_huge(), so it
should be always safe.

This is the only place in the kernel that (IMHO, wrongly) relies on
pud_huge() to return true on pud swap entries.  The plan is to cleanup
pXd_huge() to only report non-swap mappings for all archs.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: page_alloc: control latency caused by zone PCP draining

Patch series "mm/treewide: Remove pXd_huge() API", v2.

In previous work [1], we removed the pXd_large() API, which is arch
specific.  This patchset further removes the hugetlb pXd_huge() API.

Hugetlb was never special on creating huge mappings when compared with
other huge mappings.  Having a standalone API just to detect such pgtable
entries is more or less redundant, especially after the pXd_leaf() API set
is introduced with/without CONFIG_HUGETLB_PAGE.

When looking at this problem, a few issues are also exposed that we don't
have a clear definition of the *_huge() variance API.  This patchset
started by cleaning these issues first, then replace all *_huge() users to
use *_leaf(), then drop all *_huge() code.

On x86/sparc, swap entries will be reported "true" in pXd_huge(), while
for all the rest archs they're reported "false" instead.  This part is
done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix,
but I'll leave that to hmm experts to decide.

Besides, there are three archs (arm, arm64, powerpc) that have slightly
different definitions between the *_huge() v.s.  *_leaf() variances.  I
tackled them separately so that it'll be easier for arch experts to chim
in when necessary.  This part is done in patch 6-9.

The final patches 10-14 do the rest on the final removal, since *_leaf()
will be the ultimate API in the future, and we seem to have quite some
confusions on how *_huge() APIs can be defined, provide a rich comment for
*_leaf() API set to define them properly to avoid future misuse, and
hopefully that'll also help new archs to start support huge mappings and
avoid traps (like either swap entries, or PROT_NONE entry checks).

[1] https://lore.kernel.org/r/20240305043750 [email protected]

This patch (of 14):

When the complete PCP is drained a much larger number of pages than the
usual batch size might be freed at once, causing large IRQ and preemption
latency spikes, as they are all freed while holding the pcp and zone
spinlocks.

To avoid those latency spikes, limit the number of pages freed in a single
bulk operation to common batch limits.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lucas Stach <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg

mmap() must not succeed in validate_lower_address_hint(), for if it does,
it is a bug in mmap() itself. Reflect this behaviour with
ksft_exit_fail_msg(). While at it, do some formatting changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dev Jain <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE)

We changed faultin_page_range() to no longer consume a VMA, because
faultin_page_range() might internally release the mm lock to lookup
the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
independent of that, __get_user_pages() will always lookup the VMA
itself.

Now that we let __get_user_pages() just handle VMA checks in a way that
is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
is just overhead. So let's just call madvise_populate()
on the full range instead.

There is one change in behavior: madvise_walk_vmas() would skip any VMA
holes, and if everything succeeded, it would return -ENOMEM after
processing all VMAs.

However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
notice any difference: -ENOMEM might either indicate that there were VMA
holes or that populating page tables failed because there was not enough
memory. So it's unlikely that user space will notice the difference, and
that special handling likely only makes sense for some other madvise()
actions.

Further, we'd already fail with -ENOMEM early in the past if looking up the
VMA after dropping the MM lock failed because of concurrent VMA
modifications. So let's just keep it simple and avoid the madvise VMA
walk, and consistently fail early if we find a VMA hole.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: memcg: add NULL check to obj_cgroup_put()

9 out of 16 callers perform a NULL check before calling obj_cgroup_put().
Move the NULL check in the function, similar to mem_cgroup_put(). The
unlikely() NULL check in current_objcg_update() was left alone to avoid
dropping the unlikey() annotation as this a fast path.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: remove guard around pgd_offset_k() macro

The last architecture redefining pgd_offset_k() was IA64 and it was
removed by commit cf8e8658100d ("arch: Remove Itanium (IA-64)
architecture")

There is no need anymore to guard generic version of pgd_offset_k()
with #ifndef pgd_offset_k

Link: https://lkml.kernel.org/r/59d3f47d5615d18cca1986f269be2fcb3df34556.1710589838.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

merge mm-hotfixes-stable into mm-nonmm-stable to pick up needed changes

mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

When I did memory failure tests recently, below warning occurs:

DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
FS:  00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
Call Trace:
<TASK>
lock_acquire+0xbe/0x2d0
_raw_spin_lock_irqsave+0x3a/0x60
hugepage_subpool_put_pages.part.0+0xe/0xc0
free_huge_folio+0x253/0x3f0
dissolve_free_huge_page+0x147/0x210
__page_handle_poison+0x9/0x70
memory_failure+0x4e6/0x8c0
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
</TASK>
Kernel panic - not syncing: kernel: panic_on_warn set ...
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
panic+0x326/0x350
check_panic_on_warn+0x4f/0x50
__warn+0x98/0x190
report_bug+0x18e/0x1a0
handle_bug+0x3d/0x70
exc_invalid_op+0x18/0x70
asm_exc_invalid_op+0x1a/0x20
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
lock_acquire+0xbe/0x2d0
_raw_spin_lock_irqsave+0x3a/0x60
hugepage_subpool_put_pages.part.0+0xe/0xc0
free_huge_folio+0x253/0x3f0
dissolve_free_huge_page+0x147/0x210
__page_handle_poison+0x9/0x70
memory_failure+0x4e6/0x8c0
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
</TASK>

After git bisecting and digging into the code, I believe the root cause is
that _deferred_list field of folio is unioned with _hugetlb_subpool field.
In __update_and_free_hugetlb_folio(), folio->_deferred_list is
initialized leading to corrupted folio->_hugetlb_subpool when folio is
hugetlb.  Later free_huge_folio() will use _hugetlb_subpool and above
warning happens.

But it is assumed hugetlb flag must have been cleared when calling
folio_put() in update_and_free_hugetlb_folio().  This assumption is broken
due to below race:

CPU1 CPU2
dissolve_free_huge_page update_and_free_pages_bulk
update_and_free_hugetlb_folio hugetlb_vmemmap_restore_folios
  folio_clear_hugetlb_vmemmap_optimized
  clear_flag = folio_test_hugetlb_vmemmap_optimized
  if (clear_flag) <-- False, it's already cleared.
   __folio_clear_hugetlb(folio) <-- Hugetlb is not cleared.
  folio_put
   free_huge_folio <-- free_the_page is expected.
list_for_each_entry()
  __folio_clear_hugetlb <-- Too late.

Fix this issue by checking whether folio is hugetlb directly instead of
checking clear_flag to close the race window.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests: mm: protection_keys: save/restore nr_hugepages value from launch script

The save/restore of nr_hugepages was added to the test itself by using the
atexit() functionality.  But it is broken as parent exits after creating
child.  Hence calling the atexit() function early.  That's not it.  The
child exits after creating its child and so on.

The parent cannot wait to get the termination status for its children as
it'll keep on holding the resources until the new pkey allocation fails.
It is impossible to wait for exits of all the grand and great grand
children.  Hence the restoring of nr_hugepages value from parent is wrong.

Let's save/restore the nr_hugepages settings in the launch script
instead of doing it in the test.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c52eb6db7b7d ("selftests: mm: restore settings from only parent process")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Reported-by: Joey Gouly <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Cc: Joey Gouly <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

stackdepot: respect __GFP_NOLOCKDEP allocation flag

If stack_depot_save_flags() allocates memory it always drops
__GFP_NOLOCKDEP flag.  So when KASAN tries to track __GFP_NOLOCKDEP
allocation we may end up with lockdep splat like bellow:

======================================================
WARNING: possible circular locking dependency detected
6.9.0-rc3+ #49 Not tainted
------------------------------------------------------
kswapd0/149 is trying to acquire lock:
ffff88811346a920
(&xfs_nondir_ilock_class){++++}-{4:4}, at: xfs_reclaim_inode+0x3ac/0x590
[xfs]

but task is already holding lock:
ffffffff8bb33100 (fs_reclaim){+.+.}-{0:0}, at:
balance_pgdat+0x5d9/0xad0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
        __lock_acquire+0x7da/0x1030
        lock_acquire+0x15d/0x400
        fs_reclaim_acquire+0xb5/0x100
prepare_alloc_pages.constprop.0+0xc5/0x230
        __alloc_pages+0x12a/0x3f0
        alloc_pages_mpol+0x175/0x340
        stack_depot_save_flags+0x4c5/0x510
        kasan_save_stack+0x30/0x40
        kasan_save_track+0x10/0x30
        __kasan_slab_alloc+0x83/0x90
        kmem_cache_alloc+0x15e/0x4a0
        __alloc_object+0x35/0x370
        __create_object+0x22/0x90
__kmalloc_node_track_caller+0x477/0x5b0
        krealloc+0x5f/0x110
        xfs_iext_insert_raw+0x4b2/0x6e0 [xfs]
        xfs_iext_insert+0x2e/0x130 [xfs]
        xfs_iread_bmbt_block+0x1a9/0x4d0 [xfs]
        xfs_btree_visit_block+0xfb/0x290 [xfs]
        xfs_btree_visit_blocks+0x215/0x2c0 [xfs]
        xfs_iread_extents+0x1a2/0x2e0 [xfs]
xfs_buffered_write_iomap_begin+0x376/0x10a0 [xfs]
        iomap_iter+0x1d1/0x2d0
iomap_file_buffered_write+0x120/0x1a0
        xfs_file_buffered_write+0x128/0x4b0 [xfs]
        vfs_write+0x675/0x890
        ksys_write+0xc3/0x160
        do_syscall_64+0x94/0x170
entry_SYSCALL_64_after_hwframe+0x71/0x79

Always preserve __GFP_NOLOCKDEP to fix this.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: Andrey Ryabinin <[email protected]>
Reported-by: Xiubo Li <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reported-by: Damien Le Moal <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Suggested-by: Dave Chinner <[email protected]>
Tested-by: Xiubo Li <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hugetlb: check for anon_vma prior to folio allocation

Commit 9acad7ba3e25 ("hugetlb: use vmf_anon_prepare() instead of
anon_vma_prepare()") may bailout after allocating a folio if we do not
hold the mmap lock. When this occurs, vmf_anon_prepare() will release the
vma lock. Hugetlb then attempts to call restore_reserve_on_error(), which
depends on the vma lock being held.

We can move vmf_anon_prepare() prior to the folio allocation in order to
avoid calling restore_reserve_on_error() without the vma lock.

Link: https://lkml.kernel.org/r/ZiFqSrSRLhIV91og@fedora
Fixes: 9acad7ba3e25 ("hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()")
Reported-by: [email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zswap: fix shrinker NULL crash with cgroup_disable=memory

Christian reports a NULL deref in zswap that he bisected down to the zswap
shrinker.  The issue also cropped up in the bug trackers of libguestfs [1]
and the Red Hat bugzilla [2].

The problem is that when memcg is disabled with the boot time flag, the
zswap shrinker might get called with sc->memcg == NULL.  This is okay in
many places, like the lruvec operations.  But it crashes in
memcg_page_state() - which is only used due to the non-node accounting of
cgroup's the zswap memory to begin with.

Nhat spotted that the memcg can be NULL in the memcg-disabled case, and I
was then able to reproduce the crash locally as well.

[1] https://github.com/libguestfs/libguestfs/issues/139
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Christian Heusel <[email protected]>
Debugged-by: Nhat Pham <[email protected]>
Suggested-by: Nhat Pham <[email protected]>
Tested-by: Christian Heusel <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Richard W.M. Jones <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: <[email protected]> [v6.8]
Signed-off-by: Andrew Morton <[email protected]>

mm: turn folio_test_hugetlb into a PageType

The current folio_test_hugetlb() can be fooled by a concurrent folio split
into returning true for a folio which has never belonged to hugetlbfs.
This can't happen if the caller holds a refcount on it, but we have a few
places (memory-failure, compaction, procfs) which do not and should not
take a speculative reference.

Since hugetlb pages do not use individual page mapcounts (they are always
fully mapped and use the entire_mapcount field to record the number of
mappings), the PageType field is available now that page_mapcount()
ignores the value in this field.

In compaction and with CONFIG_DEBUG_VM enabled, the current implementation
can result in an oops, as reported by Luis. This happens since 9c5ccf2db04b
("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks
in the PageHuge() testing path.

[[email protected]: update vmcoreinfo]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reported-by: Luis Chamberlain <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: support page_mapcount() on page_has_type() pages

Return 0 for pages which can't be mapped. This matches how page_mapped()
works. It is more convenient for users to not have to filter out these
pages.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros

Following the separation of FOLIO_FLAGS from PAGEFLAGS, separate
FOLIO_FLAG_FALSE from PAGEFLAG_FALSE and FOLIO_TYPE_OPS from
PAGE_TYPE_OPS.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/hugetlb: fix missing hugetlb_lock for resv uncharge

There is a recent report on UFFDIO_COPY over hugetlb:

https://lore.kernel.org/all/000000000000ee06de0616177560@google.com/

350: lockdep_assert_held(&hugetlb_lock);

Should be an issue in hugetlb but triggered in an userfault context, where
it goes into the unlikely path where two threads modifying the resv map
together. Mike has a fix in that path for resv uncharge but it looks like
the locking criteria was overlooked: hugetlb_cgroup_uncharge_folio_rsvd()
will update the cgroup pointer, so it requires to be called with the lock
held.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 79aa925bf239 ("hugetlb_cgroup: fix reservation accounting")
Signed-off-by: Peter Xu <[email protected]>
Reported-by: [email protected]
Reviewed-by: Mina Almasry <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests: mm: fix unused and uninitialized variable warning

Fix the warnings by initializing and marking the variable as unused.
I've caught the warnings by using clang.

split_huge_page_test.c:303:6: warning: variable 'dummy' set but not used [-Wunused-but-set-variable]
  303 |         int dummy;
      |             ^
split_huge_page_test.c:343:3: warning: variable 'dummy' is uninitialized when used here [-Wuninitialized]
  343 |                 dummy += *(*addr + i);
      |                 ^~~~~
split_huge_page_test.c:303:11: note: initialize the variable 'dummy' to silence this warning
  303 |         int dummy;
      |                  ^
      |                   = 0
2 warnings generated.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: fc4d182316bd ("mm: huge_memory: enable debugfs to split huge pages to any order")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Bill Wendling <[email protected]>
Cc: Justin Stitt <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/harness: remove use of LINE_MAX

Android was seeing a compliation error because its C library does not
define LINE_MAX. This replaces the use of LINE_MAX / snprintf with
asprintf, which will change the behavior to not truncate the test name if
it is over 2048 chars long.

See also:
https://github.com/llvm/llvm-project/issues/88119

[[email protected]: remove limits.h include, per Edward]
[[email protected]: check asprintf() return]
[[email protected]: fix undeclared function error]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 38c957f07038 ("selftests: kselftest_harness: generate test name once")
Signed-off-by: Edward Liaw <[email protected]>
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Bill Wendling <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Edward Liaw <[email protected]>
Cc: Justin Stitt <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Mike Rapoport (IBM)" <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Will Drewry <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

nilfs2: fix OOB in nilfs_set_de_type

The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is
defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function,
which uses this array, specifies the index to read from the array in the
same way as "(mode & S_IFMT) >> S_SHIFT".

static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode
*inode)
{
umode_t mode = inode->i_mode;

de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob
}

However, when the index is determined this way, an out-of-bounds (OOB)
error occurs by referring to an index that is 1 larger than the array size
when the condition "mode & S_IFMT == S_IFMT" is satisfied. Therefore, a
patch to resize the nilfs_type_by_mode array should be applied to prevent
OOB errors.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations")
Signed-off-by: Jeongjun Park <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Tested-by: Ryusuke Konishi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

MAINTAINERS: update Naoya Horiguchi's email address

My old NEC address has been removed, so update MAINTAINERS and .mailmap to
map it to my gmail address.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

fork: defer linking file vma until vma is fully initialized

Thorvald reported a WARNING [1]. And the root cause is below race:

CPU 1 CPU 2
fork hugetlbfs_fallocate
  dup_mmap hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
i_mmap_lock_write(mapping);
    hugetlb_vmdelete_list
  vma_interval_tree_foreach
   hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
i_mmap_rwsem lock while vma lock can be used in the same time.  Fix this
by deferring linking file vma until vma is fully initialized.  Those vmas
should be initialized first before they can be used.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin <[email protected]>
Reported-by: Thorvald Natvig <[email protected]>
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/shmem: inline shmem_is_huge() for disabled transparent hugepages

In order to  minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y),
compiler might choose to make a regular function call (out-of-line) for
shmem_is_huge() instead of inlining it. When transparent hugepages are
disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), it can cause compilation
error.

mm/shmem.c: In function `shmem_getattr':
./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
  383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
      |                           ^~~~~~~~~
mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
1148 |                 stat->blksize = HPAGE_PMD_SIZE;

To prevent the possible error, always inline shmem_is_huge() when
transparent hugepages are disabled.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sumanth Korikkar <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: defer enablement of static branch

Kefeng Wang reported that he was seeing some memory leaks with kmemleak
with page_owner enabled.

The reason is that we enable the page_owner_inited static branch and then
proceed with the linking of stack_list struct to dummy_stack, which means
that exists a race window between these two steps where we can have pages
already being allocated calling add_stack_record_to_list(), allocating
objects and linking them to stack_list, but then we set stack_list
pointing to dummy_stack in init_page_owner. Which means that the objects
that have been allocated during that time window are unreferenced and
lost.

Fix this by deferring the enablement of the branch until we have properly
set up the list.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4bedfb314bdd ("mm,page_owner: maintain own list of stack_records structs")
Signed-off-by: Oscar Salvador <[email protected]>
Reported-by: Kefeng Wang <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Tested-by: Kefeng Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Squashfs: check the inode number is not the invalid value of zero

Syskiller has produced an out of bounds access in fill_meta_index().

That out of bounds access is ultimately caused because the inode
has an inode number with the invalid value of zero, which was not checked.

The reason this causes the out of bounds access is due to following
sequence of events:

1. Fill_meta_index() is called to allocate (via empty_meta_index())
   and fill a metadata index.  It however suffers a data read error
   and aborts, invalidating the newly returned empty metadata index.
   It does this by setting the inode number of the index to zero,
   which means unused (zero is not a valid inode number).

2. When fill_meta_index() is subsequently called again on another
   read operation, locate_meta_index() returns the previous index
   because it matches the inode number of 0.  Because this index
   has been returned it is expected to have been filled, and because
   it hasn't been, an out of bounds access is performed.

This patch adds a sanity check which checks that the inode number
is not zero when the inode is created and returns -EINVAL if it is.

[[email protected]: whitespace fix]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Phillip Lougher <[email protected]>
Reported-by: "Ubisectech Sirius" <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Cc: Christian Brauner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,swapops: update check in is_pfn_swap_entry for hwpoison entries

Tony reported that the Machine check recovery was broken in v6.9-rc1, as
he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to
DRAM.

After some more digging and debugging on his side, he realized that this
went back to v6.1, with the introduction of 'commit 0d206b5d2e0d
("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That
commit, among other things, introduced swp_offset_pfn(), replacing
hwpoison_entry_to_pfn() in its favour.

The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
is_pfn_swap_entry() never got updated to cover hwpoison entries, which
means that we would hit the VM_BUG_ON whenever we would call
swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM
set. Fix this by updating the check to cover hwpoison entries as well,
and update the comment while we are it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: Oscar Salvador <[email protected]>
Reported-by: Tony Luck <[email protected]>
Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
Tested-by: Tony Luck <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: <[email protected]> [6.1.x]
Signed-off-by: Andrew Morton <[email protected]>

mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

When I did hard offline test with hugetlb pages, below deadlock occurs:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
       __mutex_lock+0x6c/0x770
       page_alloc_cpu_online+0x3c/0x70
       cpuhp_invoke_callback+0x397/0x5f0
       __cpuhp_invoke_callback_range+0x71/0xe0
       _cpu_up+0xeb/0x210
       cpu_up+0x91/0xe0
       cpuhp_bringup_mask+0x49/0xb0
       bringup_nonboot_cpus+0xb7/0xe0
       smp_init+0x25/0xa0
       kernel_init_freeable+0x15f/0x3e0
       kernel_init+0x15/0x1b0
       ret_from_fork+0x2f/0x50
       ret_from_fork_asm+0x1a/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcp_batch_high_lock);
                               lock(cpu_hotplug_lock);
                               lock(pcp_batch_high_lock);
  rlock(cpu_hotplug_lock);

*** DEADLOCK ***

5 locks held by bash/46904:
#0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
#1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
#2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
#3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
#4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xa0
check_noncircular+0x129/0x140
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

In short, below scene breaks the lock dependency chain:

memory_failure
  __page_handle_poison
   zone_pcp_disable -- lock(pcp_batch_high_lock)
   dissolve_free_huge_page
    __hugetlb_vmemmap_restore_folio
     static_key_slow_dec
      cpus_read_lock -- rlock(cpu_hotplug_lock)

Fix this by calling drain_all_pages() instead.

This issue won't occur until commit a6b40850c442 ("mm: hugetlb: replace
hugetlb_free_vmemmap_enabled with a static_key").  As it introduced
rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while
lock(pcp_batch_high_lock) is already in the __page_handle_poison().

[[email protected]: extend comment per Oscar]
[[email protected]: reflow block comment]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/userfaultfd: allow hugetlb change protection upon poison entry

After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
the POISON one or UFFD_WP one.

Allow change protection to run on a poisoned marker just like !hugetlb
cases, ignoring the marker irrelevant of the permission.

Here the two bits are mutual exclusive.  For example, when install a
poisoned entry it must not be UFFD_WP already (by checking pte_none()
before such install).  And it also means if UFFD_WP is set there must have
no POISON bit set.  It makes sense because UFFD_WP is a bit to reflect
permission, and permissions do not apply if the pte is poisoned and
destined to sigbus.

So here we simply check uffd_wp bit set first, do nothing otherwise.

Attach the Fixes to UFFDIO_POISON work, as before that it should not be
possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap,
so no chance of swapin errors).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
Signed-off-by: Peter Xu <[email protected]>
Reported-by: [email protected]
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Cc: <[email protected]> [6.6+]
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: fix printing of stack records

When seq_* code sees that its buffer overflowed, it re-allocates a bigger
onecand calls seq_operations->start() callback again. stack_start()
naively though that if it got called again, it meant that the old record
got already printed so it returned the next object, but that is not true.

The consequence of that is that every time stack_stop() -> stack_start()
get called because we needed a bigger buffer, stack_start() will skip
entries, and those will not be printed.

Fix it by not advancing to the next object in stack_start().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 765973a09803 ("mm,page_owner: display all stacks and their count")
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: fix accounting of pages when migrating

Upon migration, new allocated pages are being given the handle of the old
pages. This is problematic because it means that for the stack which
allocated the old page, we will be substracting the old page + the new one
when that page is freed, creating an accounting imbalance.

There is an interest in keeping it that way, as otherwise the output will
biased towards migration stacks should those operations occur often, but
that is not really helpful.

The link from the new page to the old stack is being performed by calling
__update_page_owner_handle() in __folio_copy_owner(). The only thing that
is left is to link the migrate stack to the old page, so the old page will
be subtracted from the migrate stack, avoiding by doing so any possible
imbalance.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: fix refcount imbalance

Current code does not contemplate scenarios were an allocation and free
operation on the same pages do not handle it in the same amount at once.
To give an example, page_alloc_exact(), where we will allocate a page of
enough order to stafisfy the size request, but we will free the remainings
right away.

In the above example, we will increment the stack_record refcount only
once, but we will decrease it the same number of times as number of unused
pages we have to free. This will lead to a warning because of refcount
imbalance.

Fix this by recording the number of base pages in the refcount field.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-mm/[email protected]
Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Alexandre Ghiti <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: update metadata for tail pages

Patch series "page_owner: Fix refcount imbalance and print fixup", v4.

This series consists of a refactoring/correctness of updating the metadata
of tail pages, a couple of fixups for the refcounting part and a fixup for
the stack_start() function.

From this series on, instead of counting the stacks, we count the
outstanding nr_base_pages each stack has, which gives us a much better
memory overview. The other fixup is for the migration part.

A more detailed explanation can be found in the changelog of the
respective patches.

This patch (of 4):

__set_page_owner_handle() and __reset_page_owner() update the metadata of
all pages when the page is of a higher-order, but we miss to do the same
when the pages are migrated. __folio_copy_owner() only updates the
metadata of the head page, meaning that the information stored in the
first page and the tail pages will not match.

Strictly speaking that is not a big problem because 1) we do not print
tail pages and 2) upon splitting all tail pages will inherit the metadata
of the head page, but it is better to have all metadata in check should
there be any problem, so it can ease debugging.

For that purpose, a couple of helpers are created
__update_page_owner_handle() which updates the metadata on allocation, and
__update_page_owner_free_handle() which does the same when the page is
freed.

__folio_copy_owner() will make use of both as it needs to entirely replace
the page_owner metadata for the new page.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Tested-by: Kefeng Wang <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE

Commit d7a08838ab74 ("mm: userfaultfd: fix unexpected change to src_folio
when UFFDIO_MOVE fails") moved the src_folio->{mapping, index} changing to
after clearing the page-table and ensuring that it's not pinned. This
avoids failure of swapout+migration and possibly memory corruption.

However, the commit missed fixing it in the huge-page case.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Lokesh Gidra <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.

A reproducer can be found at [1].

The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().

__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.

So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().

But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.

As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.

With this change, Darricks reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.

[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Darrick J. Wong <[email protected]>
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Cc: Darrick J. Wong <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Merge branch 'master' into mm-stable

Linux 6.9-rc4

Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull sysfs fix from Al Viro:
"Get rid of lockdep false positives around sysfs/overlayfs

  syzbot has uncovered a class of lockdep false positives for setups
  with sysfs being one of the backing layers in overlayfs. The root
  cause is that of->mutex allocated when opening a sysfs file read-only
  (which overlayfs might do) is confused with of->mutex of a file opened
  writable (held in write to sysfs file, which overlayfs won't do).

  Assigning them separate lockdep classes fixes that bunch and it's
  obviously safe"

* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: annotate different lockdep class for of->mutex of writable files

Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

- Follow up fixes for the BHI mitigations code

- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
   expected

- Work around an APIC emulation bug when the kernel is built with Clang
   and run as a SEV guest

- Follow up x86 topology fixes

* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Move TOPOEXT enablement into the topology parser
  x86/cpu/amd: Make the NODEID_MSR union actually work
  x86/cpu/amd: Make the CPUID 0x80000008 parser correct
  x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
  x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
  x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
  x86/bugs: Fix BHI handling of RRSBA
  x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
  x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
  x86/bugs: Fix BHI documentation
  x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
  x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
  x86/bugs: Fix return type of spectre_bhi_state()
  x86/apic: Force native_apic_mem_read() to use the MOV instruction

Merge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:

- Address a (valid) W=1 build warning

- Fix timer self-tests

- Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
   global variable

- Address a !CONFIG_BUG build warning

* tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests: kselftest: Fix build failure with NOLIBC
  selftests: timers: Fix abs() warning in posix_timers test
  selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
  selftests: timers: Fix posix_timers ksft_print_msg() warning
  selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
  bug: Fix no-return-statement warning with !CONFIG_BUG
  timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
  selftests/timers/posix_timers: Reimplement check_timer_distribution()
  irqflags: Explicitly ignore lockdep_hrtimer_exit() argument

Merge tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
"Fix the x86 PMU multi-counter code returning invalid data in certain
circumstances"

* tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix out of range data

Merge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"

* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y

Merge tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
"Fix a bug in the GIC irqchip driver"

* tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix VSYNC referencing an unmapped VPE on GIC v4.1