]> Git Repo - linux.git/log
linux.git
2 years agomm/mmap: remove __vma_adjust()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:49 +0000 (11:26 -0500)]
mm/mmap: remove __vma_adjust()

Inline the work of __vma_adjust() into vma_merge().  This reduces code
size and has the added benefits of the comments for the cases being
located with the code.

Change the comments referencing vma_adjust() accordingly.

[[email protected]: fix vma_merge() offset when expanding the next vma]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: convert do_brk_flags() to use vma_prepare() and vma_complete()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:48 +0000 (11:26 -0500)]
mm/mmap: convert do_brk_flags() to use vma_prepare() and vma_complete()

Use the abstracted vma locking for do_brk_flags()

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: introduce dup_vma_anon() helper
Liam R. Howlett [Fri, 20 Jan 2023 16:26:47 +0000 (11:26 -0500)]
mm/mmap: introduce dup_vma_anon() helper

Create a helper for duplicating the anon vma when adjusting the vma.  This
simplifies the logic of __vma_adjust().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: don't use __vma_adjust() in shift_arg_pages()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:46 +0000 (11:26 -0500)]
mm/mmap: don't use __vma_adjust() in shift_arg_pages()

Introduce shrink_vma() which uses the vma_prepare() and vma_complete()
functions to reduce the vma coverage.

Convert shift_arg_pages() to use expand_vma() and the new shrink_vma()
function.  Remove support from __vma_adjust() to reduce a vma size since
shift_arg_pages() is the only user that shrinks a VMA in this way.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mremap: convert vma_adjust() to vma_expand()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:45 +0000 (11:26 -0500)]
mm/mremap: convert vma_adjust() to vma_expand()

Stop using vma_adjust() in preparation for removing the function.  Export
vma_expand() to use instead.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: don't use __vma_adjust() in __split_vma()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:44 +0000 (11:26 -0500)]
mm: don't use __vma_adjust() in __split_vma()

Use the abstracted locking and maple tree operations.  Since __split_vma()
is the only user of the __vma_adjust() function to use the insert
argument, drop that argument.  Remove the NULL passed through from
fs/exec's shift_arg_pages() and mremap() at the same time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: introduce init_vma_prep() and init_multi_vma_prep()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:43 +0000 (11:26 -0500)]
mm/mmap: introduce init_vma_prep() and init_multi_vma_prep()

Add init_vma_prep() and init_multi_vma_prep() to set up the struct
vma_prepare.  This is to abstract the locking when adjusting the VMAs.

Also change __vma_adjust() variable remove_next int in favour of a pointer
to the VMA to remove.  Rename next_next to remove2 since this better
reflects its use.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: use vma_prepare() and vma_complete() in vma_expand()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:42 +0000 (11:26 -0500)]
mm/mmap: use vma_prepare() and vma_complete() in vma_expand()

Use the new locking functions for vma_expand().  This reduces code
duplication.

At the same time change VM_BUG_ON() to VM_WARN_ON()

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: refactor locking out of __vma_adjust()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:41 +0000 (11:26 -0500)]
mm/mmap: refactor locking out of __vma_adjust()

Move the locking into vma_prepare() and vma_complete() for use elsewhere

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: move anon_vma setting in __vma_adjust()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:40 +0000 (11:26 -0500)]
mm/mmap: move anon_vma setting in __vma_adjust()

Move the anon_vma setting & warn_no up the function.  This is done to
clear up the locking later.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: change munmap splitting order and move_vma()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:39 +0000 (11:26 -0500)]
mm: change munmap splitting order and move_vma()

Splitting can be more efficient when the order is not of concern.  Change
do_vmi_align_munmap() to reduce walking of the tree during split
operations.

move_vma() must also be altered to remove the dependency of keeping the
original VMA as the active part of the split.  Transition to using vma
iterator to look up the prev and/or next vma after munmap.

[[email protected]: fix vma iterator initialization]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: clean up mmap_region() unrolling
Liam R. Howlett [Fri, 20 Jan 2023 16:26:38 +0000 (11:26 -0500)]
mmap: clean up mmap_region() unrolling

Move logic of unrolling to the error path as apposed to duplicating it
within the function body.  This reduces the potential of missing an update
to one path when making changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: Li Zetao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: add vma iterator to vma_adjust() arguments
Liam R. Howlett [Fri, 20 Jan 2023 16:26:37 +0000 (11:26 -0500)]
mm: add vma iterator to vma_adjust() arguments

Change the vma_adjust() function definition to accept the vma iterator and
pass it through to __vma_adjust().

Update fs/exec to use the new vma_adjust() function parameters.

Update mm/mremap to use the new vma_adjust() function parameters.

Revert the __split_vma() calls back from __vma_adjust() to vma_adjust()
and pass through the vma iterator.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: pass vma iterator through to __vma_adjust()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:36 +0000 (11:26 -0500)]
mm: pass vma iterator through to __vma_adjust()

Pass the iterator through to be used in __vma_adjust().  The state of the
iterator needs to be correct for the operation that will occur so make the
adjustments.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: remove unnecessary write to vma iterator in __vma_adjust()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:35 +0000 (11:26 -0500)]
mm: remove unnecessary write to vma iterator in __vma_adjust()

If the vma start address is going to change due to an insert, then it is
safe to not write the vma to the tree.  The write of the insert vma will
alter the tree as necessary.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomadvise: use split_vma() instead of __split_vma()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:34 +0000 (11:26 -0500)]
madvise: use split_vma() instead of __split_vma()

The split_vma() wrapper is specifically for this use case, so use it.

[[email protected]: fix VMA_ITERATOR start position]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: pass through vma iterator to __vma_adjust()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:33 +0000 (11:26 -0500)]
mm: pass through vma iterator to __vma_adjust()

Pass the vma iterator through to __vma_adjust() so the state can be
updated.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: convert __vma_adjust() to use vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:32 +0000 (11:26 -0500)]
mmap: convert __vma_adjust() to use vma iterator

Use the vma iterator internally for __vma_adjust().  Avoid using the maple
tree interface directly for type safety.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/damon/vaddr-test.h: stop using vma_mas_store() for maple tree store
Liam R. Howlett [Fri, 20 Jan 2023 16:26:31 +0000 (11:26 -0500)]
mm/damon/vaddr-test.h: stop using vma_mas_store() for maple tree store

Prepare for the removal of the vma_mas_store() function by open coding the
maple tree store in this test code.  Set the range of the maple state and
call the store function directly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: switch vma_merge(), split_vma(), and __split_vma to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:30 +0000 (11:26 -0500)]
mm: switch vma_merge(), split_vma(), and __split_vma to vma iterator

Drop the vmi_* functions and transition all users to use the vma iterator
directly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agonommu: pass through vma iterator to shrink_vma()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:29 +0000 (11:26 -0500)]
nommu: pass through vma iterator to shrink_vma()

Rename the function to vmi_shrink_vma() indicate it takes the vma
iterator.  Use the iterator to preallocate and drop the delete function.
The maple tree is able to do the modification easier than the linked list
and rbtree, so just clear the necessary area in the tree.

add_vma_to_mm() is no longer used, so drop this function.

vmi_add_vma_to_mm() is now only used once, so inline this function into
do_mmap().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agonommu: convert nommu to using the vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:28 +0000 (11:26 -0500)]
nommu: convert nommu to using the vma iterator

Gain type safety in nommu by using the vma_iterator and not the maple tree
directly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mremap: use vmi version of vma_merge()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:27 +0000 (11:26 -0500)]
mm/mremap: use vmi version of vma_merge()

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: use vmi version of vma_merge()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:26 +0000 (11:26 -0500)]
mmap: use vmi version of vma_merge()

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: pass through vmi iterator to __split_vma()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:25 +0000 (11:26 -0500)]
mmap: pass through vmi iterator to __split_vma()

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomadvise: use vmi iterator for __split_vma() and vma_merge()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:24 +0000 (11:26 -0500)]
madvise: use vmi iterator for __split_vma() and vma_merge()

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agosched: convert to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:23 +0000 (11:26 -0500)]
sched: convert to vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agotask_mmu: convert to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:22 +0000 (11:26 -0500)]
task_mmu: convert to vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Update the comments to how the vma iterator works.  The vma iterator will
keep track of the last vm_end and start the search from vm_end + 1.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomempolicy: convert to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:21 +0000 (11:26 -0500)]
mempolicy: convert to vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agocoredump: convert to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:20 +0000 (11:26 -0500)]
coredump: convert to vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomlock: convert mlock to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:19 +0000 (11:26 -0500)]
mlock: convert mlock to vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: change mprotect_fixup to vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:18 +0000 (11:26 -0500)]
mm: change mprotect_fixup to vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agouserfaultfd: use vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:17 +0000 (11:26 -0500)]
userfaultfd: use vma iterator

Use the vma iterator so that the iterator can be invalidated or updated to
avoid each caller doing so.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agoipc/shm: introduce new do_vma_munmap() to munmap
Liam R. Howlett [Thu, 26 Jan 2023 21:20:49 +0000 (16:20 -0500)]
ipc/shm: introduce new do_vma_munmap() to munmap

The shm already has the vma iterator in position for a write.
do_vmi_munmap() searches for the correct position and aligns the write, so
it is not the right function to use in this case.

The shm VMA tree modification is similar to the brk munmap situation, the
vma iterator is in position and the VMA is already known.  This patch
generalizes the brk munmap function do_brk_munmap() to be used for any
other callers with the vma iterator already in position to munmap a VMA.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: Sven Schnelle <[email protected]>
Link: https://lore.kernel.org/linux-mm/[email protected]/
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agoipc/shm: use the vma iterator for munmap calls
Liam R. Howlett [Fri, 20 Jan 2023 16:26:16 +0000 (11:26 -0500)]
ipc/shm: use the vma iterator for munmap calls

Pass through the vma iterator to do_vmi_munmap() to handle the iterator
state internally

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: add temporary vma iterator versions of vma_merge(), split_vma(), and __split_vma()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:15 +0000 (11:26 -0500)]
mm: add temporary vma iterator versions of vma_merge(), split_vma(), and __split_vma()

These wrappers are short-lived in this patch set so that each user can be
converted on its own.  In the end, these functions are renamed in one
commit.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: convert vma_expand() to use vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:14 +0000 (11:26 -0500)]
mmap: convert vma_expand() to use vma iterator

Use the vma iterator instead of the maple state for type safety and for
consistency through the mm code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: change do_mas_munmap and do_mas_aligned_munmap() to use vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:13 +0000 (11:26 -0500)]
mmap: change do_mas_munmap and do_mas_aligned_munmap() to use vma iterator

Start passing the vma iterator through the mm code.  This will allow for
reuse of the state and cleaner invalidation if necessary.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: remove preallocation from do_mas_align_munmap()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:12 +0000 (11:26 -0500)]
mm/mmap: remove preallocation from do_mas_align_munmap()

In preparation of passing the vma state through split, the pre-allocation
that occurs before the split has to be moved to after.  Since the
preallocation would then live right next to the store, just call store
instead of preallocating.  This effectively restores the potential error
path of splitting and not munmap'ing which pre-dates the maple tree.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agommap: convert vma_link() vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:11 +0000 (11:26 -0500)]
mmap: convert vma_link() vma iterator

Avoid using the maple tree interface directly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agokernel/fork: convert forking to using the vmi iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:10 +0000 (11:26 -0500)]
kernel/fork: convert forking to using the vmi iterator

Avoid using the maple tree interface directly.  This gains type safety.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/mmap: convert brk to use vma iterator
Liam R. Howlett [Fri, 20 Jan 2023 16:26:09 +0000 (11:26 -0500)]
mm/mmap: convert brk to use vma iterator

Use the vma iterator API for the brk() system call.  This will provide
type safety at compile time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: expand vma iterator interface
Liam R. Howlett [Fri, 20 Jan 2023 16:26:08 +0000 (11:26 -0500)]
mm: expand vma iterator interface

Add wrappers for the maple tree to the vma iterator.  This will provide
type safety at compile time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomaple_tree: fix mas_prev() and mas_find() state handling
Liam R. Howlett [Fri, 20 Jan 2023 16:26:07 +0000 (11:26 -0500)]
maple_tree: fix mas_prev() and mas_find() state handling

When mas_prev() does not find anything, set the state to MAS_NONE.

Handle the MAS_NONE in mas_find() like a MAS_START.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomaple_tree: fix handle of invalidated state in mas_wr_store_setup()
Liam R. Howlett [Fri, 20 Jan 2023 16:26:06 +0000 (11:26 -0500)]
maple_tree: fix handle of invalidated state in mas_wr_store_setup()

If an invalidated maple state is encountered during write, reset the maple
state to MAS_START.  This will result in a re-walk of the tree to the
correct location for the write.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agotest_maple_tree: test modifications while iterating
Liam R. Howlett [Fri, 20 Jan 2023 16:26:05 +0000 (11:26 -0500)]
test_maple_tree: test modifications while iterating

Add a testcase to ensure the iterator detects bad states on modifications
and does what the user expects

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomaple_tree: reduce user error potential
Liam R. Howlett [Fri, 20 Jan 2023 16:26:04 +0000 (11:26 -0500)]
maple_tree: reduce user error potential

When iterating, a user may operate on the tree and cause the maple state
to be altered and left in an unintuitive state.  Detect this scenario and
correct it by setting to the limit and invalidating the state.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomaple_tree: fix potential rcu issue
Liam R. Howlett [Fri, 20 Jan 2023 16:26:03 +0000 (11:26 -0500)]
maple_tree: fix potential rcu issue

Ensure the node isn't dead after reading the node end.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomaple_tree: add mas_init() function
Liam R. Howlett [Fri, 20 Jan 2023 16:26:02 +0000 (11:26 -0500)]
maple_tree: add mas_init() function

Patch series "VMA tree type safety and remove __vma_adjust()", v4.

This patchset does two things: 1.  Clean up, including removal of
__vma_adjust() and 2.  Extends the VMA iterator API to provide type safety
to the VMA operations using the maple tree, as requested by Linus [1].

It also addresses another issue of usability brought up by Linus about
needing to modify the maple state within the loops.  The maple state has
been replaced by the VMA iterator and the iterator is now modified within
the MM code so the caller should not need to worry about doing the work
themselves when tree modifications occur.

This brought up a potential inconsistency of the iterator state and what
the user expects, so the inconsistency is addressed to keep the VMA
iterator safe for use after the looping over a VMA range.  This is
addressed in patch 3 ("maple_tree: Reduce user error potential") and 4
("test_maple_tree: Test modifications while iterating").

While cleaning up the state, the duplicate locking code in mm/mmap.c
introduced by the maple tree has been address by abstracting it to two
functions: vma_prepare() and vma_complete().  These abstractions allowed
for a much simpler __vma_adjust(), which eventually leads to the removal
of the __vma_adjust() function by placing the logic into the vma_merge()
function itself.

1. https://lore.kernel.org/linux-mm/CAHk-=wg9WQXBGkNdKD2bqocnN73rDswuWsavBB7T-tekykEn_A@mail.gmail.com/

This patch (of 49):

Add a function that will zero out the maple state struct and set some
basic defaults.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: fix memcpy_from_file_folio() integer underflow
Matthew Wilcox (Oracle) [Fri, 3 Feb 2023 21:28:40 +0000 (16:28 -0500)]
mm: fix memcpy_from_file_folio() integer underflow

If we have a HIGHMEM system with a large folio, 'offset' may be larger
than PAGE_SIZE, and so min_t will cap at 'len' instead of the intended
end-of-page.  That can overflow into the next page which is likely to be
unmapped and fault, but could theoretically copy the wrong data.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 00cdf76012ab ("mm: add memcpy_from_file_folio()")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: "Fabio M. De Francesco" <[email protected]>
Cc: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agoarm/mm: fix swp type masking in __swp_entry()
David Hildenbrand [Wed, 8 Feb 2023 14:08:01 +0000 (15:08 +0100)]
arm/mm: fix swp type masking in __swp_entry()

We're masking with the number of type bits instead of the type mask, which
is obviously wrong.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 20aae9eff5ac ("arm/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Mark Brown <[email protected]>
Tested-by: Mark Brown <[email protected]>
Cc: Russell King (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agompage: convert __mpage_writepage() to use a folio more fully
Matthew Wilcox (Oracle) [Thu, 26 Jan 2023 20:12:55 +0000 (20:12 +0000)]
mpage: convert __mpage_writepage() to use a folio more fully

This is just a conversion to the folio API.  While there are some nods
towards supporting multi-page folios in here, the blocks array is still
sized for one page's worth of blocks, and there are other assumptions such
as the blocks_per_page variable.

[[email protected]: fix accidentally-triggering WARN_ON_ONCE]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agofs: convert writepage_t callback to pass a folio
Matthew Wilcox (Oracle) [Thu, 26 Jan 2023 20:12:54 +0000 (20:12 +0000)]
fs: convert writepage_t callback to pass a folio

Patch series "Convert writepage_t to use a folio".

More folioisation.  I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.

This patch (of 2):

We always write back an entire folio, but that's currently passed as the
head page.  Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: add memcpy_from_file_folio()
Matthew Wilcox (Oracle) [Thu, 26 Jan 2023 20:15:52 +0000 (20:15 +0000)]
mm: add memcpy_from_file_folio()

This is the equivalent of memcpy_from_page().  It differs in that it takes
the position in a file instead of offset in a folio, it accepts the total
number of bytes to be copied (instead of the number of bytes to be copied
from this folio) and it returns how many bytes were copied from the folio,
rather than making the caller calculate that and then checking if the
caller got it right.

[[email protected]: fix typo in comment]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: "Fabio M. De Francesco" <[email protected]>
Cc: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agoblock: remove ->rw_page
Christoph Hellwig [Wed, 25 Jan 2023 13:34:36 +0000 (14:34 +0100)]
block: remove ->rw_page

The ->rw_page method is a special purpose bypass of the usual bio handling
path that is limited to single-page reads and writes and synchronous which
causes a lot of extra code in the drivers, callers and the block layer.

The only remaining user is the MM swap code.  Switch that swap code to
simply submit a single-vec on-stack bio an synchronously wait on it based
on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that
currently implement ->rw_page instead.  While this touches one extra cache
line and executes extra code, it simplifies the block layer and drivers
and ensures that all feastures are properly supported by all drivers, e.g.
right now ->rw_page bypassed cgroup writeback entirely.

[[email protected]: fix comment typo, per Dan]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: factor out a swap_writepage_bdev helper
Christoph Hellwig [Wed, 25 Jan 2023 13:34:35 +0000 (14:34 +0100)]
mm: factor out a swap_writepage_bdev helper

Split the block device case from swap_readpage into a separate helper,
following the abstraction for file based swap.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: remove the __swap_writepage return value
Christoph Hellwig [Wed, 25 Jan 2023 13:34:34 +0000 (14:34 +0100)]
mm: remove the __swap_writepage return value

__swap_writepage always returns 0.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: use an on-stack bio for synchronous swapin
Christoph Hellwig [Wed, 25 Jan 2023 13:34:33 +0000 (14:34 +0100)]
mm: use an on-stack bio for synchronous swapin

Optimize the synchronous swap in case by using an on-stack bio instead of
allocating one using bio_alloc.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: factor out a swap_readpage_bdev helper
Christoph Hellwig [Wed, 25 Jan 2023 13:34:32 +0000 (14:34 +0100)]
mm: factor out a swap_readpage_bdev helper

Split the block device case from swap_readpage into a separate helper,
following the abstraction for file based swap and frontswap.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: remove the swap_readpage return value
Christoph Hellwig [Wed, 25 Jan 2023 13:34:31 +0000 (14:34 +0100)]
mm: remove the swap_readpage return value

swap_readpage always returns 0, and no caller checks the return value.

[[email protected]: fix void-returning swap_readpage() stub, per Keith]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agompage: stop using bdev_{read,write}_page
Christoph Hellwig [Wed, 25 Jan 2023 13:34:30 +0000 (14:34 +0100)]
mpage: stop using bdev_{read,write}_page

Patch series "remove ->rw_page".

This series removes the ->rw_page block_device_operation, which is an old
and clumsy attempt at a simple read/write fast path for the block layer.
It isn't actually used by the fastest block layer operations that we
support (polled I/O through io_uring), but only used by the mpage buffered
I/O helpers which are some of the slowest I/O we have and do not make any
difference there at all, and zram which is a block device abused to
duplicate the zram functionality.

Given that zram is heavily used we need to make sure there is a good
replacement for synchronous I/O, so this series adds a new flag for
drivers that complete I/O synchronously and uses that flag to use on-stack
bios and synchronous submission for them in the swap code.

This patch (of 7):

These are micro-optimizations for synchronous I/O, which do not matter
compared to all the other inefficiencies in the legacy buffer_head based
mpage code.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: refactor va_remove_mappings
Christoph Hellwig [Sat, 21 Jan 2023 07:10:51 +0000 (08:10 +0100)]
mm: refactor va_remove_mappings

Move the VM_FLUSH_RESET_PERMS to the caller and rename the function to
better describe what it is doing.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: split __vunmap
Christoph Hellwig [Sat, 21 Jan 2023 07:10:50 +0000 (08:10 +0100)]
mm: split __vunmap

vunmap only needs to find and free the vmap_area and vm_strut, so open
code that there and merge the rest of the code into vfree.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: move debug checks from __vunmap to remove_vm_area
Christoph Hellwig [Sat, 21 Jan 2023 07:10:49 +0000 (08:10 +0100)]
mm: move debug checks from __vunmap to remove_vm_area

All these checks apply to the free_vm_area interface as well, so move them
to the common routine.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: use remove_vm_area in __vunmap
Christoph Hellwig [Sat, 21 Jan 2023 07:10:48 +0000 (08:10 +0100)]
mm: use remove_vm_area in __vunmap

Use the common helper to find and remove a vmap_area instead of open
coding it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: move __remove_vm_area out of va_remove_mappings
Christoph Hellwig [Sat, 21 Jan 2023 07:10:47 +0000 (08:10 +0100)]
mm: move __remove_vm_area out of va_remove_mappings

__remove_vm_area is the only part of va_remove_mappings that requires a
vmap_area.  Move the call out to the caller and only pass the vm_struct to
va_remove_mappings.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: call vfree instead of __vunmap from delayed_vfree_work
Christoph Hellwig [Sat, 21 Jan 2023 07:10:46 +0000 (08:10 +0100)]
mm: call vfree instead of __vunmap from delayed_vfree_work

This adds an extra, never taken, in_interrupt() branch, but will allow to
cut down the maze of vfree helpers.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: move vmalloc_init and free_work down in vmalloc.c
Christoph Hellwig [Sat, 21 Jan 2023 07:10:45 +0000 (08:10 +0100)]
mm: move vmalloc_init and free_work down in vmalloc.c

Move these two functions around a bit to avoid forward declarations.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: remove __vfree_deferred
Christoph Hellwig [Sat, 21 Jan 2023 07:10:44 +0000 (08:10 +0100)]
mm: remove __vfree_deferred

Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on
from vfree if called from interrupt context so that the extra low-level
helper can be avoided.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: remove __vfree
Christoph Hellwig [Sat, 21 Jan 2023 07:10:43 +0000 (08:10 +0100)]
mm: remove __vfree

__vfree is a subset of vfree that just skips a few checks, and which is
only used by vfree and an error cleanup path.  Fold __vfree into vfree and
switch the only other caller to call vfree() instead.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: reject vmap with VM_FLUSH_RESET_PERMS
Christoph Hellwig [Sat, 21 Jan 2023 07:10:42 +0000 (08:10 +0100)]
mm: reject vmap with VM_FLUSH_RESET_PERMS

Patch series "cleanup vfree and vunmap".

This little series untangles the vfree and vunmap code path a bit.

This patch (of 10):

VM_FLUSH_RESET_PERMS is just for use with vmalloc as it is tied to freeing
the underlying pages.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm, compaction: finish pageblocks on complete migration failure
Mel Gorman [Wed, 25 Jan 2023 13:44:34 +0000 (13:44 +0000)]
mm, compaction: finish pageblocks on complete migration failure

Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") address an issue where a pageblock selected by
fast_find_migrateblock() was ignored.  Unfortunately, the same fix
resulted in numerous reports of khugepaged or kcompactd stalling for long
periods of time or consuming 100% of CPU.

Tracing showed that there was a lot of rescanning between a small subset
of pageblocks because the conditions for marking the block skip are not
met.  The scan is not reaching the end of the pageblock because enough
pages were isolated but none were migrated successfully.  Eventually it
circles back to the same block.

Pageblock skip tracking tries to minimise both latency and excessive
scanning but tracking exactly when a block is fully scanned requires an
excessive amount of state.  This patch forcibly rescans a pageblock when
all isolated pages fail to migrate even though it could be for transient
reasons such as page writeback or page dirty.  This will sometimes migrate
too many pages but pageblocks will be marked skip and forward progress
will be made.

"Usemen" from the mmtests configuration
workload-usemem-stress-numa-compact was used to stress compaction.  The
compaction trace events were recorded using a 6.2-rc5 kernel that includes
commit 7efc3b726103 and count of unique ranges were measured.  The top 5
ranges were

   3076 range=(0x10ca00-0x10cc00)
   3076 range=(0x110a00-0x110c00)
   3098 range=(0x13b600-0x13b800)
   3104 range=(0x141c00-0x141e00)
  11424 range=(0x11b600-0x11b800)

While this workload is very different than what the bugs reported, the
pattern of the same subset of blocks being repeatedly scanned is observed.
At one point, *only* the range range=(0x11b600 ~ 0x11b800) was scanned
for 2 seconds.  14 seconds passed between the first migration-related
event and the last.

With the series applied including this patch, the top 5 ranges were

      1 range=(0x11607e-0x116200)
      1 range=(0x116200-0x116278)
      1 range=(0x116278-0x116400)
      1 range=(0x116400-0x116424)
      1 range=(0x116424-0x116600)

Only unique ranges were scanned and the time between the first
migration-related event was 0.11 milliseconds.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Signed-off-by: Mel Gorman <[email protected]>
Cc: Chuyi Zhou <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Maxim Levitsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm, compaction: finish scanning the current pageblock if requested
Mel Gorman [Wed, 25 Jan 2023 13:44:33 +0000 (13:44 +0000)]
mm, compaction: finish scanning the current pageblock if requested

cc->finish_pageblock is set when the current pageblock should be rescanned
but fast_find_migrateblock can select an alternative block.  Disable
fast_find_migrateblock when the current pageblock scan should be
completed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Cc: Chuyi Zhou <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Maxim Levitsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm, compaction: check if a page has been captured before draining PCP pages
Mel Gorman [Wed, 25 Jan 2023 13:44:32 +0000 (13:44 +0000)]
mm, compaction: check if a page has been captured before draining PCP pages

If a page has been captured then draining is unnecssary so check first for
a captured page.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Cc: Chuyi Zhou <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Maxim Levitsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm, compaction: rename compact_control->rescan to finish_pageblock
Mel Gorman [Wed, 25 Jan 2023 13:44:31 +0000 (13:44 +0000)]
mm, compaction: rename compact_control->rescan to finish_pageblock

Patch series "Fix excessive CPU usage during compaction".

Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
fixed a problem where pageblocks found by fast_find_migrateblock() were
ignored. Unfortunately there were numerous bug reports complaining about high
CPU usage and massive stalls once 6.1 was released. Due to the severity,
the patch was reverted by Vlastimil as a short-term fix[1] to -stable.

The underlying problem for each of the bugs is suspected to be the
repeated scanning of the same pageblocks.  This series should guarantee
forward progress even with commit 7efc3b726103.  More information is in
the changelog for patch 4.

[1] http://lore.kernel.org/r/20230113173345[email protected]

This patch (of 4):

The rescan field was not well named albeit accurate at the time.  Rename
the field to finish_pageblock to indicate that the remainder of the
pageblock should be scanned regardless of COMPACT_CLUSTER_MAX.  The intent
is that pageblocks with transient failures get marked for skipping to
avoid revisiting the same pageblock.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Cc: Chuyi Zhou <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Maxim Levitsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/gup.c: fix typo in comments
Jongwoo Han [Wed, 25 Jan 2023 18:08:47 +0000 (03:08 +0900)]
mm/gup.c: fix typo in comments

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jongwoo Han <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agokasan: reset page tags properly with sampling
Andrey Konovalov [Tue, 24 Jan 2023 20:35:26 +0000 (21:35 +0100)]
kasan: reset page tags properly with sampling

The implementation of page_alloc poisoning sampling assumed that
tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations.
However, this is no longer the case since commit 70c248aca9e7 ("mm: kasan:
Skip unpoisoning of user pages").

This leads to kernel crashes when MTE-enabled userspace mappings are used
with Hardware Tag-Based KASAN enabled.

Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook().

Also clarify and fix related comments.

[[email protected]: update comment]
Link: https://lkml.kernel.org/r/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/24ea20c1b19c2b4b56cf9f5b354915f8dbccfc77.1674592496.git.andreyknvl@google.com
Fixes: 44383cef54c0 ("kasan: allow sampling page_alloc allocations for HW_TAGS")
Signed-off-by: Andrey Konovalov <[email protected]>
Reported-by: Peter Collingbourne <[email protected]>
Tested-by: Peter Collingbourne <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Marco Elver <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/sparse: fix "unused function 'pgdat_to_phys'" warning
Mike Rapoport [Sat, 21 Jan 2023 10:11:51 +0000 (12:11 +0200)]
mm/sparse: fix "unused function 'pgdat_to_phys'" warning

W=1 build with clangs complains:

mm/sparse.c:347:27: warning: unused function 'pgdat_to_phys' [-Wunused-function]
static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
                             ^
1 warning generated.

pgdat_to_phys() is only used by functions defined when
CONFIG_MEMORY_HOTREMOVE=y.

Move pgdat_to_phys() under #ifdef CONFIG_MEMORY_HOTREMOVE
to make clang happy.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Cc: Miles Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/page_owner: record single timestamp value for high order allocations
Hyeonggon Yoo [Sat, 21 Jan 2023 16:50:54 +0000 (01:50 +0900)]
mm/page_owner: record single timestamp value for high order allocations

When allocating a high-order page, separate allocation timestamp is
recorded for each sub-page resulting in different timestamp values between
them.

This behavior is not consistent with the behavior when recording free
timestamp and caused confusion when analyzing memory dumps.  Record single
timestamp for the entire allocation, aligning with the behavior for free
timestamps.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hyeonggon Yoo <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: memory-failure: document memory failure stats
Jiaqi Yan [Fri, 20 Jan 2023 03:46:22 +0000 (03:46 +0000)]
mm: memory-failure: document memory failure stats

Add documentation for memory_failure's per NUMA node sysfs entries

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: memory-failure: bump memory failure stats to pglist_data
Jiaqi Yan [Fri, 20 Jan 2023 03:46:21 +0000 (03:46 +0000)]
mm: memory-failure: bump memory failure stats to pglist_data

Right before memory_failure finishes its handling, accumulate poisoned
page's resolution counters to pglist_data's memory_failure_stats, so as to
update the corresponding sysfs entries.

Tested:
1) Start an application to allocate memory buffer chunks
2) Convert random memory buffer addresses to physical addresses
3) Inject memory errors using EINJ at chosen physical addresses
4) Access poisoned memory buffer and recover from SIGBUS
5) Check counter values under
   /sys/devices/system/node/node*/memory_failure/*

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: memory-failure: add memory failure stats to sysfs
Jiaqi Yan [Fri, 20 Jan 2023 03:46:20 +0000 (03:46 +0000)]
mm: memory-failure: add memory failure stats to sysfs

Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========

In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over hardware patrol scrubber is the ability to
make statistics visible to system administrators.  The statistics include
2 categories:

* Memory error statistics, for example, how many memory error are
  encountered, how many of them are recovered by the kernel.  Note these
  memory errors are non-fatal to kernel: during the machine check
  exception (MCE) handling kernel already classified MCE's severity to be
  unnecessary to panic (but either action required or optional).

* Scanner statistics, for example how many times the scanner have fully
  scanned a NUMA node, how many errors are first detected by the scanner.

The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.

Motivation
==========

Memory error stats are important to userspace but insufficient in kernel
today.  Datacenter administrators can better monitor a machine's memory
health with the visible stats.  For example, while memory errors are
inevitable on servers with 10+ TB memory, starting server maintenance when
there are only 1~2 recovered memory errors could be overreacting; in cloud
production environment maintenance usually means live migrate all the
workload running on the server and this usually causes nontrivial
disruption to the customer.  Providing insight into the scope of memory
errors on a system helps to determine the appropriate follow-up action.
In addition, the kernel's existing memory error stats need to be
standardized so that userspace can reliably count on their usefulness.

Today kernel provides following memory error info to userspace, but they
are not sufficient or have disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
  not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't
  capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Exposing memory error stats is also a good start for the in-kernel memory
error detector.  Today the data source of memory error stats are either
direct memory error consumption, or hardware patrol scrubber detection
(either signaled as UCNA or SRAO).  Once in-kernel memory scanner is
implemented, it will be the main source as it is usually configured to
scan memory DIMMs constantly and faster than hardware patrol scrubber.

How Implemented
===============

As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of software or hardware scanner.  Therefore we
implement the memory error statistics independent of the in-kernel memory
error detector.  It exposes the following per NUMA node memory error
counters:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively.  This approach can be
easier to extend for future use cases than /proc/meminfo, trace event, and
log.  The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot.

The 1st commit introduces these sysfs entries.  The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery.  The 3rd commit adds documentations for introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6

This patch (of 3):

Today kernel provides following memory error info to userspace, but each
has its own disadvantage

* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
  not per NUMA node stats though

* ras:memory_failure_event: only available after explicitly enabled

* /dev/mcelog provides many useful info about the MCEs, but
  doesn't capture how memory_failure recovered memory MCEs

* kernel logs: userspace needs to process log text

Exposes per NUMA node memory error stats as sysfs entries:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively.  The following math
holds for the statistics:

* total = recovered + ignored + failed + delayed

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: simplify lru_gen_look_around()
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:27 +0000 (00:18 +0000)]
mm: multi-gen LRU: simplify lru_gen_look_around()

Update the folio generation in place with or without
current->reclaim_state->mm_walk.  The LRU lock is held for longer, if
mm_walk is NULL and the number of folios to update is more than
PAGEVEC_SIZE.

This causes a measurable regression from the LRU lock contention during a
microbencmark.  But a tiny regression is not worth the complexity.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: improve walk_pmd_range()
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:26 +0000 (00:18 +0000)]
mm: multi-gen LRU: improve walk_pmd_range()

Improve readability of walk_pmd_range() and walk_pmd_range_locked().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: improve lru_gen_exit_memcg()
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:25 +0000 (00:18 +0000)]
mm: multi-gen LRU: improve lru_gen_exit_memcg()

Add warnings and poison ->next.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: section for memcg LRU
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:24 +0000 (00:18 +0000)]
mm: multi-gen LRU: section for memcg LRU

Move memcg LRU code into a dedicated section.  Improve the design doc to
outline its architecture.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: section for Bloom filters
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:23 +0000 (00:18 +0000)]
mm: multi-gen LRU: section for Bloom filters

Move Bloom filters code into a dedicated section.  Improve the design doc
to explain Bloom filter usage and connection between aging and eviction in
their use.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: section for rmap/PT walk feedback
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:22 +0000 (00:18 +0000)]
mm: multi-gen LRU: section for rmap/PT walk feedback

Add a section for lru_gen_look_around() in the code and the design doc.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: multi-gen LRU: section for working set protection
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:21 +0000 (00:18 +0000)]
mm: multi-gen LRU: section for working set protection

Patch series "mm: multi-gen LRU: improve".

This patch series improves a few MGLRU functions, collects related
functions, and adds additional documentation.

This patch (of 7):

Add a section for working set protection in the code and the design doc.
The admin doc already contains its usage.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: move KMEMLEAK's Kconfig items from lib to mm
Zhaoyang Huang [Thu, 19 Jan 2023 01:22:24 +0000 (09:22 +0800)]
mm: move KMEMLEAK's Kconfig items from lib to mm

Have the kmemleak's source code and Kconfig items be in the same directory.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhaoyang Huang <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: ke.wang <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/damon/core-test: add a test for damon_update_monitoring_results()
SeongJae Park [Thu, 19 Jan 2023 01:38:31 +0000 (01:38 +0000)]
mm/damon/core-test: add a test for damon_update_monitoring_results()

Add a simple unit test for damon_update_monitoring_results() function.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/damon/core: update monitoring results for new monitoring attributes
SeongJae Park [Thu, 19 Jan 2023 01:38:30 +0000 (01:38 +0000)]
mm/damon/core: update monitoring results for new monitoring attributes

region->nr_accesses is the number of sampling intervals in the last
aggregation interval that access to the region has found, and region->age
is the number of aggregation intervals that its access pattern has
maintained.  Hence, the real meaning of the two fields' values is
depending on current sampling and aggregation intervals.

This means the values need to be updated for every sampling and/or
aggregation intervals updates.  As DAMON core doesn't, it is a duty of
in-kernel DAMON framework applications like DAMON sysfs interface, or the
userspace users.

Handling it in userspace or in-kernel DAMON application is complicated,
inefficient, and repetitive compared to doing the update in DAMON core.
Do the update in DAMON core.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/damon: update comments in damon.h for damon_attrs
SeongJae Park [Thu, 19 Jan 2023 01:38:29 +0000 (01:38 +0000)]
mm/damon: update comments in damon.h for damon_attrs

Patch series "mm/damon: misc fixes".

This patchset contains three miscellaneous simple fixes for DAMON online
tuning.

This patch (of 3):

Commit cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring
attributes") moved monitoring intervals from damon_ctx to a new struct,
damon_attrs, but a comment in the header file has not updated for the
change.  Update it.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring attributes")
Signed-off-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/kmemleak: fix UAF bug in kmemleak_scan()
Waiman Long [Thu, 19 Jan 2023 04:01:11 +0000 (23:01 -0500)]
mm/kmemleak: fix UAF bug in kmemleak_scan()

Commit 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object
iteration loop of kmemleak_scan()") fixes soft lockup problem in
kmemleak_scan() by periodically doing a cond_resched().  It does take a
reference of the current object before doing it.  Unfortunately, if the
object has been deleted from the object_list, the next object pointed to
by its next pointer may no longer be valid after coming back from
cond_resched().  This can result in use-after-free and other nasty
problem.

Fix this problem by adding a del_state flag into kmemleak_object structure
to synchronize the object deletion process between kmemleak_cond_resched()
and __remove_object() to make sure that the object remained in the
object_list in the duration of the cond_resched() call.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()")
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/kmemleak: simplify kmemleak_cond_resched() usage
Waiman Long [Thu, 19 Jan 2023 04:01:10 +0000 (23:01 -0500)]
mm/kmemleak: simplify kmemleak_cond_resched() usage

Patch series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF", v2.

It was found that a KASAN use-after-free error was reported in the
kmemleak_scan() function.  After further examination, it is believe that
even though a reference is taken from the current object, it does not
prevent the object pointed to by the next pointer from going away after a
cond_resched().

To fix that, additional flags are added to make sure that the current
object won't be removed from the object_list during the duration of the
cond_resched() to ensure the validity of the next pointer.

While making the change, I also simplify the current usage of
kmemleak_cond_resched() to make it easier to understand.

This patch (of 2):

The presence of a pinned argument and the 64k loop count make
kmemleak_cond_resched() a bit more complex to read.  The pinned argument
is used only by first kmemleak_scan() loop.

Simplify the usage of kmemleak_cond_resched() by removing the pinned
argument and always do a get_object()/put_object() sequence.  In addition,
the 64k loop is removed by using need_resched() to decide if
kmemleak_cond_resched() should be called.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agokselftest: vm: add tests for memory-deny-write-execute
Kees Cook [Thu, 19 Jan 2023 16:03:44 +0000 (16:03 +0000)]
kselftest: vm: add tests for memory-deny-write-execute

Add some tests to cover the new PR_SET_MDWE prctl.

Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Joey Gouly <[email protected]>
Signed-off-by: Joey Gouly <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Jeremy Linton <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: nd <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Cc: Zbigniew JÄ™drzejewski-Szmek <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm: implement memory-deny-write-execute as a prctl
Joey Gouly [Thu, 19 Jan 2023 16:03:43 +0000 (16:03 +0000)]
mm: implement memory-deny-write-execute as a prctl

Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
v2.

The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter.  Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable.  Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but subsequently
changed to read-only.  Therefore the filter simply rejects any
mprotect(PROT_EXEC).  The side-effect is that on arm64 with BTI support
(Branch Target Identification), the dynamic loader cannot change an ELF
section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().  For
libraries, it can resort to unmapping and re-mapping but for the main
executable it does not have a file descriptor.  The original bug report in
the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
- [4].

This series adds in-kernel support for this feature as a prctl
PR_SET_MDWE, that is inherited on fork().  The prctl denies PROT_WRITE |
PROT_EXEC mappings.  Like the systemd BPF filter it also denies adding
PROT_EXEC to mappings.  However unlike the BPF filter it only denies it if
the mapping didn't previous have PROT_EXEC.  This allows to PROT_EXEC ->
PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF
filter.

This patch (of 2):

The aim of such policy is to prevent a user task from creating an
executable mapping that is also writeable.

An example of mmap() returning -EACCESS if the policy is enabled:

mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);

Similarly, mprotect() would return -EACCESS below:

addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);

The BPF filter that systemd MDWE uses is stateless, and disallows
mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
be enabled if it was already PROT_EXEC, which allows the following case:

addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

where PROT_BTI enables branch tracking identification on arm64.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joey Gouly <[email protected]>
Co-developed-by: Catalin Marinas <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jeremy Linton <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: nd <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Cc: Zbigniew JÄ™drzejewski-Szmek <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agotools/mm: allow users to provide additional cflags/ldflags
Herton R. Krzesinski [Mon, 16 Jan 2023 22:49:21 +0000 (19:49 -0300)]
tools/mm: allow users to provide additional cflags/ldflags

Right now there is no way to provide additional cflags/ldflags when
building tools/vm binaries.  And using eg.  make CFLAGS=<options> will
override the CFLAGS being set in the Makefile, making the build fail since
it requires the include of the ../lib dir (for libapi).

This change then allows you to specify:
CFLAGS=<options> LDFLAGS=<options> make V=1 -C tools/vm

And the options will be correctly appended as can be seen from the
make output.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Herton R. Krzesinski <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Justin Forbes <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Scott Weaver <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agoDocumentation: mm: use `s/higmem/highmem/` fix typo for highmem
Deming Wang [Wed, 18 Jan 2023 02:54:03 +0000 (21:54 -0500)]
Documentation: mm: use `s/higmem/highmem/` fix typo for highmem

We should use highmem replace higmem.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Deming Wang <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: "Fabio M. De Francesco" <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2 years agomm/cma: fix potential memory loss on cma_declare_contiguous_nid
Levi Yun [Wed, 18 Jan 2023 08:05:23 +0000 (17:05 +0900)]
mm/cma: fix potential memory loss on cma_declare_contiguous_nid

Suppose memblock_alloc_range_nid() with highmem_start succeeds when
cma_declare_contiguous_nid is called with !fixed on a 32-bit system with
PHYS_ADDR_T_64BIT enabled with memblock.bottom_up == false.

But the next trial to memblock_alloc_range_nid() to allocate in [SIZE_4G,
limits) nullifies former successfully allocated addr and it retries
memblock_alloc_ragne_nid().

In this situation, the first successfully allocated address area is lost.

Change the order of allocation (SIZE_4G, high_memory and base) and check
whether the allocated succeeded to prevent potential memory loss.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Levi Yun <[email protected]>
Cc: Laurent Pinchart <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This page took 0.125019 seconds and 4 git commands to generate.