Git Repo - linux.git/log

NTB: Fix an error code in ntb_msit_probe()

When nm->isr_ctx is NULL, the value of ret is still 0.
So, set ret to -ENOMEM to indicate this error.

Clean up smatch warning:
drivers/ntb/test/ntb_msi_test.c:373 ntb_msit_probe() warn: missing
error code 'ret'.
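
The pattern behind this kind of warning can be sketched in userspace C.
This is only an illustration, not the driver code: probe_sketch(),
struct ctx_sketch and the fail_alloc knob are hypothetical names made up
for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-device test context. */
struct ctx_sketch {
	int *isr_ctx;
};

/*
 * Sketch of a probe-style function. "fail_alloc" is a test knob
 * (an assumption of this example) that simulates allocation failure.
 */
static int probe_sketch(struct ctx_sketch *c, int fail_alloc)
{
	int ret = 0;	/* earlier setup steps succeeded, so ret is 0 here */

	c->isr_ctx = fail_alloc ? NULL : calloc(4, sizeof(*c->isr_ctx));
	if (!c->isr_ctx) {
		/* Without this assignment the error path would return 0. */
		ret = -ENOMEM;
		goto out;
	}

	return 0;
out:
	return ret;
}
```

Calling probe_sketch() with fail_alloc set returns -ENOMEM instead of a
misleading 0, which is the behaviour the fix is after.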

Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Jon Mason <[email protected]>

ntb: intel: remove invalid email address in header comment

Remove Jon's old email address.

Signed-off-by: Dave Jiang <[email protected]>
Signed-off-by: Jon Mason <[email protected]>

Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux

Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
  VM_DENYWRITE.

  There are some (minor) user-visible changes:

   - We no longer deny write access to shared libraries loaded via legacy
     uselib(); this behavior matches modern user space e.g. dlopen().

   - We no longer deny write access to the elf interpreter after exec
     completed, treating it just like shared libraries (which it often
     is).

   - We always deny write access to the file linked via /proc/pid/exe:
     sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
     file cannot be denied, and write access to the file will remain
     denied until the link is effectively gone (exec, termination,
     sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.

  Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
  s390x, ...) and verified via ltp that especially the relevant tests
  (i.e., creat07 and execve04) continue working as expected"

* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
  fs: update documentation of get_write_access() and friends
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  mm: remove VM_DENYWRITE
  binfmt: remove in-tree usage of MAP_DENYWRITE
  kernel/fork: always deny write access to current MM exe_file
  kernel/fork: factor out replacing the current MM exe_file
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()

Merge git://github.com/Paragon-Software-Group/linux-ntfs3

Merge NTFSv3 filesystem from Konstantin Komarov:
"This patch adds NTFS Read-Write driver to fs/ntfs3.

  Having decades of expertise in commercial file systems development and
  huge test coverage, we at Paragon Software GmbH want to make our
  contribution to the Open Source Community by providing implementation
  of NTFS Read-Write driver for the Linux Kernel.

  This is fully functional NTFS Read-Write driver. Current version works
  with NTFS (including v3.1) and normal/compressed/sparse files and
  supports journal replaying.

  We plan to support this version once the codebase is merged, and to
  add new features and fix bugs. For example, full journaling support
  over JBD will be added in later updates"

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
* git://github.com/Paragon-Software-Group/linux-ntfs3: (35 commits)
  fs/ntfs3: Change how module init/info messages are displayed
  fs/ntfs3: Remove GPL boilerplates from decompress lib files
  fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter
  fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()
  fs/ntfs3: Restyle comments to better align with kernel-doc
  fs/ntfs3: Rework file operations
  fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now
  fs/ntfs3: Restyle comments to better align with kernel-doc
  fs/ntfs3: Fix error handling in indx_insert_into_root()
  fs/ntfs3: Potential NULL dereference in hdr_find_split()
  fs/ntfs3: Fix error code in indx_add_allocate()
  fs/ntfs3: fix an error code in ntfs_get_acl_ex()
  fs/ntfs3: add checks for allocation failure
  fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc
  fs/ntfs3: Do not use driver own alloc wrappers
  fs/ntfs3: Use kernel ALIGN macros over driver specific
  fs/ntfs3: Restyle comment block in ni_parse_reparse()
  fs/ntfs3: Remove unused including <linux/version.h>
  fs/ntfs3: Fix fall-through warnings for Clang
  fs/ntfs3: Fix one none utf8 char in source file
  ...

Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
  contention, misbehaving compress_cache, allowing extent_cache for
  compressed files, and new sysfs to adjust ra_size for fadvise.

  In order to diagnose the performance issues quickly, we also added an
  iostat which shows the IO latencies periodically.

  On the stability side, we've found two memory leakage cases in the
  error path in compression flow. And, we've also fixed various corner
  cases in fiemap, quota, checkpoint=disable, zstd, and so on.

  Enhancements:
   - avoid long checkpoint latency by releasing nat_tree_lock
   - collect and show iostats periodically
   - support extent_cache for compressed files
   - add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
   - report f2fs GC status via sysfs
   - add discard_unit=%s in mount option to handle zoned device

  Bug fixes:
   - fix two memory leakages when an error happens in the compressed IO flow
   - fix compress_cache to get the right LBA
   - fix fiemap to deal with compressed case correctly
   - fix wrong EIO returns due to SBI_NEED_FSCK
   - fix missing writes when enabling checkpoint back
   - fix quota deadlock
   - fix zstd level mount option

  In addition to the above major updates, we've cleaned up several code
  paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
  check, and typos"

* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
  f2fs: should put a page beyond EOF when preparing a write
  f2fs: deallocate compressed pages when error happens
  f2fs: enable realtime discard iff device supports discard
  f2fs: guarantee to write dirty data when enabling checkpoint back
  f2fs: fix to unmap pages from userspace process in punch_hole()
  f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
  f2fs: fix to account missing .skipped_gc_rwsem
  f2fs: adjust unlock order for cleanup
  f2fs: Don't create discard thread when device doesn't support realtime discard
  f2fs: rebuild nat_bits during umount
  f2fs: introduce periodic iostat io latency traces
  f2fs: separate out iostat feature
  f2fs: compress: do sanity check on cluster
  f2fs: fix description about main_blkaddr node
  f2fs: convert S_IRUGO to 0444
  f2fs: fix to keep compatibility of fault injection interface
  f2fs: support fault injection for f2fs_kmem_cache_alloc()
  f2fs: compress: allow write compress released file after truncate to zero
  f2fs: correct comment in segment.h
  f2fs: improve sbi status info in debugfs/f2fs/status
  ...

cdrom: update uniform CD-ROM maintainership in MAINTAINERS file

Update maintainership for the uniform CD-ROM driver from Jens Axboe to
Phillip Potter in MAINTAINERS file, to reflect the attempt to pass on
maintainership of this driver to a different individual. Also remove
URL to site which is no longer active.

Suggested-by: Jens Axboe <[email protected]>
Signed-off-by: Phillip Potter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"New Features:
   - Better client responsiveness when server isn't replying
   - Use refcount_t in sunrpc rpc_client refcount tracking
   - Add srcaddr and dst_port to the sunrpc sysfs info files
   - Add basic support for connection sharing between servers with multiple NICs

  Bugfixes and Cleanups:
   - Sunrpc tracepoint cleanups
   - Disconnect after ib_post_send() errors to avoid deadlocks
   - Fix for tearing down rpcrdma_reps
   - Fix a potential pNFS layoutget livelock loop
   - pNFS layout barrier fixes
   - Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
   - Fix reconnection locking
   - Fix return value of get_srcport()
   - Remove rpcrdma_post_sends()
   - Remove pNFS dead code
   - Remove copy size restriction for inter-server copies
   - Overhaul the NFS callback service
   - Clean up sunrpc TCP socket shutdowns
   - Always provide aligned buffers to RPC read layers"

* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
  NFS: Always provide aligned buffers to the RPC read layers
  NFSv4.1 add network transport when session trunking is detected
  SUNRPC enforce creation of no more than max_connect xprts
  NFSv4 introduce max_connect mount options
  SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
  SUNRPC keep track of number of transports to unique addresses
  NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
  SUNRPC: Tweak TCP socket shutdown in the RPC client
  SUNRPC: Simplify socket shutdown when not reusing TCP ports
  NFSv4.2: remove restriction of copy size for inter-server copy.
  NFS: Clean up the synopsis of callback process_op()
  NFS: Extract the xdr_init_encode/decode() calls from decode_compound
  NFS: Remove unused callback void decoder
  NFS: Add a private local dispatcher for NFSv4 callback operations
  SUNRPC: Eliminate the RQ_AUTHERR flag
  SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
  SUNRPC: Add svc_rqst::rq_auth_stat
  SUNRPC: Add dst_port to the sysfs xprt info file
  SUNRPC: Add srcaddr as a file in sysfs
  sunrpc: Fix return value of get_srcport()
  ...

octeontx2-af: Fix some memory leaks in the error handling path of 'cgx_lmac_init()'

Memory allocated before 'lmac' is stored in 'cgx->lmac_idmap[]' must be
freed explicitly. Otherwise, in case of error, it will leak.

Rename the 'err_irq' label to better describe what is done at this place in
the error handling path.
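
A minimal userspace sketch of the ownership rule described above, with
error labels named after what they undo. All names here
(lmac_init_sketch(), struct lmac_sketch, struct owner_sketch, the
fail_step knob) are hypothetical and only stand in for the driver's
structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for 'lmac' and the table that will own it. */
struct lmac_sketch {
	char *name;
};

struct owner_sketch {
	struct lmac_sketch *idmap[4];
};

/*
 * Until the object is stored in owner->idmap[], nobody else will free
 * it, so every error exit before that point must release it here.
 * "fail_step" is a test knob of this example: 1 fails the name
 * allocation, 2 fails a later setup step.
 */
static int lmac_init_sketch(struct owner_sketch *o, int id, int fail_step)
{
	struct lmac_sketch *lmac;
	int err;

	lmac = calloc(1, sizeof(*lmac));
	if (!lmac)
		return -ENOMEM;

	lmac->name = fail_step == 1 ? NULL : calloc(1, 16);
	if (!lmac->name) {
		err = -ENOMEM;
		goto err_lmac_free;
	}

	if (fail_step == 2) {
		err = -EINVAL;
		goto err_name_free;
	}

	o->idmap[id] = lmac;	/* ownership transferred to the table */
	return 0;

err_name_free:
	free(lmac->name);
err_lmac_free:
	free(lmac);
	return err;
}
```

Each label frees exactly what was allocated before the jump, so the
failing paths leave nothing behind and the table entry stays NULL.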

Fixes: 6f14078e3ee5 ("octeontx2-af: DMAC filter support in MAC block")
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

octeontx2-af: Add a 'rvu_free_bitmap()' function

In order to match 'rvu_alloc_bitmap()', add a 'rvu_free_bitmap()' function.

Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: Fix an error code in cxgb2.c

When adapter->registered_device_map is NULL, the value of err is
uncertain, so set err to -EINVAL to avoid ambiguity.

Clean up smatch warning:
drivers/net/ethernet/chelsio/cxgb/cxgb2.c:1114 init_one() warn: missing
error code 'err'

Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom

Previous commit 68233c583ab4 removed the qlcnic_rom_lock() call
in qlcnic_pinit_from_rom() but left its corresponding unlock
call in place, which is odd. I'm not sure whether the
lock is missing or the unlock is redundant. This bug was
suggested by a static analysis tool; please advise.

Fixes: 68233c583ab4 ("qlcnic: updated reset sequence")
Signed-off-by: Dinghao Liu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fq_codel: reject silly quantum parameters

syzbot found that forcing a big quantum attribute would crash hosts fast,
essentially using this:

tc qd replace dev eth0 root fq_codel quantum 4294967295

This is because fq_codel_dequeue() would have to loop
~2^31 times in :

	if (flow->deficit <= 0) {
		flow->deficit += q->quantum;
		list_move_tail(&flow->flowchain, &q->old_flows);
		goto begin;
	}

SFQ max quantum is 2^19 (half a megabyte)
Let's adopt a max quantum of one megabyte for FQ_CODEL.
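
A small userspace sketch of the arithmetic and of an upper-bound check.
It assumes a signed 32-bit deficit and an unsigned 32-bit quantum; the
macro and function names are made up for this example and the kernel's
actual identifiers may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* One megabyte cap, as proposed in the text above (name is hypothetical). */
#define QUANTUM_MAX_SKETCH (1u << 20)

/* Reject quantum values above the cap. */
static int check_quantum_sketch(uint32_t quantum)
{
	return quantum > QUANTUM_MAX_SKETCH ? -EINVAL : 0;
}

/*
 * One "deficit += quantum" step. The addition is done in unsigned
 * arithmetic and converted back to a signed value, which wraps on the
 * usual two's-complement targets.
 */
static int32_t deficit_step_sketch(int32_t deficit, uint32_t quantum)
{
	return (int32_t)((uint32_t)deficit + quantum);
}
```

Under these assumptions a quantum of 4294967295 behaves like adding -1,
so a deficit that starts at or below zero only becomes positive after
wrapping around, which is consistent with the ~2^31 iterations
mentioned above. A sane value such as 1514 makes the deficit positive
in one step.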

Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mm, slub: convert kmem_cpu_slab protection to local_lock

Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of
local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
equivalent, with better lockdep visibility. On PREEMPT_RT that means better
preemption.

However, the cost on PREEMPT_RT is the loss of lockless fast paths which only
work with cpu freelist. Those are designed to detect and recover from being
preempted by other conflicting operations (both fast or slow path), but the
slow path operations assume they cannot be preempted by a fast path operation,
which is guaranteed naturally with disabled irqs. With local locks on
PREEMPT_RT, the fast paths now also need to take the local lock to avoid races.

In the allocation fastpath slab_alloc_node() we can just defer to the slowpath
__slab_alloc() which also works with cpu freelist, but under the local lock.
In the free fastpath do_slab_free() we have to add a new local lock protected
version of freeing to the cpu freelist, as the existing slowpath only works
with the page freelist.

Also update the comment about locking scheme in SLUB to reflect changes done
by this series.

[ Mike Galbraith <[email protected]>: use local_lock() without irq in PREEMPT_RT
scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ]
Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: use migrate_disable() on PREEMPT_RT

We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize
the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with
the list_lock spinlock. We can use migrate_disable() instead, but that
increases overhead on !PREEMPT_RT as it's an unconditional function call.

In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

Jann Horn reported [1] the following theoretically possible race:

  task A: put_cpu_partial() calls preempt_disable()
  task A: oldpage = this_cpu_read(s->cpu_slab->partial)
  interrupt: kfree() reaches unfreeze_partials() and discards the page
  task B (on another CPU): reallocates page as page cache
  task A: reads page->pages and page->pobjects, which are actually
  halves of the pointer page->lru.prev
  task B (on another CPU): frees page
  interrupt: allocates page as SLUB page and places it on the percpu partial list
  task A: this_cpu_cmpxchg() succeeds

  which would cause page->pages and page->pobjects to end up containing
  halves of pointers that would then influence when put_cpu_partial()
  happens and show up in root-only sysfs files. Maybe that's acceptable,
  I don't know. But there should probably at least be a comment for now
  to point out that we're reading union fields of a page that might be
  in a completely different state.

Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only safe
against s->cpu_slab->partial manipulation in ___slab_alloc() if the latter
disables irqs, otherwise a __slab_free() in an irq handler could call
put_cpu_partial() in the middle of ___slab_alloc() manipulating ->partial
and corrupt it. This becomes an issue on RT after a local_lock is introduced
in a later patch. The fix means taking the local_lock also in put_cpu_partial()
on RT.

After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect put_cpu_partial()
with disabled irqs (to be converted to local_lock_irqsave() later) everywhere.
This should be acceptable as it's not a fast path, and moving the actual
partial unfreezing outside of the irq disabled section makes it short, and with
the retry loop gone the code can be also simplified. In addition, the race
reported by Jann should no longer be possible.

[1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/
[2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb [email protected]/

Reported-by: Jann Horn <[email protected]>
Suggested-by: Mike Galbraith <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: make slab_lock() disable irqs with PREEMPT_RT

We need to disable irqs around slab_lock() (a bit spinlock) to make it
irq-safe. Most calls to slab_lock() are nested under spin_lock_irqsave() which
doesn't disable irqs on PREEMPT_RT, so add explicit disabling with PREEMPT_RT.
The exception is cmpxchg_double_slab() which already disables irqs, so use a
__slab_[un]lock() variant without irq disable there.

slab_[un]lock() thus needs a flags pointer parameter, which is unused on !RT.
free_debug_processing() now has two flags variables, which looks odd, but only
one is actually used - the one used in spin_lock_irqsave() on !RT and the one
used in slab_lock() on RT.

As a result, __cmpxchg_double_slab() and cmpxchg_double_slab() become
effectively identical on RT, as both will disable irqs, which is necessary on
RT as most callers of this function also rely on irqsaving lock operations.
Thus, assert that irqs are already disabled in __cmpxchg_double_slab() only on
!RT and also change the VM_BUG_ON assertion to the more standard lockdep_assert
one.

Signed-off-by: Vlastimil Babka <[email protected]>

mm: slub: make object_map_lock a raw_spinlock_t

The variable object_map is protected by object_map_lock. The lock is always
acquired in debug code and within already atomic context.

Make object_map_lock a raw_spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>

loop: reduce the loop_ctl_mutex scope

syzbot is reporting a circular locking problem at __loop_clr_fd() [1], for
commit a160c6159d4a0cf8 ("block: add an optional probe callback to
major_names") is calling the module's probe function with major_names_lock
held.

Fortunately, since commit 990e78116d38059c ("block: loop: fix deadlock
between open and remove") stopped holding loop_ctl_mutex in lo_open(),
current role of loop_ctl_mutex is to serialize access to loop_index_idr
and loop_add()/loop_remove(); in other words, management of id for IDR.
To avoid holding loop_ctl_mutex during whole add/remove operation, use
a bool flag to indicate whether the loop device is ready for use.
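
The "reserve an id under the lock, finish setup outside it, then mark
the device ready" idea can be sketched as below. This is a
single-threaded illustration with hypothetical names (dev_sketch,
dev_reserve_sketch(), dev_publish_sketch(), dev_lookup_sketch()); a real
implementation would set and read the flag with proper locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEV_SKETCH 4

struct dev_sketch {
	bool used;	/* id is reserved */
	bool ready;	/* setup finished, safe to hand out */
	int id;
};

static struct dev_sketch devs_sketch[MAX_DEV_SKETCH];

/* Would run under the id-management lock: only reserves an id. */
static struct dev_sketch *dev_reserve_sketch(void)
{
	for (int i = 0; i < MAX_DEV_SKETCH; i++) {
		struct dev_sketch *d = &devs_sketch[i];

		if (!d->used) {
			d->used = true;
			d->ready = false;
			d->id = i;
			return d;
		}
	}
	return NULL;
}

/* Called after the slow setup, which needs no lock in this sketch. */
static void dev_publish_sketch(struct dev_sketch *d)
{
	d->ready = true;
}

/* Lookups treat a reserved but unfinished device as absent. */
static struct dev_sketch *dev_lookup_sketch(int id)
{
	if (id < 0 || id >= MAX_DEV_SKETCH)
		return NULL;
	if (!devs_sketch[id].used || !devs_sketch[id].ready)
		return NULL;
	return &devs_sketch[id];
}
```

The point of the split is that the lock only has to cover the short id
bookkeeping, while the flag keeps half-initialized devices invisible.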

loop_unregister_transfer(), which is called from cleanup_cryptoloop(),
currently has a possible use-after-free problem due to lack of
serialization between kfree() from loop_remove() from loop_control_remove()
and mutex_lock() from unregister_transfer_cb(). But since lo->lo_encryption
should be already NULL when this function is called due to module unload,
and commit 222013f9ac30b9ce ("cryptoloop: add a deprecation warning")
indicates that we will remove this function shortly, this patch updates
this function to emit warning instead of checking lo->lo_encryption.

Holding loop_ctl_mutex in loop_exit() is pointless, for all users must
close /dev/loop-control and /dev/loop$num (in order to drop module's
refcount to 0) before loop_exit() starts, and nobody can open
/dev/loop-control or /dev/loop$num afterwards.

Link: https://syzkaller.appspot.com/bug?id=7bb10e8b62f83e4d445cdf4c13d69e407e629558
Reported-by: syzbot <[email protected]>
Signed-off-by: Tetsuo Handa <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Protect nft_ct template with global mutex, from Pavel Skripkin.

2) Two recent commits switched inet rt and nexthop exception hashes
   from jhash to siphash. If those two spots are problematic then
   conntrack is affected as well, so switch over to siphash too.
   While at it, add a hard upper limit on chain lengths and reject
   insertion if this is hit. Patches from Florian Westphal.

3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN,
   from Benjamin Hesmans.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: socket: icmp6: fix use-after-scope
  netfilter: refuse insertion if chain has grown too large
  netfilter: conntrack: switch to siphash
  netfilter: conntrack: sanitize table size default settings
  netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
====================
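
The "hard upper limit on chain lengths" from item 2 can be illustrated
with a toy hash table. Everything here is an assumption of the sketch:
the names, the limit of 4, the -ENOSPC return value and the modulo
"hash", which is only a placeholder and not siphash.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NBUCKETS_SKETCH 8
#define MAX_CHAIN_SKETCH 4	/* illustrative limit, not the kernel's value */

struct node_sketch {
	unsigned int key;
	struct node_sketch *next;
};

struct table_sketch {
	struct node_sketch *bucket[NBUCKETS_SKETCH];
};

/*
 * Insert at the head of the bucket's chain, but first walk the chain
 * and refuse the insertion once it already holds MAX_CHAIN_SKETCH
 * entries.
 */
static int table_insert_sketch(struct table_sketch *t, struct node_sketch *n)
{
	unsigned int b = n->key % NBUCKETS_SKETCH;
	unsigned int len = 0;
	struct node_sketch *pos;

	for (pos = t->bucket[b]; pos; pos = pos->next) {
		if (++len >= MAX_CHAIN_SKETCH)
			return -ENOSPC;
	}

	n->next = t->bucket[b];
	t->bucket[b] = n;
	return 0;
}
```

With such a cap, keys that all collide into one bucket cannot make
lookups in that bucket arbitrarily slow; the extra insertions are
refused instead.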

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ionic: fix a sleeping in atomic bug

This code is holding spin_lock_bh(&lif->rx_filters.lock); so the
allocation needs to be atomic.

Fixes: 969f84394604 ("ionic: sync the filters in the work task")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Shannon Nelson <[email protected]>
Link: https://lore.kernel.org/r/20210903131856.GA25934@kili
Signed-off-by: Jakub Kicinski <[email protected]>

mm: slub: move flush_cpu_slab() and __free_slab() invocations out of IRQ context

flush_all() flushes a specific SLAB cache on each CPU (where the cache
is present). The deactivate_slab()/__free_slab() invocation happens
within IPI handler and is problematic for PREEMPT_RT.

The flush operation is not a frequent operation or a hot path. The
per-CPU flush operation can be moved to within a workqueue.

Because a workqueue handler, unlike IPI handler, does not disable irqs,
flush_slab() now has to disable them for working with the kmem_cache_cpu
fields. deactivate_slab() is safe to call with irqs enabled.

[[email protected]: adapt to new SLUB changes]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>

mm, slab: split out the cpu offline variant of flush_slab()

flush_slab() is called either as part of the IPI handler on a given live cpu, or as a
cleanup on behalf of another cpu that went offline. The first case needs to
protect updating the kmem_cache_cpu fields with disabled irqs. Currently the
whole call happens with irqs disabled by the IPI handler, but the following
patch will change from IPI to workqueue, and flush_slab() will have to disable
irqs (to be replaced with a local lock later) in the critical part.

To prepare for this change, replace the call to flush_slab() for the dead cpu
handling with an opencoded variant that will not disable irqs nor take a local
lock.

Suggested-by: Mike Galbraith <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: don't disable irqs in slub_cpu_dead()

slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only
functions that are now irq safe, so we don't need to disable irqs anymore.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: only disable irq with spin_lock in __unfreeze_partials()

__unfreeze_partials() no longer needs to have irqs disabled, except for making
the spin_lock operations irq-safe, so convert the spin_locks operations and
remove the separate irq handling.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing

Unfreezing partial list can be split to two phases - detaching the list from
struct kmem_cache_cpu, and processing the list. The whole operation does not
need to be protected by disabled irqs. Restructure the code to separate the
detaching (with disabled irqs) and unfreezing (with irq disabling to be reduced
in the next patch).

Also, unfreeze_partials() can be called from another cpu on behalf of a cpu
that is being offlined, where disabling irqs on the local cpu makes no sense, so
restructure the code as follows:

- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
  detached percpu partial list
- unfreeze_partials() detaches list from current cpu with irqs disabled and
  calls __unfreeze_partials()
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
  irq disabling, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu thus it needs to call
  unfreeze_partials(). So it can't simply call
  __flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the
  proper calls.
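
The two phases can be sketched in userspace C as follows. A plain flag
stands in for disabling and restoring irqs, and all names
(page_sketch, cpu_cache_sketch, unfreeze_sketch(), ...) are
hypothetical; this is not the SLUB code.

```c
#include <assert.h>
#include <stddef.h>

struct page_sketch {
	struct page_sketch *next;
	int processed;
};

struct cpu_cache_sketch {
	struct page_sketch *partial;
};

/* Stand-in for local_irq_save()/local_irq_restore(): just a flag. */
static int irqs_off_sketch;
/* Records whether any page was processed while the flag was set. */
static int processed_with_irqs_off_sketch;

/* Phase 2: walk the detached list; the per-cpu cache is not touched. */
static int process_list_sketch(struct page_sketch *list)
{
	int n = 0;

	while (list) {
		struct page_sketch *next = list->next;

		if (irqs_off_sketch)
			processed_with_irqs_off_sketch = 1;
		list->next = NULL;
		list->processed = 1;
		list = next;
		n++;
	}
	return n;
}

/* Phase 1: detach the whole list in a short critical section. */
static int unfreeze_sketch(struct cpu_cache_sketch *c)
{
	struct page_sketch *list;

	irqs_off_sketch = 1;		/* "disable irqs" */
	list = c->partial;
	c->partial = NULL;
	irqs_off_sketch = 0;		/* "restore irqs" */

	return list ? process_list_sketch(list) : 0;
}
```

Only the two pointer operations sit inside the protected section; the
potentially long walk over the list happens afterwards on a list that
nobody else can reach any more.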

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: detach whole partial list at once in unfreeze_partials()

Instead of iterating through the live percpu partial list, detach it from the
kmem_cache_cpu at once. This is simpler and will allow further optimization.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: discard slabs in unfreeze_partials() without irqs disabled

No need for disabled irqs when discarding slabs, so restore them before
discarding.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: move irq control into unfreeze_partials()

unfreeze_partials() can be optimized so that it doesn't need irqs disabled for
the whole time. As the first step, move irq control into the function and
remove it from the put_cpu_partial() caller.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: call deactivate_slab() without disabling irqs

The function is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.

When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to
reenable them before deactivate_slab() we need to open-code flush_slab() in
___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields.
But that means an IRQ handler meanwhile might have assigned a new page to
kmem_cache_cpu.page so we have to retry the whole check.

The remaining callers of flush_slab() are the IPI handler which has disabled
irqs anyway, and slub_cpu_dead() which will be dealt with in the following
patch.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: make locking in deactivate_slab() irq-safe

deactivate_slab() now no longer touches the kmem_cache_cpu structure, so it will
be possible to call it with irqs enabled. Just convert the spin_lock calls to
their irq saving/restoring variants to make it irq-safe.

Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because
in some situations we don't take the list_lock, which would disable irqs.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: move reset of c->page and freelist out of deactivate_slab()

deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's
freelist and putting the slab on the proper node's list. It also sets the
respective kmem_cache_cpu pointers to NULL.

By extracting the kmem_cache_cpu operations from the function, we can make it
not dependent on disabled irqs.

Also if we return a single free pointer from ___slab_alloc, we no longer have
to assign kmem_cache_cpu.page before deactivation or care if somebody preempted
us and assigned a different page to our kmem_cache_cpu in the process.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: stop disabling irqs around get_partial()

The function get_partial() does not need to have irqs disabled as a whole. It's
sufficient to convert spin_lock operations to their irq saving/restoring
versions.

As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: check new pages with restored irqs

Building on top of the previous patch, re-enable irqs before checking new
pages. alloc_debug_processing() is now called with enabled irqs so we need to
remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be
a need for it anyway.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: validate slab from partial list or page allocator before making it cpu slab

When we obtain a new slab page from node partial list or page allocator, we
assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo
the assignment.

In order to allow doing the checks without irq disabled, restructure the code
so that the checks are done first, and kmem_cache_cpu.page assignment only
after they pass.
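
The reordering amounts to "check first, assign second". A minimal
sketch with hypothetical types and helpers (not the SLUB structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct slab_page_sketch {
	bool consistent;	/* result of the (hypothetical) sanity checks */
};

struct cpu_slab_sketch {
	struct slab_page_sketch *page;
};

static bool page_checks_pass_sketch(const struct slab_page_sketch *p)
{
	return p && p->consistent;
}

/*
 * Check first, assign second. A failed check leaves c->page untouched,
 * so there is nothing to undo and the checks need no protection of c.
 */
static bool install_page_sketch(struct cpu_slab_sketch *c,
				struct slab_page_sketch *p)
{
	if (!page_checks_pass_sketch(p))
		return false;

	c->page = p;	/* only this store touches the per-cpu structure */
	return true;
}
```

Because the per-cpu structure is written only after the checks pass,
the checks themselves no longer have to run inside the protected
section.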

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: restore irqs around calling new_slab()

allocate_slab() currently re-enables irqs before calling to the page allocator.
It depends on gfpflags_allow_blocking() to determine if it's safe to do so.
Now we can instead simply restore irq before calling it through new_slab().
The other caller early_kmem_cache_node_alloc() is unaffected by this.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()

Continue reducing the irq disabled scope. Check for per-cpu partial slabs
first with irqs enabled and then recheck with irqs disabled before grabbing
the slab page. Mostly preparatory for the following patches.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: do initial checks in ___slab_alloc() with irqs enabled

As another step of shortening irq disabled sections in ___slab_alloc(), delay
disabling irqs until we pass the initial checks if there is a cached percpu
slab and it's suitable for our allocation.

Now we have to recheck c->page after actually disabling irqs as an allocation
in irq handler might have replaced it.
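
The recheck can be sketched as an optimistic read followed by a
comparison inside the protected section. The hook parameter is a test
device of this example that simulates an irq handler running in the
unprotected window; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct recheck_cpu_sketch {
	int *page;
};

/* Called in the window where an irq handler could run. */
typedef void (*irq_hook_sketch_t)(struct recheck_cpu_sketch *c);

static int replacement_sketch;

/* Simulated irq handler: replaces the cached page. */
static void swap_page_hook_sketch(struct recheck_cpu_sketch *c)
{
	c->page = &replacement_sketch;
}

static int *take_page_sketch(struct recheck_cpu_sketch *c,
			     irq_hook_sketch_t hook, int *retries)
{
	int *page;

	*retries = 0;
reread:
	page = c->page;		/* unprotected read, "irqs enabled" */
	if (!page)
		return NULL;

	if (hook) {		/* simulated irq, fires at most once */
		hook(c);
		hook = NULL;
	}

	/* "irqs disabled" from here on */
	if (page != c->page) {
		/* "irqs restored": our earlier read is stale, start over */
		(*retries)++;
		goto reread;
	}
	c->page = NULL;
	return page;
}
```

When the simulated handler swaps the page, the function notices the
mismatch, retries once and returns the replacement rather than the
stale pointer.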

Because we call pfmemalloc_match() as one of the checks, we might hit
VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe()
variant that lacks the PageSlab check.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: move disabling/enabling irqs to ___slab_alloc()

Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
includes cases where this is not needed, such as when the allocation ends up in
the page allocator and has to awkwardly enable irqs back based on gfp flags.
Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
it hits the __slab_alloc() slow path, and long periods with disabled interrupts
are undesirable.

As a first step towards reducing irq disabled periods, move irq handling into
___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not
protect against modification by an irq handler, which is still done by disabled
irq for most of ___slab_alloc(). As a small immediate benefit,
slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled.

kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
before calling ___slab_alloc(), which then disables them at its discretion. The
whole kmem_cache_alloc_bulk() operation also disables preemption.

When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that allow
blocking (this will be improved by later patches).

The patch itself will thus increase overhead a bit due to disabled preemption
(on configs where it matters) and increased disabling/enabling irqs in
kmem_cache_alloc_bulk(), but that will be gradually improved in the following
patches.

Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to
CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in
all configurations. On configs without involuntary preemption and debugging
the re-read of kmem_cache_cpu pointer is still compiled out as it was before.

[ Mike Galbraith <[email protected]>: Fix kmem_cache_alloc_bulk() error path ]
Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: simplify kmem_cache_cpu and tid setup

In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that
our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently
that's done by reading the tid first using this_cpu_read(), then the
kmem_cache_cpu pointer and verifying we read the same tid using the pointer and
plain READ_ONCE().

This can be simplified to just fetching kmem_cache_cpu pointer and then reading
tid using the pointer. That guarantees they are from the same cpu. We don't
need to read the tid using this_cpu_read() because the value will be validated
by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the
freelist didn't change by anyone preempting us since reading the tid.
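
As a rough userspace illustration of why a tid read through the pointer is
enough, the sketch below packs a transaction id and a freelist head into one
64-bit word and commits with a single compare-and-exchange. It is a simplified
model under stated assumptions, not the SLUB code: struct cpu_slab_model, the
model_* helpers and the packing are invented here, and a C11 atomic stands in
for this_cpu_cmpxchg_double().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_OBJECTS 8
#define NO_OBJECT UINT32_MAX

/* Hypothetical stand-in for struct kmem_cache_cpu: the tid lives in the
 * high 32 bits of state, the index of the first free object in the low 32. */
struct cpu_slab_model {
	_Atomic uint64_t state;
	uint32_t next_free[MODEL_OBJECTS];	/* successor of each object */
};

static uint64_t pack(uint32_t tid, uint32_t head)
{
	return ((uint64_t)tid << 32) | head;
}

static void model_init(struct cpu_slab_model *c)
{
	for (uint32_t i = 0; i < MODEL_OBJECTS; i++)
		c->next_free[i] = (i + 1 < MODEL_OBJECTS) ? i + 1 : NO_OBJECT;
	atomic_init(&c->state, pack(0, 0));
}

/* One attempt: commit only if (tid, head) still match the snapshot. */
static bool model_try_alloc(struct cpu_slab_model *c, uint64_t snapshot,
			    uint32_t *obj)
{
	uint32_t tid = (uint32_t)(snapshot >> 32);
	uint32_t head = (uint32_t)snapshot;

	if (head == NO_OBJECT)
		return false;
	if (!atomic_compare_exchange_strong(&c->state, &snapshot,
					    pack(tid + 1, c->next_free[head])))
		return false;	/* someone changed tid or freelist: stale */
	*obj = head;
	return true;
}

/* Fast path: read tid and head through the pointer, retry if validation fails. */
static uint32_t model_alloc(struct cpu_slab_model *c)
{
	uint32_t obj;

	for (;;) {
		uint64_t snapshot = atomic_load(&c->state);

		if ((uint32_t)snapshot == NO_OBJECT)
			return NO_OBJECT;
		if (model_try_alloc(c, snapshot, &obj))
			return obj;
	}
}
```

In this model a snapshot taken before another allocation is rejected by the
compare-and-exchange, which mirrors the validation role the message assigns to
this_cpu_cmpxchg_double().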

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: restructure new page checks in ___slab_alloc()

When we allocate a slab object from a newly acquired page (from a node's partial
list or page allocator), we usually also retain the page as a new percpu slab.
There are two exceptions - when pfmemalloc status of the page doesn't match our
gfp flags, or when the cache has debugging enabled.

The current code for these decisions is not easy to follow, so restructure it
and add comments. The new structure will also help with the following changes.
No functional change.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: return slab page from get_partial() and set c->page afterwards

The function get_partial() finds a suitable page on a partial list, acquires
and returns its freelist and assigns the page pointer to kmem_cache_cpu.
In a later patch we will need more control over the kmem_cache_cpu.page
assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a
pointer to a page that get_partial() can fill and the caller can assign the
kmem_cache_cpu.page pointer. No functional change as all of this still happens
with disabled IRQs.
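
The shape of the interface change can be sketched in plain C. This is a
simplified model, not the kernel function: struct page_model, struct cpu_model
and get_partial_model() are hypothetical names, and locking and IRQ handling
are left out entirely.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for struct page and
 * struct kmem_cache_cpu. */
struct page_model {
	void *freelist;
};

struct cpu_model {
	struct page_model *page;
	void *freelist;
};

/* Find the first page that still has free objects, take over its freelist
 * and report the page through ret_page. The per-cpu structure is not
 * touched here; the caller decides when to assign it. */
static void *get_partial_model(struct page_model *partial, size_t n,
			       struct page_model **ret_page)
{
	for (size_t i = 0; i < n; i++) {
		if (partial[i].freelist) {
			void *freelist = partial[i].freelist;

			partial[i].freelist = NULL;	/* acquired by the caller */
			*ret_page = &partial[i];
			return freelist;
		}
	}
	return NULL;
}
```

A caller in this model would then do the assignment itself, for example
c->freelist = get_partial_model(list, n, &page); followed by c->page = page;.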

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: dissolve new_slab_objects() into ___slab_alloc()

The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it
there. This is a preparatory step with no functional change.

The only minor change is moving WARN_ON_ONCE() for using a constructor together
with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still
able to catch a development change introducing a systematic misuse.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: extract get_partial() from new_slab_objects()

The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
preparatory step with no functional change.

This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>

io_uring: io_uring_complete() trace should take an integer

It currently takes a long, and while that's normally OK, the io_uring
limit is an int. Internally in io_uring it's an int, but sometimes it's
passed as a long. That can yield confusing results where a completion
seems to generate a huge result:

ou-sqp-1297-1298 [001] ...1 788.056371: io_uring_complete: ring 000000000e98e046, user_data 0x0, result 4294967171, cflags 0

which is due to -ECANCELED being stored in an unsigned, and then passed
in as a long. Using the right int type, the trace looks correct:

iou-sqp-338-339 [002] ...1 15.633098: io_uring_complete: ring 00000000e0ac60cf, user_data 0x0, result -125, cflags 0
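
The effect is easy to reproduce outside the kernel. The sketch below is a
userspace illustration, not the tracepoint code: it hard-codes 125 as the
Linux value of ECANCELED, uses int64_t in place of a 64-bit long, and the
helper names are made up for this example.

```c
#include <stdint.h>

/* Linux value of ECANCELED, hard-coded so the sketch does not depend on
 * the host's <errno.h>. */
#define LINUX_ECANCELED 125

/* The result takes a detour through an unsigned 32-bit value before it is
 * widened, so it is zero-extended and the sign is lost. */
static int64_t widen_via_unsigned(int res)
{
	uint32_t stored = (uint32_t)res;

	return (int64_t)stored;
}

/* The result stays a signed int until it is widened, so it is
 * sign-extended and the negative error code survives. */
static int64_t widen_as_int(int res)
{
	return (int64_t)res;
}
```

With -LINUX_ECANCELED as input, the first helper yields 4294967171 and the
second yields -125, matching the two trace lines above.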

Cc: [email protected]
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'linux-kselftest-next-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull Kselftest updates from Shuah Khan:
"Fixes to build and test failures:

   - openat2 test failure for O_LARGEFILE flag on ARM64

   - x86 test build failures related to glibc 2.34 adding support for
     variable sized MINSIGSTKSZ and SIGSTKSZ

   - removing obsolete configs in sync and cpufreq config files

   - minor spelling and duplicate header include cleanups"

* tag 'linux-kselftest-next-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/cpufreq: Rename DEBUG_PI_LIST to DEBUG_PLIST
  selftests/sync: Remove the deprecated config SYNC
  selftests: safesetid: Fix spelling mistake "cant" -> "can't"
  selftests/x86: Fix error: variably modified 'altstack_data' at file scope
  kselftest:sched: remove duplicate include in cs_prctl_test.c
  selftests: openat2: Fix testing failure for O_LARGEFILE flag

Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

- Add -s option (strict mode) to merge_config.sh to make it fail when
   any symbol is redefined.

- Show a warning if a different compiler is used for building external
   modules.

- Infer --target from ARCH for CC=clang to let you cross-compile the
   kernel without CROSS_COMPILE.

- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.

- Add <linux/stdarg.h> to the kernel source instead of borrowing
   <stdarg.h> from the compiler.

- Add Nick Desaulniers as a Kbuild reviewer.

- Drop stale cc-option tests.

- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
   to handle symbols in inline assembly.

- Show a warning if 'FORCE' is missing for if_changed rules.

- Various cleanups

* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
  kbuild: redo fake deps at include/ksym/*.h
  kbuild: clean up objtool_args slightly
  modpost: get the *.mod file path more simply
  checkkconfigsymbols.py: Fix the '--ignore' option
  kbuild: merge vmlinux_link() between ARCH=um and other architectures
  kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
  kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
  kbuild: remove stale *.symversions
  kbuild: remove unused quiet_cmd_update_lto_symversions
  gen_compile_commands: extract compiler command from a series of commands
  x86: remove cc-option-yn test for -mtune=
  arc: replace cc-option-yn uses with cc-option
  s390: replace cc-option-yn uses with cc-option
  ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
  sparc: move the install rule to arch/sparc/Makefile
  security: remove unneeded subdir-$(CONFIG_...)
  kbuild: sh: remove unused install script
  kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
  kbuild: Switch to 'f' variants of integrated assembler flag
  kbuild: Shuffle blank line to improve comment meaning
  ...

Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft

Pull ibft fix from Konrad Rzeszutek Wilk:
"An arm64 compile fix for the new code that fixed the iBFT KASLR
handling. I missed the original 0-day build email report"

* 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
  iscsi_ibft: Fix isa_bus_to_virt not working under ARM

mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()

Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately")
introduced cpu partial flushing for kmemcg caches, based on setting the target
cpu_partial to 0 and adding a flushing check in put_cpu_partial().
This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab:
introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03
("mm: memcg/slab: use a single set of kmem_caches for all accounted
allocations"). However the check and flush in put_cpu_partial() was never
removed, although it's effectively dead code. So this patch removes it.

Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials() which could be thus also considered unnecessary. But
further patches will rely on it, so keep it.

Signed-off-by: Vlastimil Babka <[email protected]>

mm, slub: don't disable irq for debug_check_no_locks_freed()

In slab_free_hook() we disable irqs around the debug_check_no_locks_freed()
call, which is unnecessary, as irqs are already being disabled inside the call.
This seems to be leftover from the past where there were more calls inside the
irq disabled sections. Remove the irq disable/enable operations.

Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: allocate private object map for validate_slab_cache()

validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate a
private object bitmap instead of grabbing the shared static one meant for
critical sections, so let's do that.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: allocate private object map for debugfs listings

Slub has a static spinlock-protected bitmap for marking which objects are on
the freelist when it wants to list them, for situations where dynamically
allocating such a map can lead to recursion or locking issues, and an on-stack
bitmap would be too large.

The handlers of debugfs files alloc_traces and free_traces also currently use this
shared bitmap, but their syscall context makes it straightforward to allocate a
private map before entering locked sections, so switch these processing paths
to use a private bitmap.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>

mm, slub: don't call flush_all() from slab_debug_trace_open()

slab_debug_trace_open() can only be called on caches with SLAB_STORE_USER flag
and as with all slub debugging flags, such caches avoid cpu or percpu partial
slabs altogether, so there's nothing to flush.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>

Merge tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

- Convert pseries & powernv to use MSI IRQ domains.

- Rework the pseries CPU numbering so that CPUs that are removed, and
   later re-added, are given a CPU number on the same node as
   previously, when possible.

- Add support for a new more flexible device-tree format for specifying
   NUMA distances.

- Convert powerpc to GENERIC_PTDUMP.

- Retire sbc8548 and sbc8641d board support.

- Various other small features and fixes.

Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard,
Cédric Le Goater, Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas,
Fangrui Song, Finn Thain, Gautham R.  Shenoy, Hari Bathini, Joel
Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo Bras, Lukas
Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan
Chancellor, Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R.
Sampat, Randy Dunlap, Sebastian Andrzej Siewior, Srikar Dronamraju, Wan
Jiabing, Xiongwei Song, and Zheng Yongjun.

* tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (154 commits)
  powerpc/bug: Cast to unsigned long before passing to inline asm
  powerpc/ptdump: Fix generic ptdump for 64-bit
  KVM: PPC: Fix clearing never mapped TCEs in realmode
  powerpc/pseries/iommu: Rename "direct window" to "dma window"
  powerpc/pseries/iommu: Make use of DDW for indirect mapping
  powerpc/pseries/iommu: Find existing DDW with given property name
  powerpc/pseries/iommu: Update remove_dma_window() to accept property name
  powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
  powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
  powerpc/pseries/iommu: Allow DDW windows starting at 0x00
  powerpc/pseries/iommu: Add ddw_list_new_entry() helper
  powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
  powerpc/kernel/iommu: Add new iommu_table_in_use() helper
  powerpc/pseries/iommu: Replace hard-coded page shift
  powerpc/numa: Update cpu_cpu_map on CPU online/offline
  powerpc/numa: Print debug statements only when required
  powerpc/numa: convert printk to pr_xxx
  powerpc/numa: Drop dbg in favour of pr_debug
  powerpc/smp: Enable CACHE domain for shared processor
  powerpc/smp: Update cpu_core_map on all PowerPc systems
  ...

Merge tag 'for-5.15/parisc-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc architecture fixes from Helge Deller:
"Fix an unaligned-access crash in the bootloader and drop asm/swab.h"

* tag 'for-5.15/parisc-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix unaligned-access crash in bootloader
  parisc: Drop __arch_swab16(), arch_swab24(), _arch_swab32() and __arch_swab64() functions

Merge tag 'mips_5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS updates from Thomas Bogendoerfer:

- converted Pistachio platform to use MIPS generic kernel

- fixes and cleanups

* tag 'mips_5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (29 commits)
  MIPS: Malta: fix alignment of the devicetree buffer
  MIPS: ingenic: Unconditionally enable clock of CPU #0
  MIPS: mscc: ocelot: mark the phy-mode for internal PHY ports
  MIPS: mscc: ocelot: disable all switch ports by default
  MAINTAINERS: adjust PISTACHIO SOC SUPPORT after its retirement
  MIPS: Return true/false (not 1/0) from bool functions
  MIPS: generic: Return true/false (not 1/0) from bool functions
  MIPS: Make a alias for pistachio_defconfig
  MIPS: Retire MACH_PISTACHIO
  MIPS: config: generic: Add config for Marduk board
  pinctrl: pistachio: Make it as an option
  phy: pistachio-usb: Depend on MIPS || COMPILE_TEST
  clocksource/drivers/pistachio: Make it selectable for MIPS
  clk: pistachio: Make it selectable for generic MIPS kernel
  MIPS: DTS: Pistachio add missing cpc and cdmm
  MIPS: generic: Allow generating FIT image for Marduk board
  MIPS: locking/atomic: Fix atomic{_64,}_sub_if_positive
  MIPS: loongson2ef: don't build serial.o unconditionally
  MIPS: Replace deprecated CPU-hotplug functions.
  MIPS: Alchemy: Fix spelling contraction "cant" -> "can't"
  ...

Merge tag 'for-linus' of git://github.com/openrisc/linux

Pull OpenRISC updates from Stafford Horne:
"A few cleanups and compiler warning fixes for OpenRISC.

  Also, this includes dts and defconfig updates to enable Ethernet on
  OpenRISC/Litex FPGA SoC's now that the LiteEth driver has gone
  upstream"

* tag 'for-linus' of git://github.com/openrisc/linux:
  openrisc/litex: Update defconfig
  openrisc/litex: Add ethernet device
  openrisc/litex: Update uart address
  openrisc: Fix compiler warnings in setup
  openrisc: rename or32 code & comments to or1k
  openrisc: don't printk() unconditionally

Merge tag 'livepatching-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching update from Petr Mladek.

* tag 'livepatching-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Replace deprecated CPU-hotplug functions.

Merge tag 'iommu-updates-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

- New DART IOMMU driver for Apple Silicon M1 chips

- Optimizations for iommu_[map/unmap] performance

- Selective TLB flush support for the AMD IOMMU driver to make it more
   efficient on emulated IOMMUs

- Rework IOVA setup and default domain type setting to move more code
   out of IOMMU drivers and to support runtime switching between certain
   types of default domains

- VT-d Updates from Lu Baolu:
      - Update the virtual command related registers
      - Enable Intel IOMMU scalable mode by default
      - Preset A/D bits for user space DMA usage
      - Allow devices to have more than 32 outstanding PRs
      - Various cleanups

- ARM SMMU Updates from Will Deacon:
      SMMUv3:
       - Minor optimisation to avoid zeroing struct members on CMD submission
       - Increased use of batched commands to reduce submission latency
       - Refactoring in preparation for ECMDQ support
      SMMUv2:
       - Fix races when probing devices with identical StreamIDs
       - Optimise walk cache flushing for Qualcomm implementations
       - Allow deep sleep states for some Qualcomm SoCs with shared clocks

- Various smaller optimizations, cleanups, and fixes

* tag 'iommu-updates-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (85 commits)
  iommu/io-pgtable: Abstract iommu_iotlb_gather access
  iommu/arm-smmu: Fix missing unlock on error in arm_smmu_device_group()
  iommu/vt-d: Add present bit check in pasid entry setup helpers
  iommu/vt-d: Use pasid_pte_is_present() helper function
  iommu/vt-d: Drop the kernel doc annotation
  iommu/vt-d: Allow devices to have more than 32 outstanding PRs
  iommu/vt-d: Preset A/D bits for user space DMA usage
  iommu/vt-d: Enable Intel IOMMU scalable mode by default
  iommu/vt-d: Refactor Kconfig a bit
  iommu/vt-d: Remove unnecessary oom message
  iommu/vt-d: Update the virtual command related registers
  iommu: Allow enabling non-strict mode dynamically
  iommu: Merge strictness and domain type configs
  iommu: Only log strictness for DMA domains
  iommu: Expose DMA domain strictness via sysfs
  iommu: Express DMA strictness via the domain type
  iommu/vt-d: Prepare for multiple DMA domain types
  iommu/arm-smmu: Prepare for multiple DMA domain types
  iommu/amd: Prepare for multiple DMA domain types
  iommu: Introduce explicit type for non-strict DMA domains
  ...

Merge branch 'stable/for-linus-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb

Pull swiotlb updates from Konrad Rzeszutek Wilk:
"A new feature called restricted DMA pools. It allows SWIOTLB to
  utilize per-device (or per-platform) allocated memory pools instead of
  using the global one.

  The first big user of this is ARM Confidential Computing where the
  memory for DMA operations can be set per platform"

* 'stable/for-linus-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb: (23 commits)
  swiotlb: use depends on for DMA_RESTRICTED_POOL
  of: restricted dma: Don't fail device probe on rmem init failure
  of: Move of_dma_set_restricted_buffer() into device.c
  powerpc/svm: Don't issue ultracalls if !mem_encrypt_active()
  s390/pv: fix the forcing of the swiotlb
  swiotlb: Free tbl memory in swiotlb_exit()
  swiotlb: Emit diagnostic in swiotlb_exit()
  swiotlb: Convert io_default_tlb_mem to static allocation
  of: Return success from of_dma_set_restricted_buffer() when !OF_ADDRESS
  swiotlb: add overflow checks to swiotlb_bounce
  swiotlb: fix implicit debugfs declarations
  of: Add plumbing for restricted DMA pool
  dt-bindings: of: Add restricted DMA pool
  swiotlb: Add restricted DMA pool initialization
  swiotlb: Add restricted DMA alloc/free support
  swiotlb: Refactor swiotlb_tbl_unmap_single
  swiotlb: Move alloc_size to swiotlb_find_slots
  swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
  swiotlb: Update is_swiotlb_active to add a struct device argument
  swiotlb: Update is_swiotlb_buffer to add a struct device argument
  ...

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <[email protected]>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...

mm/madvise: add MADV_WILLNEED to process_madvise()

One use case in Android is that an app process's memory is swapped out by
process_madvise() with MADV_PAGEOUT, for example to zram or a backing
device. When the process is scheduled to run again, such as on a switch to
the foreground, the resulting burst of page faults may cause the app to
drop frames.

To mitigate this, system management software can read ahead the memory of
the process as soon as the app switches to the foreground. Calling
process_madvise() with MADV_WILLNEED can meet this need.
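
A minimal userspace sketch of such a call is shown below. It is an
illustration under assumptions, not code from the patch: prefetch_range() is a
hypothetical wrapper, the raw syscall is used because older C libraries ship
no process_madvise() wrapper, and the fallback syscall number 440 is assumed
to match the target architecture.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* assumed fallback for old headers */
#endif

/* Hypothetical wrapper: ask the kernel to read ahead one address range of
 * the process referred to by pidfd. Returns the number of bytes advised,
 * or -1 with errno set (for example EBADF for an invalid pidfd, or ENOSYS
 * on kernels without the syscall). */
static long prefetch_range(int pidfd, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	return syscall(SYS_process_madvise, pidfd, &iov, 1UL,
		       MADV_WILLNEED, 0U);
}
```

The pidfd would typically come from pidfd_open(2), and addr and len describe a
range in the target process's address space, not in the caller's.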

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: zhangkui <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmstat: remove unneeded return value

The return values of pagetypeinfo_showfree and pagetypeinfo_showblockcount
are unused now. Remove them.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmstat: simplify the array size calculation

We can replace the array_num * sizeof(array[0]) with sizeof(array) to
simplify the code.
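
The equivalence holds for any true array, as the sketch below shows. The
array and helper names are hypothetical and do not come from mm/vmstat.c.
Note that it does not hold once the array has decayed to a pointer, where
sizeof yields the pointer size instead.

```c
#include <stddef.h>

/* Hypothetical table, standing in for whatever array the patch touches. */
static const char *const names[] = { "a", "b", "c", "d" };

#define NAMES_NUM (sizeof(names) / sizeof(names[0]))

/* Old style: element count times element size. */
static size_t size_by_count(void)
{
	return NAMES_NUM * sizeof(names[0]);
}

/* New style: let the compiler report the size of the whole array. */
static size_t size_direct(void)
{
	return sizeof(names);
}
```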

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmstat: correct some wrong comments

Patch series "Cleanup for vmstat".

This series contains cleanups to remove unneeded return value, correct
wrong comment and simplify the array size calculation. More details can
be found in the respective changelogs.

This patch (of 3):

Correct wrong fls(mem+1) to fls(mem)+1 and remove the duplicated comment
with quiet_vmstat().
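
The two expressions are not interchangeable, which a small userspace model
makes visible. fls_model() below is a stand-in written for this example,
assuming the usual fls() convention (1-based index of the most significant
set bit, 0 for an input of 0); it is not the kernel implementation.

```c
/* Userspace stand-in for fls(): position of the most significant set bit,
 * counted from 1, and 0 when no bit is set. */
static int fls_model(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}
```

For mem = 4, fls_model(mem + 1) is 3 while fls_model(mem) + 1 is 4. For
mem = 3 both expressions give 3, so the difference is easy to miss.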

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()

Commit b239f7daf553 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE")
removed the parameter 'for_alloc', so remove this comment.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jing Xiangfeng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests: vm: add COW time test for KSM pages

Since merged pages are copied every time they need to be modified, the
write access time is different between shared and non-shared pages.  Add
ksm_cow_time() function which evaluates latency of these COW breaks.
First, 4000 pages are allocated and the time required to modify 1 byte in
every other page is measured.  After this, the pages are merged into 2000
pairs and in each pair, 1 page is modified (i.e.  they are decoupled) to
detect COW breaks.  The time needed to break COW of merged pages is then
compared with performance of non-shared pages.

The test is run as follows: ./ksm_tests -C
The output:
Total size:    15 MiB

Not merged pages:
Total time:     0.002185489 s
Average speed:  3202.945 MiB/s

Merged pages:
Total time:     0.004386872 s
Average speed:  1595.670 MiB/s

Link: https://lkml.kernel.org/r/1d03ee0d1b341959d4b61672c6401d498bff5652.1629386192.git.zhansayabagdaulet@gmail.com
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests: vm: add KSM merging time test

Patch series "add KSM performance tests", v3.

Extend KSM self tests with a performance benchmark.  These tests are not
part of regular regression testing, as they are mainly intended to be used
by developers making changes to the memory management subsystem.

This patch (of 2):

Add ksm_merge_time() function to determine speed and time needed for
merging.  The total spent time is shown in seconds while speed is in
MiB/s.  The user must specify the size of the duplicated memory area (in MiB)
before running the test.

The test is run as follows: ./ksm_tests -P -s 100
The output:
Total size:    100 MiB
Total time:    0.201106786 s
Average speed:  497.248 MiB/s
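
The sample output is consistent with the average speed being the total size
divided by the total time. That relation is an inference from the numbers
above, not taken from the test source, and the helper name is hypothetical.

```c
/* Hypothetical helper: average speed in MiB/s from a size in MiB and a
 * duration in seconds. */
static double avg_speed_mib_per_s(double size_mib, double seconds)
{
	return size_mib / seconds;
}
```

Calling avg_speed_mib_per_s(100.0, 0.201106786) gives roughly 497.248, the
value shown in the output.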

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/318b946ac80cc9205c89d0962048378f7ce0705b.1629386192.git.zhansayabagdaulet@gmail.com
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: KSM: fix data type

ksm_stable_node_chains_prune_millisecs is declared as int, but in
stable_node_chains_prune_millisecs_store(), it can store values up to
UINT_MAX. Change its type to unsigned int.
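
The mismatch can be illustrated with a userspace analogue of a store handler.
parse_uint() is a hypothetical helper built on strtoul(), not the kernel
function; it accepts anything up to UINT_MAX, so the variable receiving the
result has to be an unsigned int to hold every accepted value.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical analogue of a sysfs store handler: parse a decimal string
 * into an unsigned int. Returns 0 on success and -1 on malformed or
 * out-of-range input. */
static int parse_uint(const char *buf, unsigned int *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(buf, &end, 10);
	if (errno || end == buf || *end != '\0' || v > UINT_MAX)
		return -1;
	*out = (unsigned int)v;
	return 0;
}
```

An input of "4294967295" is accepted here, yet it is larger than INT_MAX and
would not survive being kept in an int.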

Link: https://lkml.kernel.org/r/20210806111351.GA71845@asus
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests: vm: add KSM merging across nodes test

Add check_ksm_numa_merge() function to test that pages in different NUMA
nodes are being handled properly.  First, two duplicate pages are
allocated in two separate NUMA nodes using the libnuma library.  Since
there is one unique page in each node, with merge_across_nodes = 0, there
won't be any shared pages.  If merge_across_nodes is set to 1, the pages
will be treated as usual duplicate pages and will be merged.  If NUMA
config is not enabled or the number of NUMA nodes is less than two, then
the test is skipped.  The test is run as follows: ./ksm_tests -N

Link: https://lkml.kernel.org/r/071c17b5b04ebb0dfeba137acc495e5dd9d2a719.1626252248.git.zhansayabagdaulet@gmail.com
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests: vm: add KSM zero page merging test

Add check_ksm_zero_page_merge() function to test that empty pages are
being handled properly.  For this, several zero pages are allocated and
merged using madvise.  If use_zero_pages is enabled, the pages must be
shared with the special kernel zero pages; otherwise, they are merged as
usual duplicate pages.  The test is run as follows: ./ksm_tests -Z

Link: https://lkml.kernel.org/r/6d0caab00d4bdccf5e3791cb95cf6dfd5eb85e45.1626252248.git.zhansayabagdaulet@gmail.com
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests: vm: add KSM unmerge test

Add check_ksm_unmerge() function to verify that KSM is properly unmerging
shared pages.  For this, two duplicate pages are merged first and then
their contents are modified.  Since they are not identical anymore, the
pages must be unmerged and the number of merged pages has to be 0.  The
test is run as follows: ./ksm_tests -U

Link: https://lkml.kernel.org/r/c0f55420440d704d5b094275b4365aa1b2ad46b5.1626252248.git.zhansayabagdaulet@gmail.com
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests: vm: add KSM merge test

Patch series "add KSM selftests".

Introduce selftests to validate the functionality of KSM.  The tests are
run on private anonymous pages.  Since some KSM tunables are modified,
their starting values are saved and restored after testing.  At the start,
run is set to 2 to ensure that only test pages will be merged (we assume
that no applications make madvise syscalls in the background).  If the KSM
config is not enabled, all tests will be skipped.

This patch (of 4):

Add check_ksm_merge() function to check the basic merging feature of KSM.
First, some number of identical pages are allocated and the MADV_MERGEABLE
advice is given to merge these pages.  Then, pages_shared and
pages_sharing values are compared with the expected numbers using
assert_ksm_pages_count() function.  The number of pages can be changed
using -p option.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/90287685c13300972ea84de93d1f3f900373f9fe.1626252248.git.zhansayabagdaulet@gmail.com
Signed-off-by: Zhansaya Bagdauletkyzy <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: correct kernel-doc notation

Use the expected "Return:" format to prevent a kernel-doc warning.

mm/migrate.c:1157: warning: Excess function parameter 'returns' description in 'next_demotion_node'

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: wire up syscall process_mrelease

Split off from the previous patch in the series that implements the syscall.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: introduce process_mrelease system call

In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control.  One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.

For such a system component it's important to be able to free memory
quickly and efficiently.  Unfortunately the time a process takes to free
up its memory after receiving a SIGKILL might vary based on the state of
the process (uninterruptible sleep), its size and the OPP level of the
core the process is running on.  A mechanism to free resources of the
target process in a more predictable way would improve the system's
ability to control its memory pressure.

Introduce process_mrelease system call that releases memory of a dying
process from the context of the caller.  This way the memory is freed in a
more controllable way with CPU affinity and priority of the caller.  The
workload of freeing the memory will also be charged to the caller.  The
operation is allowed only on a dying process.

After previous discussions [1, 2, 3] the decision was made [4] to
introduce a dedicated system call to cover this use case.

The API is as follows:

          int process_mrelease(int pidfd, unsigned int flags);

        DESCRIPTION
          The process_mrelease() system call is used to free the memory of
          an exiting process.

          The pidfd selects the process referred to by the PID file
          descriptor.
          (See pidfd_open(2) for further information)

          The flags argument is reserved for future use; currently, this
          argument must be specified as 0.

        RETURN VALUE
          On success, process_mrelease() returns 0. On error, -1 is
          returned and errno is set to indicate the error.

        ERRORS
          EBADF  pidfd is not a valid PID file descriptor.

          EAGAIN Failed to release part of the address space.

          EINTR  The call was interrupted by a signal; see signal(7).

          EINVAL flags is not 0.

          EINVAL The memory of the task cannot be released because the
                 process is not exiting, the address space is shared
                 with another live process or there is a core dump in
                 progress.

          ENOSYS This system call is not supported, for example, without
                 MMU support built into Linux.

          ESRCH  The target process does not exist (i.e., it has terminated
                 and been waited on).
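
A minimal userspace sketch of how a killer daemon might use the call
described above: open a pidfd, send SIGKILL through it, then release the
victim's memory from the caller's context.  The helper name
kill_and_release() and the wrapper sys_process_mrelease() are hypothetical,
and the fallback syscall numbers are assumptions (the asm-generic/x86_64
values) for toolchains whose headers do not define them yet.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed fallback numbers (asm-generic/x86_64) for older headers. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/* Thin raw-syscall wrapper; the C library may not provide one. */
static int sys_process_mrelease(int pidfd, unsigned int flags)
{
	return (int)syscall(SYS_process_mrelease, pidfd, flags);
}

/*
 * Hypothetical helper: kill 'victim' and free its memory from the
 * caller's context.  Returns 0 on success, -1 with errno set otherwise.
 */
static int kill_and_release(pid_t victim)
{
	int pidfd = (int)syscall(SYS_pidfd_open, victim, 0);
	int ret, saved_errno;

	if (pidfd < 0)
		return -1;

	/* The target must be dying before its memory can be released. */
	ret = (int)syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
	if (ret == 0)
		ret = sys_process_mrelease(pidfd, 0); /* flags must be 0 */

	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

A caller would typically treat EAGAIN as "retry" and ESRCH as "already
gone"; on kernels without the syscall the wrapper fails with ENOSYS.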

[1] https://lore.kernel.org/lkml/20190411014353 [email protected]/
[2] https://lore.kernel.org/linux-api/20201113173448.1863419 [email protected]/
[3] https://lore.kernel.org/linux-api/20201124053943.1684874 [email protected]/
[4] https://lore.kernel.org/linux-api/20201223075712 [email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memblock: make memblock_find_in_range method private

There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.

memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.

Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range() and make
memblock_find_in_range() a private method of memblock.

This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]> [ACPI]
Acked-by: Russell King (Oracle) <[email protected]>
Acked-by: Nick Kossifidis <[email protected]> [riscv]
Tested-by: Guenter Roeck <[email protected]>
Acked-by: Rob Herring <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy.c: use in_task() in mempolicy_slab_node()

The obsolete in_interrupt() also covers task context with BH disabled;
it's better to use in_task() instead.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies

As they all do the same thing: sanity check and save nodemask info, create
one mpol_new_nodemask() to reduce redundancy.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Feng Tang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy: advertise new MPOL_PREFERRED_MANY

Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.

MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch.  Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode.  Those shall contain the canonical reference.

NUMA systems continue to become more prevalent.  New technologies like
PMEM make finer grain control over memory access patterns increasingly
desirable.  MPOL_PREFERRED_MANY allows userspace to specify a set of nodes
that will be tried first when performing allocations.  If those
allocations fail, all remaining nodes will be tried.  It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines.  The mode
will work either per VMA, or per thread.

[Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY]

Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

Implement the missing huge page allocation functionality while obeying the
preferred node semantics. This is similar to the implementation for
general page allocation, as it uses a fallback mechanism to try multiple
preferred nodes first, and then all other nodes.

To avoid adding too many "#ifdef CONFIG_NUMA" checks, add a helper function
in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY.

[[email protected]: fix compiling issue when merging with other hugetlb patch]
[Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue]
[[email protected]: suggest to remove the #ifdef CONFIG_NUMA check]
[[email protected]: add helpers to avoid ifdefs]
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: initialize page to NULL in alloc_buddy_huge_page_with_mpol()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Co-developed-by: Feng Tang <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy

The semantics of MPOL_PREFERRED_MANY are similar to MPOL_PREFERRED: it
will first try to allocate memory from the preferred node(s), and fall
back to all nodes in the system when the first try fails.

Add a dedicated function alloc_pages_preferred_many() for it, just like
for the 'interleave' policy, which will be used by 2 general memory
allocation APIs: alloc_pages() and alloc_pages_vma().
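
The two-step behaviour described above can be modelled in userspace as
follows.  This is only an illustrative sketch, not the kernel's
alloc_pages_preferred_many(): the function names, the free_pages[] array
standing in for per-node free memory, and the 64-node limit are all
assumptions made for the example.

```c
#include <stdint.h>

#define NO_NODE (-1)

/* Return the lowest node in 'mask' that still has free pages. */
static int first_node_with_free(uint64_t mask,
				const unsigned long *free_pages,
				int nr_nodes)
{
	for (int nid = 0; nid < nr_nodes && nid < 64; nid++) {
		if ((mask & (UINT64_C(1) << nid)) && free_pages[nid] > 0)
			return nid;
	}
	return NO_NODE;
}

/*
 * Hypothetical model of the policy: try the preferred nodes first,
 * then fall back to every node in the system.
 */
static int pick_node_preferred_many(uint64_t preferred,
				    const unsigned long *free_pages,
				    int nr_nodes)
{
	int nid = first_node_with_free(preferred, free_pages, nr_nodes);

	if (nid == NO_NODE)
		nid = first_node_with_free(~UINT64_C(0), free_pages, nr_nodes);
	return nid;
}
```

With nodes 0 and 1 preferred but exhausted, the model falls back to the
first other node that has memory instead of failing, which is the
difference from MPOL_BIND that the series is after.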

Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Michal Hocko <[email protected]>
Originally-by: Ben Widawsky <[email protected]>
Co-developed-by: Ben Widawsky <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes

Patch series "Introduce multi-preference mempolicy", v7.

This patch series introduces the concept of the MPOL_PREFERRED_MANY
mempolicy.  This mempolicy mode can be used with either the
set_mempolicy(2) or mbind(2) interfaces.  Like the MPOL_PREFERRED
interface, it allows an application to set a preference for nodes which
will fulfil memory allocation requests.  Unlike the MPOL_PREFERRED mode,
it takes a set of nodes.  Like the MPOL_BIND interface, it works over a
set of nodes.  Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the
OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo, and
memhog.  They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new
usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.

1a. The Hare - The interconnect is fast enough to meet bandwidth and
    latency requirements allowing preference to be given to all nodes with
    "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast
    memory (or perhaps slow memory), but doesn't care which node it runs
    on.  The application can prefer a set of nodes and then xpu bind to
    the local node (cpu, accelerator, etc).  This reverses how nodes are
    chosen today, where the kernel attempts to use memory local to the CPU
    whenever possible.  This will instead attempt to use the accelerator
    local to the memory.
2.  The Tortoise - The administrator (or the application itself) is
    aware it only needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes
if binding fails to all nodes in the nodemask.

Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:

1. Ordered list of nodes.  Currently it's believed that the added
   complexity is not needed for expected usecases.
2. A flag for bind to allow falling back to other nodes.  This
   confuses the notion of binding and is less flexible than the current
   solution.
3. Create flags or new modes that help with some ordering.  This
   offers both a friendlier API as well as a solution for more customized
   usage.  It's unknown if it's worth the complexity to support this.
   Here is sample code for how this might work:

> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>

This patch (of 5):

The NUMA APIs currently allow passing in a "preferred node" as a single
bit set in a nodemask.  If more than one bit is set, bits after the first
are ignored.

This single node is generally OK for location-based NUMA where memory
being allocated will eventually be operated on by a single CPU.  However,
in systems with multiple memory types, folks want to target a *type* of
memory instead of a location.  For instance, someone might want some
high-bandwidth memory but not care about the CPU next to which it is
allocated.  Or, they want a cheap, high capacity allocation and want to
target all NUMA nodes which have persistent memory in volatile mode.  In
both of these cases, the application wants to target a *set* of nodes, but
does not want strict MPOL_BIND behavior as that could lead to OOM killer
or SIGSEGV.

So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes
requirement.  This is not a pie-in-the-sky dream for an API.  This was a
response to a specific ask of more than one group at Intel.  Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind.  These are known to suffer from
   SIGSEGV's when memory is low on targeted memory "kinds" that span more
   than one node.  The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an
   example of this.

2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive and
   fast" (DRAM).  However, they do not want to experience allocation
   failures when the targeted type is unavailable.

3. Allocate-then-run.  Generally, we let the process scheduler decide
   on which physical CPU to run a task.  That location provides a default
   allocation policy, and memory availability is not generally considered
   when placing tasks.  For situations where memory is valuable and
   constrained, some users want to allocate memory first, *then* allocate
   close compute resources to the allocation.  This is the reverse of the
   normal (CPU) model.  Accelerators such as GPUs that operate on
   core-mm-managed memory are interested in this model.

A check is added in sanitize_mpol_flags() to not permit the 'prefer_many'
policy to be used for now; it will be removed in a later patch once all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.

[[email protected]: suggest to refine policy_node/policy_nodemask handling]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Ben Widawsky <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Cc: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy: use readable NUMA_NO_NODE macro instead of magic number

The caller of mpol_misplaced() already uses NUMA_NO_NODE to check whether
the current page node is misplaced, thus using NUMA_NO_NODE in
mpol_misplaced() instead of a magic number is more readable.

Link: https://lkml.kernel.org/r/1b77c0ce21183fa86f4db250b115cf5e27396528.1627558356.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: compaction: support triggering of proactive compaction by user

Proactive compaction[1] is triggered every 500msec and runs compaction
on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages based on
the value set in sysctl.compaction_proactiveness.  Triggering compaction
every 500msec in search of COMPACTION_HPAGE_ORDER pages is not needed for
all applications, especially in embedded system use cases which may have
only a few MBs of RAM.  Enabling proactive compaction in its current state
will end up running it almost always on such systems.

On the other hand, proactive compaction can still be very useful for
getting a set of higher order pages in a controllable manner (controlled
via sysctl.compaction_proactiveness).  So, on systems where always
enabling proactive compaction may prove unnecessary, it can instead be
triggered from user space by a write to its sysctl interface.  As an
example, say an app launcher decides to launch a memory-heavy application
which can be launched faster if it gets more higher order pages; the
launcher can then prepare the system in advance by triggering proactive
compaction from userspace.

This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by the user.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a

[[email protected]: tweak vm.rst, per Mike]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Charan Teja Reddy <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vinayak Menon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: compaction: optimize proactive compaction deferrals

Vlastimil Babka figured out that when the fragmentation score doesn't go
down across a proactive compaction run, i.e. when no progress is made,
the next wakeup for proactive compaction is deferred for
1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 times, with each wakeup interval
being HPAGE_FRAG_CHECK_INTERVAL_MSEC (=500).  On each of these wakeups,
it just decrements the 'proactive_defer' counter and goes back to sleep,
i.e. it is woken up just to decrement a counter.

The same deferral time can also be achieved by simply sleeping for
HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT, which avoids
the unnecessary wakeups of the kcompactd thread and also removes the need
for the 'proactive_defer' counter.
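
The arithmetic behind this can be checked with a small userspace sketch.
The two constants take the values implied by the text (a 500 msec interval
and a shift of 6, since 1 << 6 == 64); the function names are hypothetical
and only illustrate the timeout selection, not the kcompactd code itself.

```c
#define HPAGE_FRAG_CHECK_INTERVAL_MSEC	500U
#define COMPACT_MAX_DEFER_SHIFT		6

/* Old scheme: number of wakeups spent only decrementing a counter. */
static unsigned int old_scheme_wakeups(void)
{
	return 1U << COMPACT_MAX_DEFER_SHIFT;
}

/*
 * New scheme: pick one sleep interval.  If the last run made progress,
 * check again after the normal interval; otherwise sleep through the
 * whole deferral period in a single timeout.
 */
static unsigned int proactive_timeout_msec(int made_progress)
{
	if (made_progress)
		return HPAGE_FRAG_CHECK_INTERVAL_MSEC;
	return HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT;
}
```

Both schemes defer for 32000 msec in total; the new one reaches that with
one wakeup instead of 64.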

[[email protected]: tweak comment]

Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Charan Teja Reddy <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Vinayak Menon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, vmscan: guarantee drop_slab_node() termination

drop_slab_node() is called as part of echo 2 > /proc/sys/vm/drop_caches
operation.  It iterates over all memcgs and calls shrink_slab() which in
turn iterates over all slab shrinkers.  Freed objects are counted and as
long as the total number of freed objects from all memcgs and shrinkers is
higher than 10, drop_slab_node() loops for another full memcgs*shrinkers
iteration.

This arbitrary constant threshold of 10 can result in effectively an
infinite loop on a system with large number of memcgs and/or parallel
activity that allocates new objects.  This has been reported previously by
Chunxin Zang [1] and recently by our customer.

The previous report [1] has resulted in commit 069c411de40a ("mm/vmscan:
fix infinite loop in drop_slab_node") which added a check for signals
allowing the user to terminate the command writing to drop_caches.  At the
time it was also considered to make the threshold grow with each iteration
to guarantee termination, but such patch hasn't been formally proposed
yet.

This patch implements the dynamically growing threshold.  At first
iteration it's enough to free one object to continue, and this threshold
effectively doubles with each iteration.  Our customer's feedback was
positive.
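
A userspace sketch of why a doubling threshold guarantees termination.
The callback shrink_pass() is a hypothetical stand-in for one full
memcgs*shrinkers iteration, and the exact loop condition shown here is an
assumption for illustration rather than a copy of the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for one full pass over all memcgs and shrinkers;
 * returns the number of objects freed in that pass. */
typedef uint64_t (*shrink_pass_fn)(void *ctx);

/*
 * Keep looping while the freed count, scaled down by a shift that grows
 * with every pass, stays above 1.  Because the shift grows each time, a
 * 64-bit count can sustain at most 64 passes.
 */
static unsigned int drop_slab_passes(shrink_pass_fn shrink_pass, void *ctx)
{
	uint64_t freed;
	unsigned int shift = 0, passes = 0;

	do {
		freed = shrink_pass(ctx);
		passes++;
	} while ((freed >> shift++) > 1);

	return passes;
}

/* Example workload: every pass frees the same number of objects, as if
 * parallel activity kept refilling the caches. */
static uint64_t constant_pass(void *ctx)
{
	return *(const uint64_t *)ctx;
}
```

With a fixed threshold of 10, a workload that keeps freeing 1000 objects
per pass would never stop; in this model it stops after 10 passes.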

There is always a risk that on some systems this change will cause a
previously terminating drop_caches operation to terminate sooner and free
fewer objects.  Ideally the semantics would guarantee freeing all freeable
objects that existed at the moment of starting the operation, while not
looping forever for newly allocated objects, but that's not feasible to
track.  In the less ideal solution based on thresholds, arguably the
termination guarantee is more important than the exhaustiveness guarantee.
If there are reports of large regression wrt being exhaustive, we can
tune how fast the threshold grows.

[1] https://lore.kernel.org/lkml/20200909152047 [email protected]/T/#u

[[email protected]: avoid undefined shift behaviour]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Chunxin Zang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: add 'else' to remove check_pending label

We could add 'else' to remove the somewhat odd check_pending label to make
the code more succinct.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove unneeded return value of kswapd_run()

The return value of kswapd_run() is unused now. Clean it up.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove misleading setting to sc->priority

The priority field of sc is used to control how many pages we should scan
at once while we always traverse the list to shrink the pages in these
functions. So these settings are unneeded and misleading.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove the PageDirty check after MADV_FREE pages are page_ref_freezed

Patch series "Cleanups for vmscan", v2.

This series contains cleanups to remove an unneeded return value, a
misleading setting and so on.  It also removes the PageDirty check after
MADV_FREE pages are page_ref_freezed.  More details can be found in the
respective changelogs.

This patch (of 4):

If the MADV_FREE pages are redirtied before they could be reclaimed, put
the pages back to anonymous LRU list by setting SwapBacked flag and the
pages will be reclaimed in normal swapout way.  But as Yu Zhao pointed
out, "The page has only one reference left, which is from the isolation.
After the caller puts the page back on lru and drops the reference, the
page will be freed anyway.  It doesn't matter which lru it goes." So we
don't bother checking PageDirty here.

[Yu Zhao's comment is also quoted in the code.]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmpressure: replace vmpressure_to_css() with vmpressure_to_memcg()

We can get memcg directly from vmpr instead of vmpr->memcg->css->memcg, so
add a new helper function vmpressure_to_memcg().  No code uses
vmpressure_to_css() any more, so delete it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hui Su <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Chris Down <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: add sysfs interface to enable reclaim migration

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that will
benefit, such as workloads with more "static" configurations where hot
pages stay hot and cold pages stay cold.  If pages come and go from the hot
and cold sets, the benefits of this approach will be more limited.

The benefits are truly workload-based and *not* hardware-based.  We do not
believe that there is a viable threshold where certain hardware
configurations should have this mechanism enabled while others do not.

To be conservative, earlier work defaulted to disabling reclaim-based
migration and did not include a mechanism to enable it.  This proposes
adding a new sysfs file

  /sys/kernel/mm/numa/demotion_enabled

as a method to enable it.
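
A small sketch of how an administrator tool might flip the knob from C.
The helper write_bool_knob() is hypothetical, and writing "1" or "0" is an
assumption about the accepted values; the path is the one named above.
Taking the path as a parameter keeps the helper usable on any boolean
sysfs file.

```c
#include <stdio.h>

#define DEMOTION_ENABLED_PATH "/sys/kernel/mm/numa/demotion_enabled"

/*
 * Hypothetical helper: write "1" or "0" to a boolean knob.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_bool_knob(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	/* sysfs reports write errors at flush time, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Enabling demotion would then be write_bool_knob(DEMOTION_ENABLED_PATH, 1),
which needs root and a kernel that exposes the file.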

We are open to any alternative that allows end users to enable this
mechanism or disable it if workload harm is detected (just like
traditional autonuma).

Once this is enabled page demotion may move data to a NUMA node that does
not fall into the cpuset of the allocating process.  This could be
construed to violate the guarantees of cpusets.  However, since this is an
opt-in mechanism, the assumption is that anyone enabling it is content to
relax the guarantees.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Huang Ying <[email protected]>
Originally-by: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: never demote for memcg reclaim

Global reclaim aims to reduce the amount of memory used on a given node or
set of nodes.  Migrating pages to another node serves this purpose.

memcg reclaim is different.  Its goal is to reduce the total memory
consumption of the entire memcg, across all nodes.  Migration does not
assist memcg reclaim because it just moves page contents between nodes
rather than actually reducing memory consumption.
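
The decision described above can be modelled as a simple predicate.  This
is an illustrative userspace sketch only: struct reclaim_ctx and
may_demote() are hypothetical names, not the kernel's scan_control or its
helper, and the fields are assumptions chosen to mirror the text.

```c
#include <stdbool.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the reclaim context. */
struct reclaim_ctx {
	bool demotion_enabled;	/* the opt-in knob is switched on */
	bool memcg_reclaim;	/* reclaim was triggered by a memcg limit */
	int demotion_target;	/* next slower node, or NUMA_NO_NODE */
};

/*
 * Demotion only helps when the goal is to free memory on a node.  For
 * memcg reclaim the goal is to lower total usage, which moving pages
 * between nodes does not achieve.
 */
static bool may_demote(const struct reclaim_ctx *rc)
{
	if (!rc->demotion_enabled)
		return false;
	if (rc->memcg_reclaim)
		return false;
	return rc->demotion_target != NUMA_NO_NODE;
}
```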

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: Consider anonymous pages without swap

Reclaim anonymous pages if a migration path is available now that demotion
provides a non-swap recourse for reclaiming anon pages.

Note that this check is subtly different from the can_age_anon_pages()
checks. This mechanism checks whether a specific page in a specific
context can actually be reclaimed, given current swap space and cgroup
limits.

can_age_anon_pages() is a much simpler and more preliminary check which
just says whether there is a possibility of future reclaim.

[[email protected]: v11]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Cc: Keith Busch <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: add helper for querying ability to age anonymous pages

Anonymous pages are kept on their own LRU(s).  These lists could
theoretically always be scanned and maintained.  But, without swap, there
is currently nothing the kernel can *do* with the results of a scanned,
sorted LRU for anonymous pages.

A check for '!total_swap_pages' currently serves as a valid check as to
whether anonymous LRUs should be maintained.  However, another method will
be added shortly: page demotion.

Abstract out the 'total_swap_pages' checks into a helper, give it a
logically significant name, and check for the possibility of page
demotion.
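
As a rough userspace illustration of the helper's logic: the function
name comes from this series, but the signature below is an assumption
made for the sketch, since the in-kernel helper takes kernel-internal
arguments.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model: aging the anonymous LRU is worthwhile only if scanned
 * pages have somewhere to go, either swap or (via demotion) another node.
 * Both parameters are assumptions of this sketch.
 */
bool can_age_anon_pages(long total_swap_pages, bool demotion_possible)
{
	assert(total_swap_pages >= 0);

	/* Swap space gives reclaim something to do with anonymous pages. */
	if (total_swap_pages > 0)
		return true;

	/* Without swap, demotion is the only remaining recourse. */
	return demotion_possible;
}
```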

[[email protected]: v11]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Greg Thelen <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: add page demotion counter

Account the number of demoted pages.

Add pgdemote_kswapd and pgdemote_direct VM counters, shown in
/proc/vmstat.

[ daveh:
- __count_vm_events() a bit, and made them look at the THP
size directly rather than getting data from migrate_pages()
]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Wei Xu <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: demote pages during reclaim

This is mostly derived from a patch from Yang Shi:

https://lore.kernel.org/linux-mm/1560468577 [email protected]/

Add code to the reclaim path (shrink_page_list()) to "demote" data to
another NUMA node instead of discarding the data.  This always avoids the
cost of I/O needed to read the page back in and sometimes avoids the
writeout cost when the page is dirty.

A second pass through shrink_page_list() will be made if any demotions
fail.  This essentially falls back to normal reclaim behavior in the case
that demotions fail.  Previous versions of this patch may have simply
failed to reclaim pages which were eligible for demotion but were unable
to be demoted in practice.

In some cases, for example MADV_PAGEOUT, the pages are always discarded
instead of demoted in order to follow the kernel API definition:
MADV_PAGEOUT is defined as freeing the specified pages regardless of which
tier they are in.

Note: This just adds the start of infrastructure for migration.  It is
actually disabled next to the FIXME in migrate_demote_page_ok().

[[email protected]: v11]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Wei Xu <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: enable returning precise migrate_pages() success count

Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.

Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.
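
The optional-parameter pattern can be sketched in userspace as follows.
migrate_pages_model() and its return convention (0 or -EAGAIN) are
hypothetical simplifications; only the idea of an optional out-parameter
that is filled in on both success and error is taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model: "migrate" n pages, where ok[i] says whether page i
 * can be moved.  Returns 0 if every page moved and -EAGAIN otherwise.
 * If ret_succeeded is non-NULL, it receives the number of pages moved,
 * in the success case and in the error case alike.
 */
int migrate_pages_model(const int *ok, size_t n, unsigned int *ret_succeeded)
{
	unsigned int succeeded = 0;
	int rc = 0;
	size_t i;

	assert(ok != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (ok[i])
			succeeded++;
		else
			rc = -EAGAIN;	/* remember the failure, keep going */
	}

	/* Callers that do not care about the count may pass NULL. */
	if (ret_succeeded)
		*ret_succeeded = succeeded;

	return rc;
}
```

Keeping the count in a separate, optional argument leaves existing
callers untouched while letting reclaim learn how many pages actually
moved.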

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Oscar Salvador <[email protected]> [optional parameter]
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: update node demotion order on hotplug events

Reclaim-based migration is attempting to optimize data placement in memory
based on the system topology.  If the system changes, so must the
migration ordering.

The implementation is conceptually simple and entirely unoptimized.  On
any memory or CPU hotplug events, assume that a node was added or removed
and recalculate all migration targets.  This ensures that the
node_demotion[] array is always ready to be used in case the new reclaim
mode is enabled.

This recalculation is far from optimal, most glaringly in that it does not
even attempt to figure out whether the hotplug event would have some
*actual* effect on the demotion order.  But, given the expected paucity of
hotplug events, this should be fine.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/numa: automatically generate node migration order

Patch series "Migrate Pages in lieu of discard", v11.

We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end up in
the slower persistent memory.  Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.

This patchset implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.

While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never brings
them back to fast DRAM.  Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sitting in the DRAM node.  So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.

We have tested the patchset with postgresql and pgbench.  On a 2-socket
server machine with DRAM and PMEM, the kernel with the patchset improves
the pgbench score by up to 22.1% compared with the DRAM only + disk case.
This comes from the reduced disk read throughput (which drops by up to
70.8%).

== Open Issues ==

* Pages covered by memory policies and cpusets that, for instance,
   restrict allocations to DRAM can be demoted to PMEM whenever this
   new mechanism is opted in to.  A cgroup-level API to opt-in or
   opt-out of these migrations will likely be required as a follow-on.
* Could be more aggressive about where anon LRU scanning occurs
   since it no longer necessarily involves I/O.  get_scan_count()
   for instance says: "If we have no swap space, do not bother
   scanning anon pages"

This patch (of 9):

Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table.  This allows creating a single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages.  A node with no target is a "terminal
node", so reclaim acts normally there.  The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.

When memory fills up on a node, memory contents can be automatically
migrated to another node.  The biggest problems are knowing when to
migrate and to where the migration should be targeted.

The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists.  Those lists already tell us
where to look next if memory is full.  It would also be logical to move
memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists.  This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a separate
node migration ordering.  But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().
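
The idea can be illustrated with a small userspace model.  Everything
below is a sketch under assumptions: next_best_node() merely stands in
for find_next_best_node() by picking the nearest unused node from a
distance matrix, and build_demotion_order(), MAX_NODES and the has_cpu[]
input are hypothetical names.  Starting from the nodes with CPUs and
never reusing a node as a target is what keeps the result free of
cycles.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 8
#define NUMA_NO_NODE (-1)

/* Nearest node to 'from' that is not yet used, or NUMA_NO_NODE. */
static int next_best_node(int from, int nr, const int dist[][MAX_NODES],
			  const bool used[])
{
	int best = NUMA_NO_NODE, best_dist = INT_MAX, n;

	for (n = 0; n < nr; n++) {
		if (n == from || used[n])
			continue;
		if (dist[from][n] < best_dist) {
			best = n;
			best_dist = dist[from][n];
		}
	}
	return best;
}

/* Fill node_demotion[] with one target per node, pass by pass. */
void build_demotion_order(int nr, const int dist[][MAX_NODES],
			  const bool has_cpu[], int node_demotion[])
{
	bool used[MAX_NODES] = { false };
	bool this_pass[MAX_NODES], next_pass[MAX_NODES];
	bool progress = true;
	int n;

	assert(nr > 0 && nr <= MAX_NODES);

	/* The first pass starts from the nodes that have CPUs. */
	for (n = 0; n < nr; n++) {
		node_demotion[n] = NUMA_NO_NODE;
		this_pass[n] = has_cpu[n];
	}

	while (progress) {
		progress = false;
		/* A node that demotes in this pass can never become a target. */
		for (n = 0; n < nr; n++) {
			if (this_pass[n])
				used[n] = true;
			next_pass[n] = false;
		}
		for (n = 0; n < nr; n++) {
			int target;

			if (!this_pass[n])
				continue;
			target = next_best_node(n, nr, dist, used);
			if (target == NUMA_NO_NODE)
				continue;	/* terminal node */
			node_demotion[n] = target;
			used[target] = true;	/* one source per target */
			next_pass[target] = true;
			progress = true;
		}
		/* Targets chosen in this pass become sources in the next. */
		for (n = 0; n < nr; n++)
			this_pass[n] = next_pass[n];
	}
}
```

For a 2-socket system with DRAM nodes 0 and 1 (with CPUs) and PMEM nodes
2 and 3, where each PMEM node is closest to its own socket, this model
demotes 0->2 and 1->3 and leaves nodes 2 and 3 as terminal nodes.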

RCU is used to allow lock-less reads of node_demotion[] and to prevent
demotion cycles from being observed.  If multiple reads of node_demotion[] are
performed, a single rcu_read_lock() must be held over all reads to ensure
no cycles are observed.  Details are as follows.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

        terminal_node = start_node;
        while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                terminal_node = node_demotion[terminal_node];
        }

The initial values are:

        node_demotion[0] = 1;
        node_demotion[1] = NUMA_NO_NODE;

and are updated to:

        node_demotion[0] = NUMA_NO_NODE;
        node_demotion[1] = 0;

What guarantees that the cycle is not observed:

        node_demotion[0] = 1;
        node_demotion[1] = 0;

and would loop forever?

With RCU, an rcu_read_lock()/rcu_read_unlock() pair can be placed around
the loop.  Since the write side does a synchronize_rcu(), the loop that
observed the old contents is known to be complete before the
synchronize_rcu() has completed.

RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[].  This ensures that any updates are *eventually*
observed.

Consider the above loop again.  The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].
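
A userspace sketch of the loop with such an annotation is shown below.
The READ_ONCE() here is a simplified stand-in built on a volatile access
and the GCC/Clang __typeof__ extension, not the kernel's definition, and
find_terminal_node() and MAX_NODES are hypothetical names.  The RCU
read-side critical section that the kernel also needs is deliberately
left out.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4

/* Simplified stand-in: the volatile access forces a real load each time. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Initial state from the example above: node 0 demotes to node 1. */
int node_demotion[MAX_NODES] = {
	1, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE
};

/* Walk the demotion path from start_node and return the last node. */
int find_terminal_node(int start_node)
{
	int terminal_node = start_node;
	int next;

	assert(start_node >= 0 && start_node < MAX_NODES);

	/* Each iteration loads the current entry from memory exactly once. */
	while ((next = READ_ONCE(node_demotion[terminal_node])) != NUMA_NO_NODE)
		terminal_node = next;

	return terminal_node;
}
```

Reading each entry once into a local also means the loop condition and
the assignment cannot see two different values for the same entry.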

Note: RCU does not provide any universal compiler-ordering
guarantees:

https://lore.kernel.org/lkml/20150921204327 [email protected]/

This code is unused for now.  It will be called later in the
series.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>