Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel...
author Linus Torvalds <[email protected]>
Fri, 3 Nov 2023 05:38:47 +0000 (19:38 -1000)
committer Linus Torvalds <[email protected]>
Fri, 3 Nov 2023 05:38:47 +0000 (19:38 -1000)
Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compaction maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park in
     the following patch series:

        mm/damon: misc fixups for documents, comments and its tracepoint
        mm/damon: add a tracepoint for damos apply target regions
        mm/damon: provide pseudo-moving sum based access rate
        mm/damon: implement DAMOS apply intervals
        mm/damon/core-test: Fix memory leaks in core-test
        mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature, increasing the feature's checking coverage:
     'Plug a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - The series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' improves our handling of gigantic
     pages in the hugetlb vmemmap optimization code. This provides
     significant boot time improvements when large numbers of gigantic
     pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance' (see the usage sketch after this list)

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes it possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In the series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"
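
A minimal usage sketch of the new MDWE behaviour mentioned above (the
PR_MDWE_NO_INHERIT flag name and value are assumed from the 'MDWE without
inheritance' series; prefer the definitions from <linux/prctl.h> on a new
enough kernel):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* Assumed constants; take them from <linux/prctl.h> when available. */
    #ifndef PR_SET_MDWE
    #define PR_SET_MDWE              65
    #define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
    #endif
    #ifndef PR_MDWE_NO_INHERIT
    #define PR_MDWE_NO_INHERIT       (1UL << 1)
    #endif

    int main(void)
    {
            /* Deny write+execute gains in this process only; the denial
             * is not propagated to children forked afterwards. */
            if (prctl(PR_SET_MDWE,
                      PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT,
                      0, 0, 0))
                    perror("PR_SET_MDWE");
            return 0;
    }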

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...

69 files changed:
Documentation/admin-guide/cgroup-v2.rst
MAINTAINERS
arch/arm64/kernel/mte.c
arch/x86/include/asm/bitops.h
arch/x86/kvm/mmu/mmu.c
drivers/acpi/acpi_pad.c
drivers/firmware/efi/unaccepted_memory.c
drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
drivers/gpu/drm/i915/i915_drv.h
drivers/gpu/drm/msm/msm_drv.c
drivers/gpu/drm/msm/msm_drv.h
drivers/gpu/drm/panfrost/panfrost_device.h
drivers/gpu/drm/panfrost/panfrost_drv.c
drivers/gpu/drm/panfrost/panfrost_gem.h
drivers/md/bcache/bcache.h
drivers/md/dm-cache-metadata.c
drivers/md/raid5.c
drivers/virtio/virtio_balloon.c
fs/bcachefs/btree_cache.c
fs/bcachefs/btree_key_cache.c
fs/bcachefs/btree_types.h
fs/bcachefs/fs.c
fs/bcachefs/sysfs.c
fs/btrfs/super.c
fs/erofs/utils.c
fs/ext4/ext4.h
fs/ext4/extents_status.c
fs/ext4/inode.c
fs/ext4/super.c
fs/f2fs/super.c
fs/gfs2/bmap.c
fs/gfs2/glock.c
fs/gfs2/quota.c
fs/hugetlbfs/inode.c
fs/iomap/buffered-io.c
fs/nfs/super.c
fs/nfsd/filecache.c
fs/nfsd/nfs4state.c
fs/ntfs3/file.c
fs/ocfs2/aops.c
fs/proc/task_mmu.c
fs/quota/dquot.c
fs/reiserfs/inode.c
fs/super.c
fs/ubifs/super.c
fs/ufs/inode.c
fs/xfs/xfs_buf.c
fs/xfs/xfs_buf.h
include/linux/cgroup-defs.h
include/linux/fs.h
include/linux/mm.h
include/linux/mm_types.h
include/linux/sched.h
include/linux/sched/numa_balancing.h
kernel/cgroup/cgroup.c
kernel/exit.c
kernel/fork.c
kernel/rcu/tree.c
kernel/sched/fair.c
mm/mempolicy.c
mm/mmap.c
mm/nommu.c
mm/percpu.c
mm/shmem.c
mm/util.c
net/sunrpc/auth.c
tools/testing/selftests/clone3/clone3.c
tools/testing/selftests/damon/sysfs.sh
tools/testing/selftests/mm/mremap_test.c

diff --combined Documentation/admin-guide/cgroup-v2.rst
index 3f081459a5be89c1871cf01c6ec1d5bd92e1fcc2,606b2e0eac4b17332fdb0480fd6c9b652379e1da..3f85254f3cef2c22012e08ed61061795c5f8c99a
@@@ -210,6 -210,35 +210,35 @@@ cgroup v2 currently supports the follow
          relying on the original semantics (e.g. specifying bogusly
          high 'bypass' protection values at higher tree levels).
  
+   memory_hugetlb_accounting
+         Count HugeTLB memory usage towards the cgroup's overall
+         memory usage for the memory controller (for the purpose of
+         statistics reporting and memory protection). This is a new
+         behavior that could regress existing setups, so it must be
+         explicitly opted in with this mount option.
+         A few caveats to keep in mind:
+         * There is no HugeTLB pool management involved in the memory
+           controller. The pre-allocated pool does not belong to anyone.
+           Specifically, when a new HugeTLB folio is allocated to
+           the pool, it is not accounted for from the perspective of the
+           memory controller. It is only charged to a cgroup when it is
+           actually used (e.g. at page fault time). Host memory
+           overcommit management has to consider this when configuring
+           hard limits. In general, HugeTLB pool management should be
+           done via other mechanisms (such as the HugeTLB controller).
+         * Failure to charge a HugeTLB folio to the memory controller
+           results in SIGBUS. This could happen even if the HugeTLB pool
+           still has pages available (but the cgroup limit is hit and
+           reclaim attempt fails).
+         * Charging HugeTLB memory towards the memory controller affects
+           memory protection and reclaim dynamics. Any userspace tuning
+           (of low, min limits, for example) needs to take this into account.
+         * HugeTLB pages utilized while this option is not selected
+           will not be tracked by the memory controller (even if cgroup
+           v2 is remounted later on).
  
  Organizing Processes and Threads
  --------------------------------
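
The memory_hugetlb_accounting behaviour documented in the hunk above is
opt-in at mount time. A hedged sketch of enabling it (the mount point and
source name are illustrative assumptions)::

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* Option string as documented above; a fresh cgroup2 mount is
             * assumed here, an existing one would need MS_REMOUNT. */
            if (mount("cgroup2", "/sys/fs/cgroup", "cgroup2", 0,
                      "memory_hugetlb_accounting")) {
                    perror("mount cgroup2");
                    return 1;
            }
            return 0;
    }
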
@@@ -364,13 -393,6 +393,13 @@@ constraint, a threaded controller must 
  between threads in a non-leaf cgroup and its child cgroups.  Each
  threaded controller defines how such competitions are handled.
  
 +Currently, the following controllers are threaded and can be enabled
 +in a threaded cgroup::
 +
 +- cpu
 +- cpuset
 +- perf_event
 +- pids
  
  [Un]populated Notification
  --------------------------
@@@ -1539,6 -1561,15 +1568,15 @@@ PAGE_SIZE multiple when read back
                collapsing an existing range of pages. This counter is not
                present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
  
+         thp_swpout (npn)
+               Number of transparent hugepages which were swapped out in one
+               piece without splitting.
+         thp_swpout_fallback (npn)
+               Number of transparent hugepages which were split before swapout.
+               This usually happens because contiguous swap space could not
+               be allocated for the huge page.
    memory.numa_stat
        A read-only nested-keyed file which exists on non-root cgroups.
  
@@@ -2030,7 -2061,7 +2068,7 @@@ IO Priorit
  ~~~~~~~~~~~
  
  A single attribute controls the behavior of the I/O priority cgroup policy,
 -namely the blkio.prio.class attribute. The following values are accepted for
 +namely the io.prio.class attribute. The following values are accepted for
  that attribute:
  
    no-change
@@@ -2059,11 -2090,9 +2097,11 @@@ The following numerical values are asso
  +----------------+---+
  | no-change      | 0 |
  +----------------+---+
 -| rt-to-be       | 2 |
 +| promote-to-rt  | 1 |
 ++----------------+---+
 +| restrict-to-be | 2 |
  +----------------+---+
 -| all-to-idle    | 3 |
 +| idle           | 3 |
  +----------------+---+
  
  The numerical value that corresponds to each I/O priority class is as follows:
@@@ -2083,7 -2112,7 +2121,7 @@@ The algorithm to set the I/O priority c
  - If I/O priority class policy is promote-to-rt, change the request I/O
    priority class to IOPRIO_CLASS_RT and change the request I/O priority
    level to 4.
 -- If I/O priorityt class is not promote-to-rt, translate the I/O priority
 +- If I/O priority class policy is not promote-to-rt, translate the I/O priority
    class policy into a number, then change the request I/O priority class
    into the maximum of the I/O priority class policy number and the numerical
    I/O priority class.
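
The selection logic above can be restated as a small C sketch (not the
kernel's code; policy and class numbers follow the tables in this file)::

    /* io.prio.class policy values, per the table above. */
    enum io_prio_policy { NO_CHANGE = 0, PROMOTE_TO_RT = 1,
                          RESTRICT_TO_BE = 2, IDLE = 3 };

    /* I/O priority classes numerically: NONE=0, RT=1, BE=2, IDLE=3. */
    static int apply_io_policy(enum io_prio_policy policy,
                               int req_class, int *req_level)
    {
            if (policy == PROMOTE_TO_RT) {
                    *req_level = 4;          /* level forced to 4 */
                    return 1;                /* IOPRIO_CLASS_RT */
            }
            /* Take the more restrictive (numerically larger) class;
             * NO_CHANGE (0) therefore leaves the request untouched. */
            return (int)policy > req_class ? (int)policy : req_class;
    }
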
@@@ -2235,49 -2264,6 +2273,49 @@@ Cpuset Interface File
  
        Its value will be affected by memory nodes hotplug events.
  
 +  cpuset.cpus.exclusive
 +      A read-write multiple values file which exists on non-root
 +      cpuset-enabled cgroups.
 +
 +      It lists all the exclusive CPUs that are allowed to be used
 +      to create a new cpuset partition.  Its value is not used
 +      unless the cgroup becomes a valid partition root.  See the
 +      "cpuset.cpus.partition" section below for a description of what
 +      a cpuset partition is.
 +
 +      When the cgroup becomes a partition root, the actual exclusive
 +      CPUs that are allocated to that partition are listed in
 +      "cpuset.cpus.exclusive.effective" which may be different
 +      from "cpuset.cpus.exclusive".  If "cpuset.cpus.exclusive"
 +      has previously been set, "cpuset.cpus.exclusive.effective"
 +      is always a subset of it.
 +
 +      Users can manually set it to a value that is different from
 +      "cpuset.cpus".  The only constraint in setting it is that the
 +      list of CPUs must be exclusive with respect to its sibling.
 +
 +      For a parent cgroup, any one of its exclusive CPUs can only
 +      be distributed to at most one of its child cgroups.  Having an
 +      exclusive CPU appearing in two or more of its child cgroups is
 +      not allowed (the exclusivity rule).  A value that violates the
 +      exclusivity rule will be rejected with a write error.
 +
 +      The root cgroup is a partition root and all its available CPUs
 +      are in its exclusive CPU set.
 +
 +  cpuset.cpus.exclusive.effective
 +      A read-only multiple values file which exists on all non-root
 +      cpuset-enabled cgroups.
 +
 +      This file shows the effective set of exclusive CPUs that
 +      can be used to create a partition root.  The content of this
 +      file will always be a subset of "cpuset.cpus" and its parent's
 +      "cpuset.cpus.exclusive.effective" if its parent is not the root
 +      cgroup.  It will also be a subset of "cpuset.cpus.exclusive"
 +      if it is set.  If "cpuset.cpus.exclusive" is not set, it is
 +      treated to have an implicit value of "cpuset.cpus" in the
 +      formation of local partition.
 +
    cpuset.cpus.partition
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups.  This flag is owned by the parent cgroup
          "isolated"    Partition root without load balancing
          ==========    =====================================
  
 -      The root cgroup is always a partition root and its state
 -      cannot be changed.  All other non-root cgroups start out as
 -      "member".
 +      A cpuset partition is a collection of cpuset-enabled cgroups with
 +      a partition root at the top of the hierarchy and its descendants
 +      except those that are separate partition roots themselves and
 +      their descendants.  A partition has exclusive access to the
 +      set of exclusive CPUs allocated to it.  Other cgroups outside
 +      of that partition cannot use any CPUs in that set.
 +
 +      There are two types of partitions - local and remote.  A local
 +      partition is one whose parent cgroup is also a valid partition
 +      root.  A remote partition is one whose parent cgroup is not a
 +      valid partition root itself.  Writing to "cpuset.cpus.exclusive"
 +      is optional for the creation of a local partition as its
 +      "cpuset.cpus.exclusive" file will assume an implicit value that
 +      is the same as "cpuset.cpus" if it is not set.  Writing the
 +      proper "cpuset.cpus.exclusive" values down the cgroup hierarchy
 +      before the target partition root is mandatory for the creation
 +      of a remote partition.
 +
 +      Currently, a remote partition cannot be created under a local
 +      partition.  All the ancestors of a remote partition root except
 +      the root cgroup cannot be a partition root.
 +
 +      The root cgroup is always a partition root and its state cannot
 +      be changed.  All other non-root cgroups start out as "member".
  
        When set to "root", the current cgroup is the root of a new
 -      partition or scheduling domain that comprises itself and all
 -      its descendants except those that are separate partition roots
 -      themselves and their descendants.
 +      partition or scheduling domain.  The set of exclusive CPUs is
 +      determined by the value of its "cpuset.cpus.exclusive.effective".
  
 -      When set to "isolated", the CPUs in that partition root will
 +      When set to "isolated", the CPUs in that partition will
        be in an isolated state without any load balancing from the
        scheduler.  Tasks placed in such a partition with multiple
        CPUs should be carefully distributed and bound to each of the
        individual CPUs for optimal performance.
  
 -      The value shown in "cpuset.cpus.effective" of a partition root
 -      is the CPUs that the partition root can dedicate to a potential
 -      new child partition root. The new child subtracts available
 -      CPUs from its parent "cpuset.cpus.effective".
 -
        A partition root ("root" or "isolated") can be in one of the
        two possible states - valid or invalid.  An invalid partition
        root is in a degraded state where some state information may
        In the case of an invalid partition root, a descriptive string on
        why the partition is invalid is included within parentheses.
  
 -      For a partition root to become valid, the following conditions
 +      For a local partition root to be valid, the following conditions
        must be met.
  
 -      1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
 -         are not shared by any of its siblings (exclusivity rule).
 -      2) The parent cgroup is a valid partition root.
 -      3) The "cpuset.cpus" is not empty and must contain at least
 -         one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
 -      4) The "cpuset.cpus.effective" cannot be empty unless there is
 +      1) The parent cgroup is a valid partition root.
 +      2) The "cpuset.cpus.exclusive.effective" file cannot be empty,
 +         though it may contain offline CPUs.
 +      3) The "cpuset.cpus.effective" cannot be empty unless there is
           no task associated with this partition.
  
 -      External events like hotplug or changes to "cpuset.cpus" can
 -      cause a valid partition root to become invalid and vice versa.
 -      Note that a task cannot be moved to a cgroup with empty
 -      "cpuset.cpus.effective".
 +      For a remote partition root to be valid, all the above conditions
 +      except the first one must be met.
  
 -      For a valid partition root with the sibling cpu exclusivity
 -      rule enabled, changes made to "cpuset.cpus" that violate the
 -      exclusivity rule will invalidate the partition as well as its
 -      sibling partitions with conflicting cpuset.cpus values. So
 -      care must be taking in changing "cpuset.cpus".
 +      External events like hotplug or changes to "cpuset.cpus" or
 +      "cpuset.cpus.exclusive" can cause a valid partition root to
 +      become invalid and vice versa.  Note that a task cannot be
 +      moved to a cgroup with empty "cpuset.cpus.effective".
  
        A valid non-root parent partition may distribute out all its CPUs
 -      to its child partitions when there is no task associated with it.
 +      to its child local partitions when there is no task associated
 +      with it.
  
 -      Care must be taken to change a valid partition root to
 -      "member" as all its child partitions, if present, will become
 +      Care must be taken to change a valid partition root to "member"
 +      as all its child local partitions, if present, will become
        invalid causing disruption to tasks running in those child
        partitions. These inactivated partitions could be recovered if
        their parent is switched back to a partition root with a proper
 -      set of "cpuset.cpus".
 +      value in "cpuset.cpus" or "cpuset.cpus.exclusive".
  
        Poll and inotify events are triggered whenever the state of
        "cpuset.cpus.partition" changes.  That includes changes caused
        to "cpuset.cpus.partition" without the need to do continuous
        polling.
  
 +      A user can pre-configure certain CPUs to an isolated state
 +      with load balancing disabled at boot time with the "isolcpus"
 +      kernel boot command line option.  If those CPUs are to be put
 +      into a partition, they have to be used in an isolated partition.
 +
  
  Device controller
  -----------------
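
A hedged end-to-end sketch of creating a local partition with the cpuset
files documented above (the child-cgroup path and CPU list are illustrative
assumptions, and cpuset must already be enabled in the parent's
cgroup.subtree_control)::

    #include <stdio.h>

    /* Write a value to a cgroup control file; returns 0 on success. */
    static int cg_write(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    return -1;
            }
            fputs(val, f);
            return fclose(f);
    }

    int main(void)
    {
            const char *cg = "/sys/fs/cgroup/rtwork";   /* assumed cgroup */
            char p[256];

            snprintf(p, sizeof(p), "%s/cpuset.cpus", cg);
            cg_write(p, "2-3");
            /* Optional for a local partition: defaults to cpuset.cpus. */
            snprintf(p, sizeof(p), "%s/cpuset.cpus.exclusive", cg);
            cg_write(p, "2-3");
            /* "root" keeps load balancing, "isolated" disables it. */
            snprintf(p, sizeof(p), "%s/cpuset.cpus.partition", cg);
            cg_write(p, "isolated");
            return 0;
    }
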
diff --combined MAINTAINERS
index 7ddf1db587c1a193604ce83079ba24347131f3e6,7fd72948d10e58f9172bc4f005c42f366ccd18ef..c6bafe60419d5842de8ac49449816b2f4108bcbb
@@@ -378,9 -378,8 +378,9 @@@ F: drivers/acpi/viot.
  F:    include/linux/acpi_viot.h
  
  ACPI WMI DRIVER
 +M:    Armin Wolf <[email protected]>
  L:    [email protected]
 -S:    Orphan
 +S:    Maintained
  F:    Documentation/driver-api/wmi.rst
  F:    Documentation/wmi/
  F:    drivers/platform/x86/wmi.c
@@@ -471,6 -470,7 +471,6 @@@ F: drivers/hwmon/adm1029.
  ADM8211 WIRELESS DRIVER
  L:    [email protected]
  S:    Orphan
 -W:    https://wireless.wiki.kernel.org/
  F:    drivers/net/wireless/admtek/adm8211.*
  
  ADP1653 FLASH CONTROLLER DRIVER
@@@ -908,7 -908,7 +908,7 @@@ F: drivers/crypto/ccp
  F:    include/linux/ccp.h
  
  AMD CRYPTOGRAPHIC COPROCESSOR (CCP) DRIVER - SEV SUPPORT
 -M:    Brijesh Singh <brijesh.singh@amd.com>
 +M:    Ashish Kalra <ashish.kalra@amd.com>
  M:    Tom Lendacky <[email protected]>
  L:    [email protected]
  S:    Supported
@@@ -1460,6 -1460,7 +1460,6 @@@ F:      drivers/hwmon/applesmc.
  APPLETALK NETWORK LAYER
  L:    [email protected]
  S:    Odd fixes
 -F:    drivers/net/appletalk/
  F:    include/linux/atalk.h
  F:    include/uapi/linux/atalk.h
  F:    net/appletalk/
@@@ -1584,17 -1585,6 +1584,17 @@@ F:    arch/arm/include/asm/arch_timer.
  F:    arch/arm64/include/asm/arch_timer.h
  F:    drivers/clocksource/arm_arch_timer.c
  
 +ARM GENERIC INTERRUPT CONTROLLER DRIVERS
 +M:    Marc Zyngier <[email protected]>
 +L:    [email protected] (moderated for non-subscribers)
 +S:    Maintained
 +F:    Documentation/devicetree/bindings/interrupt-controller/arm,gic*
 +F:    arch/arm/include/asm/arch_gicv3.h
 +F:    arch/arm64/include/asm/arch_gicv3.h
 +F:    drivers/irqchip/irq-gic*.[ch]
 +F:    include/linux/irqchip/arm-gic*.h
 +F:    include/linux/irqchip/arm-vgic-info.h
 +
  ARM HDLCD DRM DRIVER
  M:    Liviu Dudau <[email protected]>
  S:    Supported
@@@ -1636,13 -1626,13 +1636,13 @@@ F:   drivers/gpu/drm/arm/display/include
  F:    drivers/gpu/drm/arm/display/komeda/
  
  ARM MALI PANFROST DRM DRIVER
 +M:    Boris Brezillon <[email protected]>
  M:    Rob Herring <[email protected]>
 -M:    Tomeu Vizoso <[email protected]>
  R:    Steven Price <[email protected]>
 -R:    Alyssa Rosenzweig <[email protected]>
  L:    [email protected]
  S:    Supported
  T:    git git://anongit.freedesktop.org/drm/drm-misc
 +F:    Documentation/gpu/panfrost.rst
  F:    drivers/gpu/drm/panfrost/
  F:    include/uapi/drm/panfrost_drm.h
  
@@@ -1798,7 -1788,7 +1798,7 @@@ F:      drivers/irqchip/irq-owl-sirq.
  F:    drivers/mmc/host/owl-mmc.c
  F:    drivers/net/ethernet/actions/
  F:    drivers/pinctrl/actions/*
 -F:    drivers/soc/actions/
 +F:    drivers/pmdomain/actions/
  F:    include/dt-bindings/power/owl-*
  F:    include/dt-bindings/reset/actions,*
  F:    include/linux/soc/actions/
@@@ -1826,13 -1816,6 +1826,13 @@@ N:    allwinne
  N:    sun[x456789]i
  N:    sun[25]0i
  
 +ARM/AMD PENSANDO ARM64 ARCHITECTURE
 +M:    Brad Larson <[email protected]>
 +L:    [email protected] (moderated for non-subscribers)
 +S:    Supported
 +F:    Documentation/devicetree/bindings/*/amd,pensando*
 +F:    arch/arm64/boot/dts/amd/elba*
 +
  ARM/Amlogic Meson SoC CLOCK FRAMEWORK
  M:    Neil Armstrong <[email protected]>
  M:    Jerome Brunet <[email protected]>
@@@ -2228,28 -2211,21 +2228,28 @@@ F:   arch/arm/boot/dts/ti/omap/omap3-igep
  ARM/INTEL IXP4XX ARM ARCHITECTURE
  M:    Linus Walleij <[email protected]>
  M:    Imre Kaloz <[email protected]>
 -M:    Krzysztof Halasa <[email protected]>
  L:    [email protected] (moderated for non-subscribers)
  S:    Maintained
  F:    Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml
 -F:    Documentation/devicetree/bindings/gpio/intel,ixp4xx-gpio.txt
 +F:    Documentation/devicetree/bindings/gpio/intel,ixp4xx-gpio.yaml
  F:    Documentation/devicetree/bindings/interrupt-controller/intel,ixp4xx-interrupt.yaml
  F:    Documentation/devicetree/bindings/memory-controllers/intel,ixp4xx-expansion*
 +F:    Documentation/devicetree/bindings/rng/intel,ixp46x-rng.yaml
  F:    Documentation/devicetree/bindings/timer/intel,ixp4xx-timer.yaml
  F:    arch/arm/boot/dts/intel/ixp/
  F:    arch/arm/mach-ixp4xx/
  F:    drivers/bus/intel-ixp4xx-eb.c
 +F:    drivers/char/hw_random/ixp4xx-rng.c
  F:    drivers/clocksource/timer-ixp4xx.c
  F:    drivers/crypto/intel/ixp4xx/ixp4xx_crypto.c
  F:    drivers/gpio/gpio-ixp4xx.c
  F:    drivers/irqchip/irq-ixp4xx.c
 +F:    drivers/net/ethernet/xscale/ixp4xx_eth.c
 +F:    drivers/net/wan/ixp4xx_hss.c
 +F:    drivers/soc/ixp4xx/ixp4xx-npe.c
 +F:    drivers/soc/ixp4xx/ixp4xx-qmgr.c
 +F:    include/linux/soc/ixp4xx/npe.h
 +F:    include/linux/soc/ixp4xx/qmgr.h
  
  ARM/INTEL KEEMBAY ARCHITECTURE
  M:    Paul J. Murphy <[email protected]>
@@@ -2351,7 -2327,7 +2351,7 @@@ F:      drivers/rtc/rtc-mt7622.
  
  ARM/Mediatek SoC support
  M:    Matthias Brugger <[email protected]>
 -R:    AngeloGioacchino Del Regno <[email protected]>
 +M:    AngeloGioacchino Del Regno <[email protected]>
  L:    [email protected]
  L:    [email protected] (moderated for non-subscribers)
  L:    [email protected] (moderated for non-subscribers)
@@@ -3488,14 -3464,6 +3488,14 @@@ W:    http://bcache.evilpiepirate.or
  C:    irc://irc.oftc.net/bcache
  F:    drivers/md/bcache/
  
 +BCACHEFS
 +M:    Kent Overstreet <[email protected]>
 +R:    Brian Foster <[email protected]>
 +L:    [email protected]
 +S:    Supported
 +C:    irc://irc.oftc.net/bcache
 +F:    fs/bcachefs/
 +
  BDISP ST MEDIA DRIVER
  M:    Fabien Dessenne <[email protected]>
  L:    [email protected]
@@@ -3628,10 -3596,9 +3628,10 @@@ F:    Documentation/devicetree/bindings/ii
  F:    drivers/iio/accel/bma400*
  
  BPF JIT for ARM
 -M:    Shubham Bansal <[email protected]>
 +M:    Russell King <[email protected]>
 +M:    Puranjay Mohan <[email protected]>
  L:    [email protected]
 -S:    Odd Fixes
 +S:    Maintained
  F:    arch/arm/net/
  
  BPF JIT for ARM64
  S:    Odd Fixes
  K:    (?:\b|_)bpf(?:\b|_)
  
 +BPF [NETKIT] (BPF-programmable network device)
 +M:    Daniel Borkmann <[email protected]>
 +M:    Nikolay Aleksandrov <[email protected]>
 +L:    [email protected]
 +L:    [email protected]
 +S:    Supported
 +F:    drivers/net/netkit.c
 +F:    include/net/netkit.h
 +
  BPF [NETWORKING] (struct_ops, reuseport)
  M:    Martin KaFai Lau <[email protected]>
  L:    [email protected]
@@@ -4362,7 -4320,8 +4362,7 @@@ F:      drivers/net/ethernet/broadcom/bcmsys
  F:    drivers/net/ethernet/broadcom/unimac.h
  
  BROADCOM TG3 GIGABIT ETHERNET DRIVER
 -M:    Siva Reddy Kallam <[email protected]>
 -M:    Prashant Sreedharan <[email protected]>
 +M:    Pavan Chebbi <[email protected]>
  M:    Michael Chan <[email protected]>
  L:    [email protected]
  S:    Supported
@@@ -4835,13 -4794,6 +4835,13 @@@ X:    drivers/char/ipmi
  X:    drivers/char/random.c
  X:    drivers/char/tpm/
  
 +CHARGERLAB POWER-Z HARDWARE MONITOR DRIVER
 +M:    Thomas Weißschuh <[email protected]>
 +L:    [email protected]
 +S:    Maintained
 +F:    Documentation/hwmon/powerz.rst
 +F:    drivers/hwmon/powerz.c
 +
  CHECKPATCH
  M:    Andy Whitcroft <[email protected]>
  M:    Joe Perches <[email protected]>
@@@ -4959,7 -4911,6 +4959,7 @@@ F:      drivers/spi/spi-cs42l43
  F:    include/dt-bindings/sound/cs*
  F:    include/linux/mfd/cs42l43*
  F:    include/sound/cs*
 +F:    sound/pci/hda/cirrus*
  F:    sound/pci/hda/cs*
  F:    sound/pci/hda/hda_cs_dsp_ctl.*
  F:    sound/soc/codecs/cs*
@@@ -5099,14 -5050,6 +5099,14 @@@ T:    git git://git.kernel.org/pub/scm/lin
  F:    Documentation/devicetree/bindings/timer/
  F:    drivers/clocksource/
  
 +CLOSURES
 +M:    Kent Overstreet <[email protected]>
 +L:    [email protected]
 +S:    Supported
 +C:    irc://irc.oftc.net/bcache
 +F:    include/linux/closure.h
 +F:    lib/closure.c
 +
  CMPC ACPI DRIVER
  M:    Thadeu Lima de Souza Cascardo <[email protected]>
  M:    Daniel Oliveira Nascimento <[email protected]>
@@@ -5252,12 -5195,6 +5252,12 @@@ S:    Orpha
  W:    http://accessrunner.sourceforge.net/
  F:    drivers/usb/atm/cxacru.c
  
 +CONFIDENTIAL COMPUTING THREAT MODEL FOR X86 VIRTUALIZATION (SNP/TDX)
 +M:    Elena Reshetova <[email protected]>
 +M:    Carlos Bilbao <[email protected]>
 +S:    Maintained
 +F:    Documentation/security/snp-tdx-threat-model.rst
 +
  CONFIGFS
  M:    Joel Becker <[email protected]>
  M:    Christoph Hellwig <[email protected]>
@@@ -5332,6 -5269,7 +5332,7 @@@ S:      Maintaine
  F:    mm/memcontrol.c
  F:    mm/swap_cgroup.c
  F:    tools/testing/selftests/cgroup/memcg_protection.m
+ F:    tools/testing/selftests/cgroup/test_hugetlb_memcg.c
  F:    tools/testing/selftests/cgroup/test_kmem.c
  F:    tools/testing/selftests/cgroup/test_memcontrol.c
  
@@@ -5372,6 -5310,12 +5373,6 @@@ M:     Bence Csókás <[email protected]
  S:    Maintained
  F:    drivers/i2c/busses/i2c-cp2615.c
  
 -CPMAC ETHERNET DRIVER
 -M:    Florian Fainelli <[email protected]>
 -L:    [email protected]
 -S:    Maintained
 -F:    drivers/net/ethernet/ti/cpmac.c
 -
  CPU FREQUENCY DRIVERS - VEXPRESS SPC ARM BIG LITTLE
  M:    Viresh Kumar <[email protected]>
  M:    Sudeep Holla <[email protected]>
@@@ -5655,7 -5599,7 +5656,7 @@@ M:      Andrew Donnellan <[email protected]
  L:    [email protected]
  S:    Supported
  F:    Documentation/ABI/testing/sysfs-class-cxl
 -F:    Documentation/powerpc/cxl.rst
 +F:    Documentation/arch/powerpc/cxl.rst
  F:    arch/powerpc/platforms/powernv/pci-cxl.c
  F:    drivers/misc/cxl/
  F:    include/misc/cxl*
@@@ -5667,7 -5611,7 +5668,7 @@@ M:      Matthew R. Ochs <[email protected]
  M:    Uma Krishnan <[email protected]>
  L:    [email protected]
  S:    Supported
 -F:    Documentation/powerpc/cxlflash.rst
 +F:    Documentation/arch/powerpc/cxlflash.rst
  F:    drivers/scsi/cxlflash/
  F:    include/uapi/scsi/cxlflash_ioctl.h
  
@@@ -6042,9 -5986,8 +6043,9 @@@ F:      include/linux/devm-helpers.
  DEVICE-MAPPER  (LVM)
  M:    Alasdair Kergon <[email protected]>
  M:    Mike Snitzer <[email protected]>
 -M:    [email protected]
 -L:    [email protected]
 +M:    Mikulas Patocka <[email protected]>
 +M:    [email protected]
 +L:    [email protected]
  S:    Maintained
  W:    http://sources.redhat.com/dm
  Q:    http://patchwork.kernel.org/project/dm-devel/list/
@@@ -6173,11 -6116,11 +6174,11 @@@ F:   drivers/video/fbdev/udlfb.
  F:    include/video/udlfb.h
  
  DISTRIBUTED LOCK MANAGER (DLM)
 -M:    Christine Caulfield <ccaulfie@redhat.com>
 +M:    Alexander Aring <aahringo@redhat.com>
  M:    David Teigland <[email protected]>
  L:    [email protected]
  S:    Supported
 -W:    http://sources.redhat.com/cluster/
 +W:    https://pagure.io/dlm
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git
  F:    fs/dlm/
  
@@@ -6190,7 -6133,6 +6191,7 @@@ L:      [email protected] (mode
  S:    Maintained
  T:    git git://anongit.freedesktop.org/drm/drm-misc
  F:    Documentation/driver-api/dma-buf.rst
 +F:    Documentation/userspace-api/dma-buf-alloc-exchange.rst
  F:    drivers/dma-buf/
  F:    include/linux/*fence.h
  F:    include/linux/dma-buf.h
@@@ -6393,17 -6335,6 +6394,17 @@@ F:    Documentation/networking/device_driv
  F:    drivers/net/ethernet/freescale/dpaa2/dpaa2-switch*
  F:    drivers/net/ethernet/freescale/dpaa2/dpsw*
  
 +DPLL SUBSYSTEM
 +M:    Vadim Fedorenko <[email protected]>
 +M:    Arkadiusz Kubalewski <[email protected]>
 +M:    Jiri Pirko <[email protected]>
 +L:    [email protected]
 +S:    Supported
 +F:    Documentation/driver-api/dpll.rst
 +F:    drivers/dpll/*
 +F:    include/linux/dpll.h
 +F:    include/uapi/linux/dpll.h
 +
  DRBD DRIVER
  M:    Philipp Reisner <[email protected]>
  M:    Lars Ellenberg <[email protected]>
@@@ -6683,7 -6614,6 +6684,7 @@@ S:      Maintaine
  B:    https://gitlab.freedesktop.org/drm/msm/-/issues
  T:    git https://gitlab.freedesktop.org/drm/msm.git
  F:    Documentation/devicetree/bindings/display/msm/
 +F:    drivers/gpu/drm/ci/xfails/msm*
  F:    drivers/gpu/drm/msm/
  F:    include/uapi/drm/msm_drm.h
  
@@@ -6835,8 -6765,7 +6836,8 @@@ DRM DRIVER FOR SOLOMON SSD130X OLED DIS
  M:    Javier Martinez Canillas <[email protected]>
  S:    Maintained
  T:    git git://anongit.freedesktop.org/drm/drm-misc
 -F:    Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml
 +F:    Documentation/devicetree/bindings/display/solomon,ssd-common.yaml
 +F:    Documentation/devicetree/bindings/display/solomon,ssd13*.yaml
  F:    drivers/gpu/drm/solomon/ssd130x*
  
  DRM DRIVER FOR ST-ERICSSON MCDE
@@@ -6931,26 -6860,12 +6932,26 @@@ M:   Thomas Zimmermann <tzimmermann@suse.
  S:    Maintained
  W:    https://01.org/linuxgraphics/gfx-docs/maintainer-tools/drm-misc.html
  T:    git git://anongit.freedesktop.org/drm/drm-misc
 +F:    Documentation/devicetree/bindings/display/
 +F:    Documentation/devicetree/bindings/gpu/
  F:    Documentation/gpu/
 -F:    drivers/gpu/drm/*
 +F:    drivers/gpu/drm/
  F:    drivers/gpu/vga/
 -F:    include/drm/drm*
 +F:    include/drm/drm
  F:    include/linux/vga*
 -F:    include/uapi/drm/drm*
 +F:    include/uapi/drm/
 +X:    drivers/gpu/drm/amd/
 +X:    drivers/gpu/drm/armada/
 +X:    drivers/gpu/drm/etnaviv/
 +X:    drivers/gpu/drm/exynos/
 +X:    drivers/gpu/drm/i915/
 +X:    drivers/gpu/drm/kmb/
 +X:    drivers/gpu/drm/mediatek/
 +X:    drivers/gpu/drm/msm/
 +X:    drivers/gpu/drm/nouveau/
 +X:    drivers/gpu/drm/radeon/
 +X:    drivers/gpu/drm/renesas/
 +X:    drivers/gpu/drm/tegra/
  
  DRM DRIVERS FOR ALLWINNER A10
  M:    Maxime Ripard <[email protected]>
@@@ -6971,7 -6886,6 +6972,7 @@@ T:      git git://anongit.freedesktop.org/dr
  F:    Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml
  F:    Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml
  F:    Documentation/gpu/meson.rst
 +F:    drivers/gpu/drm/ci/xfails/meson*
  F:    drivers/gpu/drm/meson/
  
  DRM DRIVERS FOR ATMEL HLCDC
@@@ -6995,9 -6909,7 +6996,9 @@@ T:      git git://anongit.freedesktop.org/dr
  F:    Documentation/devicetree/bindings/display/bridge/
  F:    drivers/gpu/drm/bridge/
  F:    drivers/gpu/drm/drm_bridge.c
 +F:    drivers/gpu/drm/drm_bridge_connector.c
  F:    include/drm/drm_bridge.h
 +F:    include/drm/drm_bridge_connector.h
  
  DRM DRIVERS FOR EXYNOS
  M:    Inki Dae <[email protected]>
@@@ -7021,12 -6933,10 +7022,12 @@@ F:   Documentation/devicetree/bindings/di
  F:    Documentation/devicetree/bindings/display/fsl,tcon.txt
  F:    drivers/gpu/drm/fsl-dcu/
  
 -DRM DRIVERS FOR FREESCALE IMX
 +DRM DRIVERS FOR FREESCALE IMX 5/6
  M:    Philipp Zabel <[email protected]>
  L:    [email protected]
  S:    Maintained
 +T:    git git://anongit.freedesktop.org/drm/drm-misc
 +T:    git git://git.pengutronix.de/git/pza/linux
  F:    Documentation/devicetree/bindings/display/imx/
  F:    drivers/gpu/drm/imx/ipuv3/
  F:    drivers/gpu/ipu-v3/
@@@ -7045,7 -6955,7 +7046,7 @@@ DRM DRIVERS FOR GMA500 (Poulsbo, Moores
  M:    Patrik Jakobsson <[email protected]>
  L:    [email protected]
  S:    Maintained
 -T:    git git://github.com/patjak/drm-gma500
 +T:    git git://anongit.freedesktop.org/drm/drm-misc
  F:    drivers/gpu/drm/gma500/
  
  DRM DRIVERS FOR HISILICON
@@@ -7084,7 -6994,6 +7085,7 @@@ L:      [email protected]
  L:    [email protected] (moderated for non-subscribers)
  S:    Supported
  F:    Documentation/devicetree/bindings/display/mediatek/
 +F:    drivers/gpu/drm/ci/xfails/mediatek*
  F:    drivers/gpu/drm/mediatek/
  F:    drivers/phy/mediatek/phy-mtk-dp.c
  F:    drivers/phy/mediatek/phy-mtk-hdmi*
@@@ -7125,7 -7034,6 +7126,7 @@@ L:      [email protected]
  S:    Maintained
  T:    git git://anongit.freedesktop.org/drm/drm-misc
  F:    Documentation/devicetree/bindings/display/rockchip/
 +F:    drivers/gpu/drm/ci/xfails/rockchip*
  F:    drivers/gpu/drm/rockchip/
  
  DRM DRIVERS FOR STI
@@@ -7222,7 -7130,7 +7223,7 @@@ F:      Documentation/devicetree/bindings/di
  F:    drivers/gpu/drm/xlnx/
  
  DRM GPU SCHEDULER
 -M:    Luben Tuikov <luben.tuikov@amd.com>
 +M:    Luben Tuikov <ltuikov89@gmail.com>
  L:    [email protected]
  S:    Maintained
  T:    git git://anongit.freedesktop.org/drm/drm-misc
@@@ -7231,7 -7139,6 +7232,7 @@@ F:      include/drm/gpu_scheduler.
  
  DRM PANEL DRIVERS
  M:    Neil Armstrong <[email protected]>
 +R:    Jessica Zhang <[email protected]>
  R:    Sam Ravnborg <[email protected]>
  L:    [email protected]
  S:    Maintained
@@@ -8187,7 -8094,7 +8188,7 @@@ F:      include/linux/arm_ffa.
  
  FIRMWARE LOADER (request_firmware)
  M:    Luis Chamberlain <[email protected]>
 -M:    Russ Weight <russ[email protected]>
 +M:    Russ Weight <russ[email protected]>
  L:    [email protected]
  S:    Maintained
  F:    Documentation/firmware_class/
@@@ -8218,7 -8125,7 +8219,7 @@@ M:      Geoffrey D. Bennett <[email protected]
  L:    [email protected] (moderated for non-subscribers)
  S:    Maintained
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
 -F:    sound/usb/mixer_scarlett_gen2.c
 +F:    sound/usb/mixer_scarlett2.c
  
  FORCEDETH GIGABIT ETHERNET DRIVER
  M:    Rain River <[email protected]>
@@@ -8709,8 -8616,6 +8710,8 @@@ L:      [email protected]
  S:    Maintained
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/hardening
  F:    Documentation/kbuild/gcc-plugins.rst
 +F:    include/linux/stackleak.h
 +F:    kernel/stackleak.c
  F:    scripts/Makefile.gcc-plugins
  F:    scripts/gcc-plugins/
  
@@@ -8826,13 -8731,6 +8827,13 @@@ S:    Supporte
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm.git
  F:    drivers/pmdomain/
  
 +GENERIC RADIX TREE
 +M:    Kent Overstreet <[email protected]>
 +S:    Supported
 +C:    irc://irc.oftc.net/bcache
 +F:    include/linux/generic-radix-tree.h
 +F:    lib/generic-radix-tree.c
 +
  GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
  M:    Eugen Hristev <[email protected]>
  L:    [email protected]
@@@ -9169,7 -9067,6 +9170,7 @@@ T:      git https://git.kernel.org/pub/scm/l
  F:    Documentation/ABI/testing/debugfs-driver-habanalabs
  F:    Documentation/ABI/testing/sysfs-driver-habanalabs
  F:    drivers/accel/habanalabs/
 +F:    include/linux/habanalabs/
  F:    include/trace/events/habanalabs.h
  F:    include/uapi/drm/habanalabs_accel.h
  
@@@ -9635,8 -9532,10 +9636,8 @@@ F:     Documentation/devicetree/bindings/ii
  F:    drivers/iio/pressure/mprls0025pa.c
  
  HOST AP DRIVER
 -M:    Jouni Malinen <[email protected]>
  L:    [email protected]
  S:    Obsolete
 -W:    http://w1.fi/hostap-driver.html
  F:    drivers/net/wireless/intersil/hostap/
  
  HP BIOSCFG DRIVER
@@@ -9754,6 -9653,7 +9755,7 @@@ F:      include/linux/hugetlb.
  F:    mm/hugetlb.c
  F:    mm/hugetlb_vmemmap.c
  F:    mm/hugetlb_vmemmap.h
+ F:    tools/testing/selftests/cgroup/test_hugetlb_memcg.c
  
  HVA ST MEDIA DRIVER
  M:    Jean-Christophe Trotin <[email protected]>
@@@ -10036,6 -9936,12 +10038,6 @@@ F:   Documentation/driver-api/i3
  F:    drivers/i3c/
  F:    include/linux/i3c/
  
 -IA64 (Itanium) PLATFORM
 -L:    [email protected]
 -S:    Orphan
 -F:    Documentation/arch/ia64/
 -F:    arch/ia64/
 -
  IBM Operation Panel Input Driver
  M:    Eddie James <[email protected]>
  L:    [email protected]
@@@ -10531,6 -10437,7 +10533,6 @@@ F:   drivers/platform/x86/intel/atomisp2/
  
  INTEL BIOS SAR INT1092 DRIVER
  M:    Shravan Sudhakar <[email protected]>
 -M:    Intel Corporation <[email protected]>
  L:    [email protected]
  S:    Maintained
  F:    drivers/platform/x86/intel/int1092/
@@@ -10570,7 -10477,6 +10572,7 @@@ C:   irc://irc.oftc.net/intel-gf
  T:    git git://anongit.freedesktop.org/drm-intel
  F:    Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
  F:    Documentation/gpu/i915.rst
 +F:    drivers/gpu/drm/ci/xfails/i915*
  F:    drivers/gpu/drm/i915/
  F:    include/drm/i915*
  F:    include/uapi/drm/i915_drm.h
  S:    Maintained
  F:    drivers/crypto/intel/ixp4xx/ixp4xx_crypto.c
  
 -INTEL IXP4XX QMGR, NPE, ETHERNET and HSS SUPPORT
 -M:    Krzysztof Halasa <[email protected]>
 -S:    Maintained
 -F:    drivers/net/ethernet/xscale/ixp4xx_eth.c
 -F:    drivers/net/wan/ixp4xx_hss.c
 -F:    drivers/soc/ixp4xx/ixp4xx-npe.c
 -F:    drivers/soc/ixp4xx/ixp4xx-qmgr.c
 -F:    include/linux/soc/ixp4xx/npe.h
 -F:    include/linux/soc/ixp4xx/qmgr.h
 -
 -INTEL IXP4XX RANDOM NUMBER GENERATOR SUPPORT
 -M:    Deepak Saxena <[email protected]>
 -S:    Maintained
 -F:    Documentation/devicetree/bindings/rng/intel,ixp46x-rng.yaml
 -F:    drivers/char/hw_random/ixp4xx-rng.c
 -
  INTEL KEEM BAY DRM DRIVER
  M:    Anitha Chrisanthus <[email protected]>
  M:    Edmund Dea <[email protected]>
@@@ -10781,7 -10703,7 +10783,7 @@@ F:   drivers/mfd/intel-m10-bmc
  F:    include/linux/mfd/intel-m10-bmc.h
  
  INTEL MAX10 BMC SECURE UPDATES
 -M:    Russ Weight <russell.h.weight@intel.com>
 +M:    Peter Colberg <peter.colberg@intel.com>
  L:    [email protected]
  S:    Maintained
  F:    Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-sec-update
@@@ -10961,6 -10883,7 +10963,6 @@@ F:   drivers/platform/x86/intel/wmi/thund
  
  INTEL WWAN IOSM DRIVER
  M:    M Chetan Kumar <[email protected]>
 -M:    Intel Corporation <[email protected]>
  L:    [email protected]
  S:    Maintained
  F:    drivers/net/wwan/iosm/
@@@ -11142,7 -11065,7 +11144,7 @@@ F:   Documentation/devicetree/bindings/so
  F:    sound/soc/codecs/sma*
  
  IRQ DOMAINS (IRQ NUMBER MAPPING LIBRARY)
 -M:    Marc Zyngier <[email protected]>
 +M:    Thomas Gleixner <[email protected]>
  S:    Maintained
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
  F:    Documentation/core-api/irq/irq-domain.rst
@@@ -11161,6 -11084,7 +11163,6 @@@ F:   lib/group_cpus.
  
  IRQCHIP DRIVERS
  M:    Thomas Gleixner <[email protected]>
 -M:    Marc Zyngier <[email protected]>
  L:    [email protected]
  S:    Maintained
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
@@@ -11221,6 -11145,7 +11223,6 @@@ M:   Sagi Grimberg <[email protected]
  L:    [email protected]
  L:    [email protected]
  S:    Supported
 -W:    http://www.linux-iscsi.org
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git master
  F:    drivers/infiniband/ulp/isert
  
@@@ -11470,20 -11395,16 +11472,20 @@@ F:        usr
  
  KERNEL HARDENING (not covered by other areas)
  M:    Kees Cook <[email protected]>
 +R:    Gustavo A. R. Silva <[email protected]>
  L:    [email protected]
  S:    Supported
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/hardening
  F:    Documentation/ABI/testing/sysfs-kernel-oops_count
  F:    Documentation/ABI/testing/sysfs-kernel-warn_count
 +F:    arch/*/configs/hardening.config
  F:    include/linux/overflow.h
  F:    include/linux/randomize_kstack.h
 +F:    kernel/configs/hardening.config
  F:    mm/usercopy.c
  K:    \b(add|choose)_random_kstack_offset\b
  K:    \b__check_(object_size|heap_object)\b
 +K:    \b__counted_by\b
  
  KERNEL JANITORS
  L:    [email protected]
@@@ -11604,18 -11525,6 +11606,18 @@@ F: include/kvm/arm_
  F:    tools/testing/selftests/kvm/*/aarch64/
  F:    tools/testing/selftests/kvm/aarch64/
  
 +KERNEL VIRTUAL MACHINE FOR LOONGARCH (KVM/LoongArch)
 +M:    Tianrui Zhao <[email protected]>
 +M:    Bibo Mao <[email protected]>
 +M:    Huacai Chen <[email protected]>
 +L:    [email protected]
 +L:    [email protected]
 +S:    Maintained
 +T:    git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
 +F:    arch/loongarch/include/asm/kvm*
 +F:    arch/loongarch/include/uapi/asm/kvm*
 +F:    arch/loongarch/kvm/
 +
  KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips)
  M:    Huacai Chen <[email protected]>
  L:    [email protected]
@@@ -11652,7 -11561,6 +11654,7 @@@ F:   arch/riscv/include/asm/kvm
  F:    arch/riscv/include/uapi/asm/kvm*
  F:    arch/riscv/kvm/
  F:    tools/testing/selftests/kvm/*/riscv/
 +F:    tools/testing/selftests/kvm/riscv/
  
  KERNEL VIRTUAL MACHINE for s390 (KVM/s390)
  M:    Christian Borntraeger <[email protected]>
@@@ -12190,7 -12098,7 +12192,7 @@@ F:   Documentation/ABI/stable/sysfs-firmw
  F:    Documentation/devicetree/bindings/i2c/i2c-opal.txt
  F:    Documentation/devicetree/bindings/powerpc/
  F:    Documentation/devicetree/bindings/rtc/rtc-opal.txt
 -F:    Documentation/powerpc/
 +F:    Documentation/arch/powerpc/
  F:    arch/powerpc/
  F:    drivers/*/*/*pasemi*
  F:    drivers/*/*pasemi*
@@@ -12546,14 -12454,6 +12548,14 @@@ F: drivers/hwmon/ltc2947-i2c.
  F:    drivers/hwmon/ltc2947-spi.c
  F:    drivers/hwmon/ltc2947.h
  
 +LTC2991 HARDWARE MONITOR DRIVER
 +M:    Antoniu Miclaus <[email protected]>
 +L:    [email protected]
 +S:    Supported
 +W:    https://ez.analog.com/linux-software-drivers
 +F:    Documentation/devicetree/bindings/hwmon/adi,ltc2991.yaml
 +F:    drivers/hwmon/ltc2991.c
 +
  LTC2983 IIO TEMPERATURE DRIVER
  M:    Nuno Sá <[email protected]>
  L:    [email protected]
@@@ -13605,6 -13505,7 +13607,6 @@@ F:   net/dsa/tag_mtk.
  
  MEDIATEK T7XX 5G WWAN MODEM DRIVER
  M:    Chandrashekar Devegowda <[email protected]>
 -M:    Intel Corporation <[email protected]>
  R:    Chiranjeevi Rapolu <[email protected]>
  R:    Liu Haijun <[email protected]>
  R:    M Chetan Kumar <[email protected]>
@@@ -13625,7 -13526,7 +13627,7 @@@ F:   drivers/usb/mtu3
  
  MEGACHIPS STDPXXXX-GE-B850V3-FW LVDS/DP++ BRIDGES
  M:    Peter Senna Tschudin <[email protected]>
 -M:    Martin Donnelly <martin.donnell[email protected]>
 +M:    Ian Ray <ian.ra[email protected]>
  M:    Martyn Welch <[email protected]>
  S:    Maintained
  F:    Documentation/devicetree/bindings/display/bridge/megachips-stdpxxxx-ge-b850v3-fw.txt
@@@ -13635,7 -13536,6 +13637,7 @@@ MEGARAID SCSI/SAS DRIVER
  M:    Kashyap Desai <[email protected]>
  M:    Sumit Saxena <[email protected]>
  M:    Shivasharan S <[email protected]>
 +M:    Chandrakanth patil <[email protected]>
  L:    [email protected]
  L:    [email protected]
  S:    Maintained
@@@ -13949,10 -13849,9 +13951,10 @@@ F: Documentation/devicetree/bindings/me
  F:    drivers/staging/media/meson/vdec/
  
  METHODE UDPU SUPPORT
 -M:    Vladimir Vid <vladimir.vid@sartura.hr>
 +M:    Robert Marko <robert.marko@sartura.hr>
  S:    Maintained
 -F:    arch/arm64/boot/dts/marvell/armada-3720-uDPU.dts
 +F:    arch/arm64/boot/dts/marvell/armada-3720-eDPU.dts
 +F:    arch/arm64/boot/dts/marvell/armada-3720-uDPU.*
  
  MHI BUS
  M:    Manivannan Sadhasivam <[email protected]>
@@@ -14131,7 -14030,7 +14133,7 @@@ F:   Documentation/devicetree/bindings/ii
  F:    drivers/iio/adc/mcp3911.c
  
  MICROCHIP MMC/SD/SDIO MCI DRIVER
 -M:    Ludovic Desroches <ludovic.desroche[email protected]>
 +M:    Aubin Constans <aubin.constan[email protected]>
  S:    Maintained
  F:    drivers/mmc/host/atmel-mci.c
  
@@@ -14450,11 -14349,9 +14452,11 @@@ MIPS/LOONGSON1 ARCHITECTUR
  M:    Keguang Zhang <[email protected]>
  L:    [email protected]
  S:    Maintained
 +F:    Documentation/devicetree/bindings/*/loongson,ls1*.yaml
  F:    arch/mips/include/asm/mach-loongson32/
  F:    arch/mips/loongson32/
  F:    drivers/*/*loongson1*
 +F:    drivers/net/ethernet/stmicro/stmmac/dwmac-loongson1.c
  
  MIPS/LOONGSON2EF ARCHITECTURE
  M:    Jiaxun Yang <[email protected]>
@@@ -14482,11 -14379,6 +14484,11 @@@ W: https://linuxtv.or
  T:    git git://linuxtv.org/media_tree.git
  F:    drivers/media/radio/radio-miropcm20*
  
 +MITSUMI MM8013 FG DRIVER
 +M:    Konrad Dybcio <[email protected]>
 +F:    Documentation/devicetree/bindings/power/supply/mitsumi,mm8013.yaml
 +F:    drivers/power/supply/mm8013.c
 +
  MMP SUPPORT
  R:    Lubomir Rintel <[email protected]>
  L:    [email protected] (moderated for non-subscribers)
  S:    Maintained
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git modules-next
  F:    include/linux/kmod.h
 -F:    include/linux/module.h
 +F:    include/linux/module*.h
  F:    kernel/module/
  F:    lib/test_kmod.c
  F:    scripts/module*
@@@ -15057,7 -14949,7 +15059,7 @@@ K:   macse
  K:    \bmdo_
  
  NETWORKING [MPTCP]
 -M:    Matthieu Baerts <matt[email protected]>
 +M:    Matthieu Baerts <matt[email protected]>
  M:    Mat Martineau <[email protected]>
  L:    [email protected]
  L:    [email protected]
@@@ -15066,11 -14958,10 +15068,11 @@@ W:        https://github.com/multipath-tcp/mpt
  B:    https://github.com/multipath-tcp/mptcp_net-next/issues
  T:    git https://github.com/multipath-tcp/mptcp_net-next.git export-net
  T:    git https://github.com/multipath-tcp/mptcp_net-next.git export
 +F:    Documentation/netlink/specs/mptcp.yaml
  F:    Documentation/networking/mptcp-sysctl.rst
  F:    include/net/mptcp.h
  F:    include/trace/events/mptcp.h
 -F:    include/uapi/linux/mptcp.h
 +F:    include/uapi/linux/mptcp*.h
  F:    net/mptcp/
  F:    tools/testing/selftests/bpf/*/*mptcp*.c
  F:    tools/testing/selftests/net/mptcp/
@@@ -15243,7 -15134,7 +15245,7 @@@ NOLIBC HEADER FIL
  M:    Willy Tarreau <[email protected]>
  M:    Thomas Weißschuh <[email protected]>
  S:    Maintained
 -T:    git git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
 +T:    git git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc.git
  F:    tools/include/nolibc/
  F:    tools/testing/selftests/nolibc/
  
@@@ -15463,7 -15354,6 +15465,7 @@@ M:   Laurentiu Palcu <laurentiu.palcu@oss
  R:    Lucas Stach <[email protected]>
  L:    [email protected]
  S:    Maintained
 +T:    git git://anongit.freedesktop.org/drm/drm-misc
  F:    Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml
  F:    drivers/gpu/drm/imx/dcss/
  
@@@ -15536,7 -15426,7 +15538,7 @@@ NXP TFA9879 DRIVE
  M:    Peter Rosin <[email protected]>
  L:    [email protected] (moderated for non-subscribers)
  S:    Maintained
 -F:    Documentation/devicetree/bindings/sound/tfa9879.txt
 +F:    Documentation/devicetree/bindings/sound/nxp,tfa9879.yaml
  F:    sound/soc/codecs/tfa9879*
  
  NXP-NCI NFC DRIVER
@@@ -15573,13 -15463,6 +15575,13 @@@ F: include/linux/objagg.
  F:    lib/objagg.c
  F:    lib/test_objagg.c
  
 +OBJPOOL
 +M:    Matt Wu <[email protected]>
 +S:    Supported
 +F:    include/linux/objpool.h
 +F:    lib/objpool.c
 +F:    lib/test_objpool.c
 +
  OBJTOOL
  M:    Josh Poimboeuf <[email protected]>
  M:    Peter Zijlstra <[email protected]>
@@@ -16091,7 -15974,6 +16093,7 @@@ F:   Documentation/ABI/testing/sysfs-firm
  F:    drivers/of/
  F:    include/linux/of*.h
  F:    scripts/dtc/
 +F:    tools/testing/selftests/dt/
  K:    of_overlay_notifier_
  K:    of_overlay_fdt_apply
  K:    of_overlay_remove
  S:    Maintained
  F:    drivers/i2c/muxes/i2c-mux-pca9541.c
  
 -PCDP - PRIMARY CONSOLE AND DEBUG PORT
 -M:    Khalid Aziz <[email protected]>
 -S:    Maintained
 -F:    drivers/firmware/pcdp.*
 -
  PCI DRIVER FOR AARDVARK (Marvell Armada 3700)
  M:    Thomas Petazzoni <[email protected]>
  M:    Pali Rohár <[email protected]>
  S:    Maintained
  F:    Documentation/devicetree/bindings/pci/*rcar*
  F:    drivers/pci/controller/*rcar*
 +F:    drivers/pci/controller/dwc/*rcar*
  
  PCI DRIVER FOR SAMSUNG EXYNOS
  M:    Jingoo Han <[email protected]>
@@@ -16608,7 -16494,7 +16610,7 @@@ R:   Oliver O'Halloran <[email protected]
  L:    [email protected]
  S:    Supported
  F:    Documentation/PCI/pci-error-recovery.rst
 -F:    Documentation/powerpc/eeh-pci-error-recovery.rst
 +F:    Documentation/arch/powerpc/eeh-pci-error-recovery.rst
  F:    arch/powerpc/include/*/eeh*.h
  F:    arch/powerpc/kernel/eeh*.c
  F:    arch/powerpc/platforms/*/eeh*.c
@@@ -17718,7 -17604,6 +17720,7 @@@ M:   Kalle Valo <[email protected]
  M:    Jeff Johnson <[email protected]>
  L:    [email protected]
  S:    Supported
 +W:    https://wireless.wiki.kernel.org/en/users/Drivers/ath12k
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git
  F:    drivers/net/wireless/ath/ath12k/
  
@@@ -17918,18 -17803,6 +17920,18 @@@ S: Maintaine
  F:    Documentation/devicetree/bindings/mtd/qcom,nandc.yaml
  F:    drivers/mtd/nand/raw/qcom_nandc.c
  
 +QUALCOMM QSEECOM DRIVER
 +M:    Maximilian Luz <[email protected]>
 +L:    [email protected]
 +S:    Maintained
 +F:    drivers/firmware/qcom/qcom_qseecom.c
 +
 +QUALCOMM QSEECOM UEFISECAPP DRIVER
 +M:    Maximilian Luz <[email protected]>
 +L:    [email protected]
 +S:    Maintained
 +F:    drivers/firmware/qcom/qcom_qseecom_uefisecapp.c
 +
  QUALCOMM RMNET DRIVER
  M:    Subash Abhinov Kasiviswanathan <[email protected]>
  M:    Sean Tranchetti <[email protected]>
@@@ -17992,7 -17865,6 +17994,7 @@@ C:   irc://irc.oftc.net/radeo
  T:    git https://gitlab.freedesktop.org/agd5f/linux.git
  F:    Documentation/gpu/amdgpu/
  F:    drivers/gpu/drm/amd/
 +F:    drivers/gpu/drm/ci/xfails/amd*
  F:    drivers/gpu/drm/radeon/
  F:    include/uapi/drm/amdgpu_drm.h
  F:    include/uapi/drm/radeon_drm.h
@@@ -18057,6 -17929,7 +18059,6 @@@ F:   arch/mips/boot/dts/ralink/mt7621
  
  RALINK RT2X00 WIRELESS LAN DRIVER
  M:    Stanislaw Gruszka <[email protected]>
 -M:    Helmut Schaa <[email protected]>
  L:    [email protected]
  S:    Maintained
  F:    drivers/net/wireless/ralink/rt2x00/
@@@ -18261,6 -18134,8 +18263,6 @@@ REALTEK WIRELESS DRIVER (rtlwifi family
  M:    Ping-Ke Shih <[email protected]>
  L:    [email protected]
  S:    Maintained
 -W:    https://wireless.wiki.kernel.org/
 -T:    git git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git
  F:    drivers/net/wireless/realtek/rtlwifi/
  
  REALTEK WIRELESS DRIVER (rtw88)
  S:    Supported
  Q:    https://patchwork.kernel.org/project/linux-riscv/list/
  C:    irc://irc.libera.chat/riscv
 -P:    Documentation/riscv/patch-acceptance.rst
 +P:    Documentation/arch/riscv/patch-acceptance.rst
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git
  F:    arch/riscv/
  N:    riscv
@@@ -18788,6 -18663,7 +18790,6 @@@ F:   drivers/media/dvb-frontends/rtl2832_
  RTL8180 WIRELESS DRIVER
  L:    [email protected]
  S:    Orphan
 -W:    https://wireless.wiki.kernel.org/
  F:    drivers/net/wireless/realtek/rtl818x/rtl8180/
  
  RTL8187 WIRELESS DRIVER
@@@ -18795,12 -18671,14 +18797,12 @@@ M:        Hin-Tak Leung <[email protected]
  M:    Larry Finger <[email protected]>
  L:    [email protected]
  S:    Maintained
 -W:    https://wireless.wiki.kernel.org/
  F:    drivers/net/wireless/realtek/rtl818x/rtl8187/
  
  RTL8XXXU WIRELESS DRIVER (rtl8xxxu)
  M:    Jes Sorensen <[email protected]>
  L:    [email protected]
  S:    Maintained
 -T:    git git://git.kernel.org/pub/scm/linux/kernel/git/jes/linux.git rtl8xxxu-devel
  F:    drivers/net/wireless/realtek/rtl8xxxu/
  
  RTRS TRANSPORT DRIVERS
@@@ -18833,10 -18711,9 +18835,10 @@@ R: Andreas Hindborg <a.hindborg@samsung
  R:    Alice Ryhl <[email protected]>
  L:    [email protected]
  S:    Supported
 -W:    https://github.com/Rust-for-Linux/linux
 +W:    https://rust-for-linux.com
  B:    https://github.com/Rust-for-Linux/linux/issues
  C:    zulip://rust-for-linux.zulipchat.com
 +P:    https://rust-for-linux.com/contributing
  T:    git https://github.com/Rust-for-Linux/linux.git rust-next
  F:    Documentation/rust/
  F:    rust/
@@@ -19288,6 -19165,7 +19290,6 @@@ M:   "Martin K. Petersen" <martin.peterse
  L:    [email protected]
  L:    [email protected]
  S:    Supported
 -W:    http://www.linux-iscsi.org
  Q:    https://patchwork.kernel.org/project/target-devel/list/
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
  F:    Documentation/target/
@@@ -19370,8 -19248,7 +19372,8 @@@ F:   Documentation/devicetree/bindings/mm
  F:    drivers/mmc/host/sdhci*
  
  SECURE DIGITAL HOST CONTROLLER INTERFACE (SDHCI) MICROCHIP DRIVER
 -M:    Eugen Hristev <[email protected]>
 +M:    Aubin Constans <[email protected]>
 +R:    Eugen Hristev <[email protected]>
  L:    [email protected]
  S:    Supported
  F:    drivers/mmc/host/sdhci-of-at91.c
@@@ -19528,7 -19405,6 +19530,7 @@@ F:   drivers/net/ethernet/sfc
  
  SFCTEMP HWMON DRIVER
  M:    Emil Renner Berthing <[email protected]>
 +M:    Hal Feng <[email protected]>
  L:    [email protected]
  S:    Maintained
  F:    Documentation/devicetree/bindings/hwmon/starfive,jh71x0-temp.yaml
@@@ -20190,17 -20066,10 +20192,17 @@@ F:        drivers/char/sonypi.
  F:    drivers/platform/x86/sony-laptop.c
  F:    include/linux/sony-laptop.h
  
 +SOPHGO DEVICETREES
 +M:    Chao Wei <[email protected]>
 +M:    Chen Wang <[email protected]>
 +S:    Maintained
 +F:    arch/riscv/boot/dts/sophgo/
 +F:    Documentation/devicetree/bindings/riscv/sophgo.yaml
 +
  SOUND
  M:    Jaroslav Kysela <[email protected]>
  M:    Takashi Iwai <[email protected]>
 -L:    [email protected] (moderated for non-subscribers)
 +L:    [email protected]
  S:    Maintained
  W:    http://www.alsa-project.org/
  Q:    http://patchwork.kernel.org/project/alsa-devel/list/
@@@ -20213,7 -20082,7 +20215,7 @@@ F:   tools/testing/selftests/als
  
  SOUND - ALSA SELFTESTS
  M:    Mark Brown <[email protected]>
 -L:    [email protected] (moderated for non-subscribers)
 +L:    [email protected]
  L:    [email protected]
  S:    Supported
  F:    tools/testing/selftests/alsa
@@@ -20239,7 -20108,7 +20241,7 @@@ F:   sound/soc/soc-generic-dmaengine-pcm.
  SOUND - SOC LAYER / DYNAMIC AUDIO POWER MANAGEMENT (ASoC)
  M:    Liam Girdwood <[email protected]>
  M:    Mark Brown <[email protected]>
 -L:    [email protected] (moderated for non-subscribers)
 +L:    [email protected]
  S:    Supported
  W:    http://alsa-project.org/main/index.php/ASoC
  T:    git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
@@@ -20247,10 -20116,6 +20249,10 @@@ F: Documentation/devicetree/bindings/so
  F:    Documentation/sound/soc/
  F:    include/dt-bindings/sound/
  F:    include/sound/soc*
 +F:    include/sound/sof.h
 +F:    include/sound/sof/
 +F:    include/trace/events/sof*.h
 +F:    include/uapi/sound/asoc.h
  F:    sound/soc/
  
  SOUND - SOUND OPEN FIRMWARE (SOF) DRIVERS
@@@ -20604,13 -20469,6 +20606,13 @@@ S: Supporte
  F:    Documentation/devicetree/bindings/clock/starfive,jh7110-pll.yaml
  F:    drivers/clk/starfive/clk-starfive-jh7110-pll.c
  
 +STARFIVE JH7110 PWMDAC DRIVER
 +M:    Hal Feng <[email protected]>
 +M:    Xingyu Wu <[email protected]>
 +S:    Supported
 +F:    Documentation/devicetree/bindings/sound/starfive,jh7110-pwmdac.yaml
 +F:    sound/soc/starfive/jh7110_pwmdac.c
 +
  STARFIVE JH7110 SYSCON
  M:    William Qiu <[email protected]>
  M:    Xingyu Wu <[email protected]>
@@@ -20634,7 -20492,6 +20636,7 @@@ F:   include/dt-bindings/clock/starfive?j
  STARFIVE JH71X0 PINCTRL DRIVERS
  M:    Emil Renner Berthing <[email protected]>
  M:    Jianlong Huang <[email protected]>
 +M:    Hal Feng <[email protected]>
  L:    [email protected]
  S:    Maintained
  F:    Documentation/devicetree/bindings/pinctrl/starfive,jh71*.yaml
@@@ -20658,10 -20515,9 +20660,10 @@@ F: drivers/usb/cdns3/cdns3-starfive.
  
  STARFIVE JH71XX PMU CONTROLLER DRIVER
  M:    Walker Chen <[email protected]>
 +M:    Changhuang Liang <[email protected]>
  S:    Supported
  F:    Documentation/devicetree/bindings/power/starfive*
 -F:    drivers/pmdomain/starfive/jh71xx-pmu.c
 +F:    drivers/pmdomain/starfive/
  F:    include/dt-bindings/power/starfive,jh7110-pmu.h
  
  STARFIVE SOC DRIVERS
@@@ -20669,6 -20525,7 +20671,6 @@@ M:   Conor Dooley <[email protected]
  S:    Maintained
  T:    git https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git/
  F:    Documentation/devicetree/bindings/soc/starfive/
 -F:    drivers/soc/starfive/
  
  STARFIVE TRNG DRIVER
  M:    Jia Jie Ho <[email protected]>
@@@ -21049,7 -20906,6 +21051,7 @@@ F:   drivers/clk/clk-sc[mp]i.
  F:    drivers/cpufreq/sc[mp]i-cpufreq.c
  F:    drivers/firmware/arm_scmi/
  F:    drivers/firmware/arm_scpi.c
 +F:    drivers/pmdomain/arm/
  F:    drivers/powercap/arm_scmi_powercap.c
  F:    drivers/regulator/scmi-regulator.c
  F:    drivers/reset/reset-scmi.c
@@@ -21509,8 -21365,8 +21511,8 @@@ F:   drivers/media/radio/radio-raremono.
  THERMAL
  M:    Rafael J. Wysocki <[email protected]>
  M:    Daniel Lezcano <[email protected]>
 -R:    Amit Kucheria <[email protected]>
  R:    Zhang Rui <[email protected]>
 +R:    Lukasz Luba <[email protected]>
  L:    [email protected]
  S:    Supported
  Q:    https://patchwork.kernel.org/project/linux-pm/list/
  S:    Orphan
  W:    https://wireless.wiki.kernel.org/en/users/Drivers/wl12xx
  W:    https://wireless.wiki.kernel.org/en/users/Drivers/wl1251
 -T:    git git://git.kernel.org/pub/scm/linux/kernel/git/luca/wl12xx.git
  F:    drivers/net/wireless/ti/
  
  TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER
@@@ -21977,11 -21834,9 +21979,11 @@@ W: https://www.tq-group.com/en/products
  F:    arch/arm/boot/dts/imx*mba*.dts*
  F:    arch/arm/boot/dts/imx*tqma*.dts*
  F:    arch/arm/boot/dts/mba*.dtsi
 +F:    arch/arm64/boot/dts/freescale/fsl-*tqml*.dts*
  F:    arch/arm64/boot/dts/freescale/imx*mba*.dts*
  F:    arch/arm64/boot/dts/freescale/imx*tqma*.dts*
  F:    arch/arm64/boot/dts/freescale/mba*.dtsi
 +F:    arch/arm64/boot/dts/freescale/tqml*.dts*
  F:    drivers/gpio/gpio-tqmx86.c
  F:    drivers/mfd/tqmx86.c
  F:    drivers/watchdog/tqmx86_wdt.c
  L:    [email protected]
  S:    Maintained
  T:    git git://anongit.freedesktop.org/drm/drm-misc
 +F:    drivers/gpu/drm/ci/xfails/virtio*
  F:    drivers/gpu/drm/virtio/
  F:    include/uapi/linux/virtio_gpu.h
  
@@@ -23085,7 -22939,7 +23087,7 @@@ F:   fs/vboxsf/
  
  VIRTUAL PCM TEST DRIVER
  M:    Ivan Orlov <[email protected]>
 -L:    alsa-devel@alsa-project.org
 +L:    [email protected].org
  S:    Maintained
  F:    Documentation/sound/cards/pcmtest.rst
  F:    sound/drivers/pcmtest.c
@@@ -23188,7 -23042,7 +23190,7 @@@ F:   drivers/scsi/vmw_pvscsi.
  F:    drivers/scsi/vmw_pvscsi.h
  
  VMWARE VIRTUAL PTP CLOCK DRIVER
 -M:    Deep Shah <sdeep@vmware.com>
 +M:    Jeff Sipek <jsipek@vmware.com>
  R:    Ajay Kaher <[email protected]>
  R:    Alexey Makhalov <[email protected]>
  R:    VMware PV-Drivers Reviewers <[email protected]>
@@@ -23835,11 -23689,6 +23837,11 @@@ F: Documentation/devicetree/bindings/gp
  F:    drivers/gpio/gpio-xilinx.c
  F:    drivers/gpio/gpio-zynq.c
  
 +XILINX LL TEMAC ETHERNET DRIVER
 +L:    [email protected]
 +S:    Orphan
 +F:    drivers/net/ethernet/xilinx/ll_temac*
 +
  XILINX PWM DRIVER
  M:    Sean Anderson <[email protected]>
  S:    Maintained
@@@ -23872,13 -23721,6 +23874,13 @@@ F: Documentation/devicetree/bindings/me
  F:    drivers/media/platform/xilinx/
  F:    include/uapi/linux/xilinx-v4l2-controls.h
  
 +XILINX VERSAL EDAC DRIVER
 +M:    Shubhrajyoti Datta <[email protected]>
 +M:    Sai Krishna Potthuri <[email protected]>
 +S:    Maintained
 +F:    Documentation/devicetree/bindings/memory-controllers/xlnx,versal-ddrmc-edac.yaml
 +F:    drivers/edac/versal_edac.c
 +
  XILINX WATCHDOG DRIVER
  M:    Srinivas Neeli <[email protected]>
  R:    Shubhrajyoti Datta <[email protected]>
diff --combined arch/arm64/kernel/mte.c
index 2fb5e7a7a4d5e27e64a969911b080972f11ea0df,8878b392df58b074dc49cba4d098388f3c80269f..a41ef3213e1e9560ccfbf98d96443160eea2e784
@@@ -35,10 -35,10 +35,10 @@@ DEFINE_STATIC_KEY_FALSE(mte_async_or_as
  EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
  #endif
  
 -void mte_sync_tags(pte_t pte)
 +void mte_sync_tags(pte_t pte, unsigned int nr_pages)
  {
        struct page *page = pte_page(pte);
 -      long i, nr_pages = compound_nr(page);
 +      unsigned int i;
  
        /* if PG_mte_tagged is set, tags have already been initialised */
        for (i = 0; i < nr_pages; i++, page++) {
@@@ -411,8 -411,8 +411,8 @@@ static int __access_remote_tags(struct 
                struct page *page = get_user_page_vma_remote(mm, addr,
                                                             gup_flags, &vma);
  
-               if (IS_ERR_OR_NULL(page)) {
-                       err = page == NULL ? -EIO : PTR_ERR(page);
+               if (IS_ERR(page)) {
+                       err = PTR_ERR(page);
                        break;
                }
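
The mte_sync_tags() change above moves the page count into the caller's hands instead of deriving it from compound_nr(), and the second hunk drops the NULL branch since the remote page lookup now reports failures as ERR_PTR values only. A minimal caller sketch under the new prototype (the demo_sync_folio_tags() name and the pte_present()/pte_tagged() guard are illustrative assumptions, not code from this diff):

	/* Hypothetical caller: the caller now states how many pages the
	 * mapping covers, e.g. per folio from a set_ptes()-style path. */
	static inline void demo_sync_folio_tags(pte_t pte, unsigned int nr_pages)
	{
		if (pte_present(pte) && pte_tagged(pte))
			mte_sync_tags(pte, nr_pages);	/* tag all nr_pages pages */
	}
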
  
index 50e5ebf9d0a0dc74cb7248e419d619c6294f9021,f03c0a50ec3a27dff3191d4286ee0135c94b3e75..990eb686ca67bd9fe4fdcff11c159d50f5adb32e
@@@ -94,18 -94,17 +94,17 @@@ arch___clear_bit(unsigned long nr, vola
        asm volatile(__ASM_SIZE(btr) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
  }
  
- static __always_inline bool
- arch_clear_bit_unlock_is_negative_byte(long nr, volatile unsigned long *addr)
+ static __always_inline bool arch_xor_unlock_is_negative_byte(unsigned long mask,
+               volatile unsigned long *addr)
  {
        bool negative;
-       asm volatile(LOCK_PREFIX "andb %2,%1"
+       asm volatile(LOCK_PREFIX "xorb %2,%1"
                CC_SET(s)
                : CC_OUT(s) (negative), WBYTE_ADDR(addr)
-               : "ir" ((char) ~(1 << nr)) : "memory");
+               : "iq" ((char)mask) : "memory");
        return negative;
  }
- #define arch_clear_bit_unlock_is_negative_byte                                 \
-       arch_clear_bit_unlock_is_negative_byte
+ #define arch_xor_unlock_is_negative_byte arch_xor_unlock_is_negative_byte
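
For readers without the x86 flag semantics in mind, here is a portable model of what the new primitive provides (a sketch built on a GCC atomic builtin, not the kernel implementation; like the byte-wide XORB above it assumes the mask only touches the low byte):

	#include <stdbool.h>

	/* Atomically XOR mask into *addr and report whether bit 7 (the sign
	 * bit of the low byte) is set in the result. */
	static inline bool model_xor_unlock_is_negative_byte(unsigned long mask,
							     volatile unsigned long *addr)
	{
		unsigned long old = __atomic_fetch_xor(addr, mask, __ATOMIC_RELEASE);

		return ((old ^ mask) & 0x80) != 0;
	}
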
  
  static __always_inline void
  arch___clear_bit_unlock(long nr, volatile unsigned long *addr)
@@@ -293,9 -292,6 +292,9 @@@ static __always_inline unsigned long va
   */
  static __always_inline unsigned long __fls(unsigned long word)
  {
 +      if (__builtin_constant_p(word))
 +              return BITS_PER_LONG - 1 - __builtin_clzl(word);
 +
        asm("bsr %1,%0"
            : "=r" (word)
            : "rm" (word));
@@@ -363,9 -359,6 +362,9 @@@ static __always_inline int fls(unsigne
  {
        int r;
  
 +      if (__builtin_constant_p(x))
 +              return x ? 32 - __builtin_clz(x) : 0;
 +
  #ifdef CONFIG_X86_64
        /*
         * AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
  static __always_inline int fls64(__u64 x)
  {
        int bitpos = -1;
 +
 +      if (__builtin_constant_p(x))
 +              return x ? 64 - __builtin_clzll(x) : 0;
        /*
         * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
         * dest reg is undefined if x==0, but their CPU architect says its
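
The __builtin_constant_p() additions above let constant arguments resolve at compile time instead of going through the BSR instruction. A standalone userspace illustration of the same pattern (demo_fls() is a toy, not the kernel helper):

	#include <stdio.h>

	static inline int demo_fls(unsigned int x)
	{
		int r = 0;

		if (__builtin_constant_p(x))
			return x ? 32 - __builtin_clz(x) : 0;	/* folded by the compiler */

		/* runtime fallback; the kernel uses BSR inline asm here */
		while (x) {
			x >>= 1;
			r++;
		}
		return r;
	}

	int main(void)
	{
		printf("%d %d %d\n", demo_fls(0), demo_fls(1), demo_fls(0x80000000u));
		/* prints: 0 1 32 */
		return 0;
	}
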
diff --combined arch/x86/kvm/mmu/mmu.c
index b0f01d60561726450637e673bf66ac8b0bd0e52e,077d01d931dede6dab2fafb08e91debc3c14ad6c..c57e181bba21b43f4f15e8d36e906d57e499efb5
@@@ -3425,8 -3425,8 +3425,8 @@@ static int fast_page_fault(struct kvm_v
  {
        struct kvm_mmu_page *sp;
        int ret = RET_PF_INVALID;
 -      u64 spte = 0ull;
 -      u64 *sptep = NULL;
 +      u64 spte;
 +      u64 *sptep;
        uint retry_count = 0;
  
        if (!page_fault_can_be_fast(fault))
                else
                        sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
  
 +              /*
 +               * It's entirely possible for the mapping to have been zapped
 +               * by a different task, but the root page should always be
 +               * available as the vCPU holds a reference to its root(s).
 +               */
 +              if (WARN_ON_ONCE(!sptep))
 +                      spte = REMOVED_SPTE;
 +
                if (!is_shadow_present_pte(spte))
                        break;
  
@@@ -4487,28 -4479,21 +4487,28 @@@ out_unlock
  }
  #endif
  
 -int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 +bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
  {
        /*
 -       * If the guest's MTRRs may be used to compute the "real" memtype,
 -       * restrict the mapping level to ensure KVM uses a consistent memtype
 -       * across the entire mapping.  If the host MTRRs are ignored by TDP
 -       * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
 -       * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
 -       * from the guest's MTRRs so that guest accesses to memory that is
 -       * DMA'd aren't cached against the guest's wishes.
 +       * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
 +       * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
 +       * to honor the memtype from the guest's MTRRs so that guest accesses
 +       * to memory that is DMA'd aren't cached against the guest's wishes.
         *
         * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
         * e.g. KVM will force UC memtype for host MMIO.
         */
 -      if (shadow_memtype_mask && kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
 +      return vm_has_noncoherent_dma && shadow_memtype_mask;
 +}
 +
 +int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 +{
 +      /*
 +       * If the guest's MTRRs may be used to compute the "real" memtype,
 +       * restrict the mapping level to ensure KVM uses a consistent memtype
 +       * across the entire mapping.
 +       */
 +      if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
                for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
                        int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
                        gfn_t base = gfn_round_for_level(fault->gfn,
@@@ -6800,11 -6785,7 +6800,7 @@@ static unsigned long mmu_shrink_count(s
        return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
  }
  
- static struct shrinker mmu_shrinker = {
-       .count_objects = mmu_shrink_count,
-       .scan_objects = mmu_shrink_scan,
-       .seeks = DEFAULT_SEEKS * 10,
- };
+ static struct shrinker *mmu_shrinker;
  
  static void mmu_destroy_caches(void)
  {
@@@ -6937,10 -6918,16 +6933,16 @@@ int kvm_mmu_vendor_module_init(void
        if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
                goto out;
  
-       ret = register_shrinker(&mmu_shrinker, "x86-mmu");
-       if (ret)
+       mmu_shrinker = shrinker_alloc(0, "x86-mmu");
+       if (!mmu_shrinker)
                goto out_shrinker;
  
+       mmu_shrinker->count_objects = mmu_shrink_count;
+       mmu_shrinker->scan_objects = mmu_shrink_scan;
+       mmu_shrinker->seeks = DEFAULT_SEEKS * 10;
+       shrinker_register(mmu_shrinker);
        return 0;
  
  out_shrinker:
@@@ -6962,7 -6949,7 +6964,7 @@@ void kvm_mmu_vendor_module_exit(void
  {
        mmu_destroy_caches();
        percpu_counter_destroy(&kvm_total_used_mmu_pages);
-       unregister_shrinker(&mmu_shrinker);
+       shrinker_free(mmu_shrinker);
  }
  
  /*
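
The mmu_shrinker hunks above follow the new dynamically allocated shrinker API from the lockless slab shrink work: allocate, fill in callbacks, register, and later free. A minimal registration sketch using the same calls (all demo_* names are hypothetical and the callbacks are stubs):

	static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
	{
		return SHRINK_EMPTY;			/* stub: nothing cached */
	}

	static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
	{
		return SHRINK_STOP;			/* stub: reclaim nothing */
	}

	static struct shrinker *demo_shrinker;

	static int demo_shrinker_init(void)
	{
		demo_shrinker = shrinker_alloc(0, "demo");
		if (!demo_shrinker)
			return -ENOMEM;

		demo_shrinker->count_objects = demo_count;
		demo_shrinker->scan_objects  = demo_scan;
		demo_shrinker->seeks         = DEFAULT_SEEKS;
		shrinker_register(demo_shrinker);	/* visible to reclaim from here on */
		return 0;
	}

	static void demo_shrinker_exit(void)
	{
		shrinker_free(demo_shrinker);		/* replaces unregister_shrinker() */
	}
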
diff --combined drivers/acpi/acpi_pad.c
index 32a2c006908b8fe1a32fc505f89ece2996b8e92a,7f073ca64f0e9f172935ed1e76e16e53aa4d32e9..bd1ad07f0290738be04fee20b33995e06a505f4c
@@@ -18,7 -18,6 +18,7 @@@
  #include <linux/slab.h>
  #include <linux/acpi.h>
  #include <linux/perf_event.h>
 +#include <linux/platform_device.h>
  #include <asm/mwait.h>
  #include <xen/xen.h>
  
@@@ -101,7 -100,7 +101,7 @@@ static void round_robin_cpu(unsigned in
        for_each_cpu(cpu, pad_busy_cpus)
                cpumask_or(tmp, tmp, topology_sibling_cpumask(cpu));
        cpumask_andnot(tmp, cpu_online_mask, tmp);
-       /* avoid HT sibilings if possible */
+       /* avoid HT siblings if possible */
        if (cpumask_empty(tmp))
                cpumask_andnot(tmp, cpu_online_mask, pad_busy_cpus);
        if (cpumask_empty(tmp)) {
@@@ -337,14 -336,33 +337,14 @@@ static ssize_t idlecpus_show(struct dev
  
  static DEVICE_ATTR_RW(idlecpus);
  
 -static int acpi_pad_add_sysfs(struct acpi_device *device)
 -{
 -      int result;
 -
 -      result = device_create_file(&device->dev, &dev_attr_idlecpus);
 -      if (result)
 -              return -ENODEV;
 -      result = device_create_file(&device->dev, &dev_attr_idlepct);
 -      if (result) {
 -              device_remove_file(&device->dev, &dev_attr_idlecpus);
 -              return -ENODEV;
 -      }
 -      result = device_create_file(&device->dev, &dev_attr_rrtime);
 -      if (result) {
 -              device_remove_file(&device->dev, &dev_attr_idlecpus);
 -              device_remove_file(&device->dev, &dev_attr_idlepct);
 -              return -ENODEV;
 -      }
 -      return 0;
 -}
 +static struct attribute *acpi_pad_attrs[] = {
 +      &dev_attr_idlecpus.attr,
 +      &dev_attr_idlepct.attr,
 +      &dev_attr_rrtime.attr,
 +      NULL
 +};
  
 -static void acpi_pad_remove_sysfs(struct acpi_device *device)
 -{
 -      device_remove_file(&device->dev, &dev_attr_idlecpus);
 -      device_remove_file(&device->dev, &dev_attr_idlepct);
 -      device_remove_file(&device->dev, &dev_attr_rrtime);
 -}
 +ATTRIBUTE_GROUPS(acpi_pad);
  
  /*
   * Query firmware how many CPUs should be idle
@@@ -398,13 -416,13 +398,13 @@@ static void acpi_pad_handle_notify(acpi
  static void acpi_pad_notify(acpi_handle handle, u32 event,
        void *data)
  {
 -      struct acpi_device *device = data;
 +      struct acpi_device *adev = data;
  
        switch (event) {
        case ACPI_PROCESSOR_AGGREGATOR_NOTIFY:
                acpi_pad_handle_notify(handle);
 -              acpi_bus_generate_netlink_event(device->pnp.device_class,
 -                      dev_name(&device->dev), event, 0);
 +              acpi_bus_generate_netlink_event(adev->pnp.device_class,
 +                      dev_name(&adev->dev), event, 0);
                break;
        default:
                pr_warn("Unsupported event [0x%x]\n", event);
        }
  }
  
 -static int acpi_pad_add(struct acpi_device *device)
 +static int acpi_pad_probe(struct platform_device *pdev)
  {
 +      struct acpi_device *adev = ACPI_COMPANION(&pdev->dev);
        acpi_status status;
  
 -      strcpy(acpi_device_name(device), ACPI_PROCESSOR_AGGREGATOR_DEVICE_NAME);
 -      strcpy(acpi_device_class(device), ACPI_PROCESSOR_AGGREGATOR_CLASS);
 +      strcpy(acpi_device_name(adev), ACPI_PROCESSOR_AGGREGATOR_DEVICE_NAME);
 +      strcpy(acpi_device_class(adev), ACPI_PROCESSOR_AGGREGATOR_CLASS);
  
 -      if (acpi_pad_add_sysfs(device))
 -              return -ENODEV;
 +      status = acpi_install_notify_handler(adev->handle,
 +              ACPI_DEVICE_NOTIFY, acpi_pad_notify, adev);
  
 -      status = acpi_install_notify_handler(device->handle,
 -              ACPI_DEVICE_NOTIFY, acpi_pad_notify, device);
 -      if (ACPI_FAILURE(status)) {
 -              acpi_pad_remove_sysfs(device);
 +      if (ACPI_FAILURE(status))
                return -ENODEV;
 -      }
  
        return 0;
  }
  
 -static void acpi_pad_remove(struct acpi_device *device)
 +static void acpi_pad_remove(struct platform_device *pdev)
  {
 +      struct acpi_device *adev = ACPI_COMPANION(&pdev->dev);
 +
        mutex_lock(&isolated_cpus_lock);
        acpi_pad_idle_cpus(0);
        mutex_unlock(&isolated_cpus_lock);
  
 -      acpi_remove_notify_handler(device->handle,
 +      acpi_remove_notify_handler(adev->handle,
                ACPI_DEVICE_NOTIFY, acpi_pad_notify);
 -      acpi_pad_remove_sysfs(device);
  }
  
  static const struct acpi_device_id pad_device_ids[] = {
  };
  MODULE_DEVICE_TABLE(acpi, pad_device_ids);
  
 -static struct acpi_driver acpi_pad_driver = {
 -      .name = "processor_aggregator",
 -      .class = ACPI_PROCESSOR_AGGREGATOR_CLASS,
 -      .ids = pad_device_ids,
 -      .ops = {
 -              .add = acpi_pad_add,
 -              .remove = acpi_pad_remove,
 +static struct platform_driver acpi_pad_driver = {
 +      .probe = acpi_pad_probe,
 +      .remove_new = acpi_pad_remove,
 +      .driver = {
 +              .dev_groups = acpi_pad_groups,
 +              .name = "processor_aggregator",
 +              .acpi_match_table = pad_device_ids,
        },
  };
  
@@@ -467,12 -487,12 +467,12 @@@ static int __init acpi_pad_init(void
        if (power_saving_mwait_eax == 0)
                return -EINVAL;
  
 -      return acpi_bus_register_driver(&acpi_pad_driver);
 +      return platform_driver_register(&acpi_pad_driver);
  }
  
  static void __exit acpi_pad_exit(void)
  {
 -      acpi_bus_unregister_driver(&acpi_pad_driver);
 +      platform_driver_unregister(&acpi_pad_driver);
  }
  
  module_init(acpi_pad_init);
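
The sysfs rework above trades open-coded device_create_file()/device_remove_file() calls for an attribute group hooked up through .dev_groups, so the driver core creates and removes the files for the device's whole lifetime. A stripped-down sketch of the same pattern (the demo driver name and the example attribute are hypothetical):

	static ssize_t example_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
	{
		return sysfs_emit(buf, "%d\n", 42);	/* placeholder value */
	}
	static DEVICE_ATTR_RO(example);

	static struct attribute *demo_attrs[] = {
		&dev_attr_example.attr,
		NULL
	};
	ATTRIBUTE_GROUPS(demo);				/* generates demo_groups[] */

	static struct platform_driver demo_driver = {
		.driver = {
			.name       = "demo",
			.dev_groups = demo_groups,	/* files managed by the driver core */
		},
	};
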
index 135278ddaf627bb1fd41ba6062a3596e52bd8f72,79ba576b22e3c9d0cf50f74f9e1197f95759b17a..3f2f7bf6e33526edeaa3a74288d0b14b79013aaf
@@@ -3,19 -3,12 +3,20 @@@
  #include <linux/efi.h>
  #include <linux/memblock.h>
  #include <linux/spinlock.h>
+ #include <linux/crash_dump.h>
  #include <asm/unaccepted_memory.h>
  
 -/* Protects unaccepted memory bitmap */
 +/* Protects unaccepted memory bitmap and accepting_list */
  static DEFINE_SPINLOCK(unaccepted_memory_lock);
  
 +struct accept_range {
 +      struct list_head list;
 +      unsigned long start;
 +      unsigned long end;
 +};
 +
 +static LIST_HEAD(accepting_list);
 +
  /*
   * accept_memory() -- Consult bitmap and accept the memory if needed.
   *
@@@ -32,7 -25,6 +33,7 @@@ void accept_memory(phys_addr_t start, p
  {
        struct efi_unaccepted_memory *unaccepted;
        unsigned long range_start, range_end;
 +      struct accept_range range, *entry;
        unsigned long flags;
        u64 unit_size;
  
        if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
                end = unaccepted->size * unit_size * BITS_PER_BYTE;
  
 -      range_start = start / unit_size;
 -
 +      range.start = start / unit_size;
 +      range.end = DIV_ROUND_UP(end, unit_size);
 +retry:
        spin_lock_irqsave(&unaccepted_memory_lock, flags);
 +
 +      /*
 +       * Check if anybody else is already accepting the same range of memory.
 +       *
 +       * The check is done with unit_size granularity. It is crucial to catch
 +       * all accept requests to the same unit_size block, even if they don't
 +       * overlap on physical address level.
 +       */
 +      list_for_each_entry(entry, &accepting_list, list) {
 +              if (entry->end < range.start)
 +                      continue;
 +              if (entry->start >= range.end)
 +                      continue;
 +
 +              /*
 +               * Somebody else accepting the range. Or at least part of it.
 +               *
 +               * Drop the lock and retry until it is complete.
 +               */
 +              spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 +              goto retry;
 +      }
 +
 +      /*
 +       * Register that the range is about to be accepted.
 +       * Make sure nobody else will accept it.
 +       */
 +      list_add(&range.list, &accepting_list);
 +
 +      range_start = range.start;
        for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
 -                                 DIV_ROUND_UP(end, unit_size)) {
 +                                 range.end) {
                unsigned long phys_start, phys_end;
                unsigned long len = range_end - range_start;
  
                phys_start = range_start * unit_size + unaccepted->phys_base;
                phys_end = range_end * unit_size + unaccepted->phys_base;
  
 +              /*
 +               * Keep interrupts disabled until the accept operation is
 +               * complete in order to prevent deadlocks.
 +               *
 +               * Enabling interrupts before calling arch_accept_memory()
 +               * creates an opportunity for an interrupt handler to request
 +               * acceptance for the same memory. The handler will continuously
 +               * spin with interrupts disabled, preventing other tasks from
 +               * making progress with the acceptance process.
 +               */
 +              spin_unlock(&unaccepted_memory_lock);
 +
                arch_accept_memory(phys_start, phys_end);
 +
 +              spin_lock(&unaccepted_memory_lock);
                bitmap_clear(unaccepted->bitmap, range_start, len);
        }
 +
 +      list_del(&range.list);
        spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
  }
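
A worked example of the unit arithmetic accept_memory() performs above, assuming a 2 MiB unit_size (the real value comes from the EFI unaccepted-memory table); the requested range is widened outward to whole units before the bitmap walk:

	static unsigned long demo_units_to_accept(void)
	{
		unsigned long unit_size = 0x200000;			/* assumed: 2 MiB */
		unsigned long start = 0x340000, end = 0x7c0000;		/* example request */
		unsigned long first = start / unit_size;		/* 1 (rounds down) */
		unsigned long last  = DIV_ROUND_UP(end, unit_size);	/* 4 (exclusive, rounds up) */

		return last - first;	/* 3 units: offsets 2 MiB..8 MiB from phys_base */
	}
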
  
@@@ -201,3 -146,22 +202,22 @@@ bool range_contains_unaccepted_memory(p
  
        return ret;
  }
+ #ifdef CONFIG_PROC_VMCORE
+ static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
+                                               unsigned long pfn)
+ {
+       return !pfn_is_unaccepted_memory(pfn);
+ }
+ static struct vmcore_cb vmcore_cb = {
+       .pfn_is_ram = unaccepted_memory_vmcore_pfn_is_ram,
+ };
+ static int __init unaccepted_memory_init_kdump(void)
+ {
+       register_vmcore_cb(&vmcore_cb);
+       return 0;
+ }
+ core_initcall(unaccepted_memory_init_kdump);
+ #endif /* CONFIG_PROC_VMCORE */
index 9cb7bbfb427829c553d65132168a6490a3be4e02,e07ffbd9eab3ba8d973e5bac5bf31fbc2833a755..d166052eb2ce3f2b0471cec3135ef3bcda2b7aee
@@@ -14,7 -14,6 +14,7 @@@
  #include <linux/vmalloc.h>
  
  #include "gt/intel_gt_requests.h"
 +#include "gt/intel_gt.h"
  
  #include "i915_trace.h"
  
@@@ -120,8 -119,7 +120,8 @@@ i915_gem_shrink(struct i915_gem_ww_ctx 
        intel_wakeref_t wakeref = 0;
        unsigned long count = 0;
        unsigned long scanned = 0;
 -      int err = 0;
 +      int err = 0, i = 0;
 +      struct intel_gt *gt;
  
        /* CHV + VTD workaround use stop_machine(); need to trylock vm->mutex */
        bool trylock_vm = !ww && intel_vm_no_concurrent_access_wa(i915);
         * what we can do is give them a kick so that we do not keep idle
         * contexts around longer than is necessary.
         */
 -      if (shrink & I915_SHRINK_ACTIVE)
 -              /* Retire requests to unpin all idle contexts */
 -              intel_gt_retire_requests(to_gt(i915));
 +      if (shrink & I915_SHRINK_ACTIVE) {
 +              for_each_gt(gt, i915, i)
 +                      /* Retire requests to unpin all idle contexts */
 +                      intel_gt_retire_requests(gt);
 +      }
  
        /*
         * As we may completely rewrite the (un)bound list whilst unbinding
@@@ -288,8 -284,7 +288,7 @@@ unsigned long i915_gem_shrink_all(struc
  static unsigned long
  i915_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
  {
-       struct drm_i915_private *i915 =
-               container_of(shrinker, struct drm_i915_private, mm.shrinker);
+       struct drm_i915_private *i915 = shrinker->private_data;
        unsigned long num_objects;
        unsigned long count;
  
        if (num_objects) {
                unsigned long avg = 2 * count / num_objects;
  
-               i915->mm.shrinker.batch =
-                       max((i915->mm.shrinker.batch + avg) >> 1,
+               i915->mm.shrinker->batch =
+                       max((i915->mm.shrinker->batch + avg) >> 1,
                            128ul /* default SHRINK_BATCH */);
        }
  
  static unsigned long
  i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
  {
-       struct drm_i915_private *i915 =
-               container_of(shrinker, struct drm_i915_private, mm.shrinker);
+       struct drm_i915_private *i915 = shrinker->private_data;
        unsigned long freed;
  
        sc->nr_scanned = 0;
@@@ -393,8 -387,6 +391,8 @@@ i915_gem_shrinker_vmap(struct notifier_
        struct i915_vma *vma, *next;
        unsigned long freed_pages = 0;
        intel_wakeref_t wakeref;
 +      struct intel_gt *gt;
 +      int i;
  
        with_intel_runtime_pm(&i915->runtime_pm, wakeref)
                freed_pages += i915_gem_shrink(NULL, i915, -1UL, NULL,
                                               I915_SHRINK_VMAPS);
  
        /* We also want to clear any cached iomaps as they wrap vmap */
 -      mutex_lock(&to_gt(i915)->ggtt->vm.mutex);
 -      list_for_each_entry_safe(vma, next,
 -                               &to_gt(i915)->ggtt->vm.bound_list, vm_link) {
 -              unsigned long count = i915_vma_size(vma) >> PAGE_SHIFT;
 -              struct drm_i915_gem_object *obj = vma->obj;
 -
 -              if (!vma->iomap || i915_vma_is_active(vma))
 -                      continue;
 +      for_each_gt(gt, i915, i) {
 +              mutex_lock(&gt->ggtt->vm.mutex);
 +              list_for_each_entry_safe(vma, next,
 +                                       &gt->ggtt->vm.bound_list, vm_link) {
 +                      unsigned long count = i915_vma_size(vma) >> PAGE_SHIFT;
 +                      struct drm_i915_gem_object *obj = vma->obj;
 +
 +                      if (!vma->iomap || i915_vma_is_active(vma))
 +                              continue;
  
 -              if (!i915_gem_object_trylock(obj, NULL))
 -                      continue;
 +                      if (!i915_gem_object_trylock(obj, NULL))
 +                              continue;
  
 -              if (__i915_vma_unbind(vma) == 0)
 -                      freed_pages += count;
 +                      if (__i915_vma_unbind(vma) == 0)
 +                              freed_pages += count;
  
 -              i915_gem_object_unlock(obj);
 +                      i915_gem_object_unlock(obj);
 +              }
 +              mutex_unlock(&gt->ggtt->vm.mutex);
        }
 -      mutex_unlock(&to_gt(i915)->ggtt->vm.mutex);
  
        *(unsigned long *)ptr += freed_pages;
        return NOTIFY_DONE;
  
  void i915_gem_driver_register__shrinker(struct drm_i915_private *i915)
  {
-       i915->mm.shrinker.scan_objects = i915_gem_shrinker_scan;
-       i915->mm.shrinker.count_objects = i915_gem_shrinker_count;
-       i915->mm.shrinker.seeks = DEFAULT_SEEKS;
-       i915->mm.shrinker.batch = 4096;
-       drm_WARN_ON(&i915->drm, register_shrinker(&i915->mm.shrinker,
-                                                 "drm-i915_gem"));
+       i915->mm.shrinker = shrinker_alloc(0, "drm-i915_gem");
+       if (!i915->mm.shrinker) {
+               drm_WARN_ON(&i915->drm, 1);
+       } else {
+               i915->mm.shrinker->scan_objects = i915_gem_shrinker_scan;
+               i915->mm.shrinker->count_objects = i915_gem_shrinker_count;
+               i915->mm.shrinker->batch = 4096;
+               i915->mm.shrinker->private_data = i915;
+               shrinker_register(i915->mm.shrinker);
+       }
  
        i915->mm.oom_notifier.notifier_call = i915_gem_shrinker_oom;
        drm_WARN_ON(&i915->drm, register_oom_notifier(&i915->mm.oom_notifier));
@@@ -451,7 -446,7 +454,7 @@@ void i915_gem_driver_unregister__shrink
                    unregister_vmap_purge_notifier(&i915->mm.vmap_notifier));
        drm_WARN_ON(&i915->drm,
                    unregister_oom_notifier(&i915->mm.oom_notifier));
-       unregister_shrinker(&i915->mm.shrinker);
+       shrinker_free(i915->mm.shrinker);
  }
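
With the heap-allocated shrinker above, the callbacks recover the per-device state from shrinker->private_data rather than container_of() on an embedded struct shrinker. A minimal count callback sketch under that convention (struct demo_device and its nr_reclaimable field are hypothetical):

	static unsigned long demo_count(struct shrinker *shrinker,
					struct shrink_control *sc)
	{
		struct demo_device *ddev = shrinker->private_data;	/* set beside shrinker_alloc() */
		unsigned long nr = READ_ONCE(ddev->nr_reclaimable);	/* hypothetical counter */

		return nr ? nr : SHRINK_EMPTY;
	}
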
  
  void i915_gem_shrinker_taints_mutex(struct drm_i915_private *i915,
index 6a2a78c61f212c7c8178012ab6676015903e271a,f2f21da4d7f915755d2c6d440613a1bbd1f4f68d..dd452c220df72e9744a19a999a8afa508daf0036
@@@ -163,7 -163,7 +163,7 @@@ struct i915_gem_mm 
  
        struct notifier_block oom_notifier;
        struct notifier_block vmap_notifier;
-       struct shrinker shrinker;
+       struct shrinker *shrinker;
  
  #ifdef CONFIG_MMU_NOTIFIER
        /**
@@@ -222,22 -222,7 +222,22 @@@ struct drm_i915_private 
                bool mchbar_need_disable;
        } gmch;
  
 -      struct rb_root uabi_engines;
 +      /*
 +       * Chaining user engines happens in multiple stages, starting with a
 +       * simple lock-less linked list created by intel_engine_add_user(),
 +       * which later gets sorted and converted to an intermediate regular
 +       * list, just to be converted once again to its final rb tree structure
 +       * in intel_engines_driver_register().
 +       *
 +       * Make sure to use the right iterator helper, depending on if the code
 +       * in question runs before or after intel_engines_driver_register() --
 +       * for_each_uabi_engine() can only be used afterwards!
 +       */
 +      union {
 +              struct llist_head uabi_engines_llist;
 +              struct list_head uabi_engines_list;
 +              struct rb_root uabi_engines;
 +      };
        unsigned int engine_uabi_class_count[I915_LAST_UABI_ENGINE_CLASS + 1];
  
        /* protects the irq masks */
  
        struct i915_hwmon *hwmon;
  
 -      /* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
 -      struct intel_gt gt0;
 -
 -      /*
 -       * i915->gt[0] == &i915->gt0
 -       */
        struct intel_gt *gt[I915_MAX_GT];
  
        struct kobject *sysfs_gt;
@@@ -391,9 -382,9 +391,9 @@@ static inline struct drm_i915_private *
        return pci_get_drvdata(pdev);
  }
  
 -static inline struct intel_gt *to_gt(struct drm_i915_private *i915)
 +static inline struct intel_gt *to_gt(const struct drm_i915_private *i915)
  {
 -      return &i915->gt0;
 +      return i915->gt[0];
  }
  
  /* Simple iterator over all initialised engines */
  
  #define INTEL_INFO(i915)      ((i915)->__info)
  #define RUNTIME_INFO(i915)    (&(i915)->__runtime)
 -#define DISPLAY_INFO(i915)    ((i915)->display.info.__device_info)
 -#define DISPLAY_RUNTIME_INFO(i915)    (&(i915)->display.info.__runtime_info)
  #define DRIVER_CAPS(i915)     (&(i915)->caps)
  
  #define INTEL_DEVID(i915)     (RUNTIME_INFO(i915)->device_id)
  #define IS_MEDIA_VER(i915, from, until) \
        (MEDIA_VER(i915) >= (from) && MEDIA_VER(i915) <= (until))
  
 -#define DISPLAY_VER(i915)     (DISPLAY_RUNTIME_INFO(i915)->ip.ver)
 -#define IS_DISPLAY_VER(i915, from, until) \
 -      (DISPLAY_VER(i915) >= (from) && DISPLAY_VER(i915) <= (until))
 -
  #define INTEL_REVID(i915)     (to_pci_dev((i915)->drm.dev)->revision)
  
  #define INTEL_DISPLAY_STEP(__i915) (RUNTIME_INFO(__i915)->step.display_step)
@@@ -576,6 -573,10 +576,6 @@@ IS_SUBPLATFORM(const struct drm_i915_pr
  #define IS_PONTEVECCHIO(i915) IS_PLATFORM(i915, INTEL_PONTEVECCHIO)
  #define IS_METEORLAKE(i915) IS_PLATFORM(i915, INTEL_METEORLAKE)
  
 -#define IS_METEORLAKE_M(i915) \
 -      IS_SUBPLATFORM(i915, INTEL_METEORLAKE, INTEL_SUBPLATFORM_M)
 -#define IS_METEORLAKE_P(i915) \
 -      IS_SUBPLATFORM(i915, INTEL_METEORLAKE, INTEL_SUBPLATFORM_P)
  #define IS_DG2_G10(i915) \
        IS_SUBPLATFORM(i915, INTEL_DG2, INTEL_SUBPLATFORM_G10)
  #define IS_DG2_G11(i915) \
  #define IS_TIGERLAKE_UY(i915) \
        IS_SUBPLATFORM(i915, INTEL_TIGERLAKE, INTEL_SUBPLATFORM_UY)
  
 -
 -
 -
 -
 -
 -
 -
  #define IS_XEHPSDV_GRAPHICS_STEP(__i915, since, until) \
        (IS_XEHPSDV(__i915) && IS_GRAPHICS_STEP(__i915, since, until))
  
 -#define IS_MTL_GRAPHICS_STEP(__i915, variant, since, until) \
 -      (IS_SUBPLATFORM(__i915, INTEL_METEORLAKE, INTEL_SUBPLATFORM_##variant) && \
 -       IS_GRAPHICS_STEP(__i915, since, until))
 -
 -#define IS_MTL_DISPLAY_STEP(__i915, since, until) \
 -      (IS_METEORLAKE(__i915) && \
 -       IS_DISPLAY_STEP(__i915, since, until))
 -
 -#define IS_MTL_MEDIA_STEP(__i915, since, until) \
 -      (IS_METEORLAKE(__i915) && \
 -       IS_MEDIA_STEP(__i915, since, until))
 -
 -/*
 - * DG2 hardware steppings are a bit unusual.  The hardware design was forked to
 - * create three variants (G10, G11, and G12) which each have distinct
 - * workaround sets.  The G11 and G12 forks of the DG2 design reset the GT
 - * stepping back to "A0" for their first iterations, even though they're more
 - * similar to a G10 B0 stepping and G10 C0 stepping respectively in terms of
 - * functionality and workarounds.  However the display stepping does not reset
 - * in the same manner --- a specific stepping like "B0" has a consistent
 - * meaning regardless of whether it belongs to a G10, G11, or G12 DG2.
 - *
 - * TLDR:  All GT workarounds and stepping-specific logic must be applied in
 - * relation to a specific subplatform (G10/G11/G12), whereas display workarounds
 - * and stepping-specific logic will be applied with a general DG2-wide stepping
 - * number.
 - */
 -#define IS_DG2_GRAPHICS_STEP(__i915, variant, since, until) \
 -      (IS_SUBPLATFORM(__i915, INTEL_DG2, INTEL_SUBPLATFORM_##variant) && \
 -       IS_GRAPHICS_STEP(__i915, since, until))
 -
 -#define IS_DG2_DISPLAY_STEP(__i915, since, until) \
 -      (IS_DG2(__i915) && \
 -       IS_DISPLAY_STEP(__i915, since, until))
 -
  #define IS_PVC_BD_STEP(__i915, since, until) \
        (IS_PONTEVECCHIO(__i915) && \
         IS_BASEDIE_STEP(__i915, since, until))
  #define CMDPARSER_USES_GGTT(i915) (GRAPHICS_VER(i915) == 7)
  
  #define HAS_LLC(i915) (INTEL_INFO(i915)->has_llc)
 -#define HAS_4TILE(i915)       (INTEL_INFO(i915)->has_4tile)
  #define HAS_SNOOP(i915)       (INTEL_INFO(i915)->has_snoop)
  #define HAS_EDRAM(i915)       ((i915)->edram_size_mb)
  #define HAS_SECURE_BATCHES(i915) (GRAPHICS_VER(i915) < 6)
  #define NUM_L3_SLICES(i915) (IS_HASWELL_GT3(i915) ? \
                                 2 : HAS_L3_DPF(i915))
  
 -/* Only valid when HAS_DISPLAY() is true */
 -#define INTEL_DISPLAY_ENABLED(i915) \
 -      (drm_WARN_ON(&(i915)->drm, !HAS_DISPLAY(i915)),         \
 -       !(i915)->params.disable_display &&                             \
 -       !intel_opregion_headless_sku(i915))
 -
  #define HAS_GUC_DEPRIVILEGE(i915) \
        (INTEL_INFO(i915)->has_guc_deprivilege)
  
 +#define HAS_GUC_TLB_INVALIDATION(i915)        (INTEL_INFO(i915)->has_guc_tlb_invalidation)
 +
  #define HAS_3D_PIPELINE(i915) (INTEL_INFO(i915)->has_3d_pipeline)
  
  #define HAS_ONE_EU_PER_FUSE_BIT(i915) (INTEL_INFO(i915)->has_one_eu_per_fuse_bit)
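
The comment on the new uabi_engines union above carries the important constraint: the rb-tree form only exists once intel_engines_driver_register() has run, so for_each_uabi_engine() is safe only after that point. A hypothetical debug helper showing the post-registration iterator (demo_log_uabi_engines() is not part of this diff):

	static void demo_log_uabi_engines(struct drm_i915_private *i915)
	{
		struct intel_engine_cs *engine;

		/* Only valid after intel_engines_driver_register() has built
		 * the final rb tree out of the llist/list staging forms. */
		for_each_uabi_engine(engine, i915)
			drm_dbg(&i915->drm, "uabi engine: %s\n", engine->name);
	}
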
index 443bbc3ed75089110ef9f17f014e99be5209ff60,7f20249d60715b1676e885aa485c94f024fec7d8..2aae7d107f3356e08b55b6b05bf7cf96205318a0
@@@ -7,17 -7,29 +7,17 @@@
  
  #include <linux/dma-mapping.h>
  #include <linux/fault-inject.h>
 -#include <linux/kthread.h>
  #include <linux/of_address.h>
 -#include <linux/sched/mm.h>
  #include <linux/uaccess.h>
 -#include <uapi/linux/sched/types.h>
  
 -#include <drm/drm_aperture.h>
 -#include <drm/drm_bridge.h>
  #include <drm/drm_drv.h>
  #include <drm/drm_file.h>
  #include <drm/drm_ioctl.h>
 -#include <drm/drm_prime.h>
  #include <drm/drm_of.h>
 -#include <drm/drm_vblank.h>
  
 -#include "disp/msm_disp_snapshot.h"
  #include "msm_drv.h"
  #include "msm_debugfs.h"
 -#include "msm_fence.h"
 -#include "msm_gem.h"
 -#include "msm_gpu.h"
  #include "msm_kms.h"
 -#include "msm_mmu.h"
  #include "adreno/adreno_gpu.h"
  
  /*
  
  static void msm_deinit_vram(struct drm_device *ddev);
  
 -static const struct drm_mode_config_funcs mode_config_funcs = {
 -      .fb_create = msm_framebuffer_create,
 -      .atomic_check = msm_atomic_check,
 -      .atomic_commit = drm_atomic_helper_commit,
 -};
 -
 -static const struct drm_mode_config_helper_funcs mode_config_helper_funcs = {
 -      .atomic_commit_tail = msm_atomic_commit_tail,
 -};
 -
  static char *vram = "16m";
  MODULE_PARM_DESC(vram, "Configure VRAM size (for devices without IOMMU/GPUMMU)");
  module_param(vram, charp, 0);
@@@ -61,11 -83,125 +61,11 @@@ DECLARE_FAULT_ATTR(fail_gem_alloc)
  DECLARE_FAULT_ATTR(fail_gem_iova);
  #endif
  
 -static irqreturn_t msm_irq(int irq, void *arg)
 -{
 -      struct drm_device *dev = arg;
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -
 -      BUG_ON(!kms);
 -
 -      return kms->funcs->irq(kms);
 -}
 -
 -static void msm_irq_preinstall(struct drm_device *dev)
 -{
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -
 -      BUG_ON(!kms);
 -
 -      kms->funcs->irq_preinstall(kms);
 -}
 -
 -static int msm_irq_postinstall(struct drm_device *dev)
 -{
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -
 -      BUG_ON(!kms);
 -
 -      if (kms->funcs->irq_postinstall)
 -              return kms->funcs->irq_postinstall(kms);
 -
 -      return 0;
 -}
 -
 -static int msm_irq_install(struct drm_device *dev, unsigned int irq)
 -{
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -      int ret;
 -
 -      if (irq == IRQ_NOTCONNECTED)
 -              return -ENOTCONN;
 -
 -      msm_irq_preinstall(dev);
 -
 -      ret = request_irq(irq, msm_irq, 0, dev->driver->name, dev);
 -      if (ret)
 -              return ret;
 -
 -      kms->irq_requested = true;
 -
 -      ret = msm_irq_postinstall(dev);
 -      if (ret) {
 -              free_irq(irq, dev);
 -              return ret;
 -      }
 -
 -      return 0;
 -}
 -
 -static void msm_irq_uninstall(struct drm_device *dev)
 -{
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -
 -      kms->funcs->irq_uninstall(kms);
 -      if (kms->irq_requested)
 -              free_irq(kms->irq, dev);
 -}
 -
 -struct msm_vblank_work {
 -      struct work_struct work;
 -      struct drm_crtc *crtc;
 -      bool enable;
 -      struct msm_drm_private *priv;
 -};
 -
 -static void vblank_ctrl_worker(struct work_struct *work)
 -{
 -      struct msm_vblank_work *vbl_work = container_of(work,
 -                                              struct msm_vblank_work, work);
 -      struct msm_drm_private *priv = vbl_work->priv;
 -      struct msm_kms *kms = priv->kms;
 -
 -      if (vbl_work->enable)
 -              kms->funcs->enable_vblank(kms, vbl_work->crtc);
 -      else
 -              kms->funcs->disable_vblank(kms, vbl_work->crtc);
 -
 -      kfree(vbl_work);
 -}
 -
 -static int vblank_ctrl_queue_work(struct msm_drm_private *priv,
 -                                      struct drm_crtc *crtc, bool enable)
 -{
 -      struct msm_vblank_work *vbl_work;
 -
 -      vbl_work = kzalloc(sizeof(*vbl_work), GFP_ATOMIC);
 -      if (!vbl_work)
 -              return -ENOMEM;
 -
 -      INIT_WORK(&vbl_work->work, vblank_ctrl_worker);
 -
 -      vbl_work->crtc = crtc;
 -      vbl_work->enable = enable;
 -      vbl_work->priv = priv;
 -
 -      queue_work(priv->wq, &vbl_work->work);
 -
 -      return 0;
 -}
 -
  static int msm_drm_uninit(struct device *dev)
  {
        struct platform_device *pdev = to_platform_device(dev);
        struct msm_drm_private *priv = platform_get_drvdata(pdev);
        struct drm_device *ddev = priv->dev;
 -      struct msm_kms *kms = priv->kms;
 -      int i;
  
        /*
         * Shutdown the hw if we're far enough along where things might be on.
         */
        if (ddev->registered) {
                drm_dev_unregister(ddev);
 -              drm_atomic_helper_shutdown(ddev);
 +              if (priv->kms)
 +                      drm_atomic_helper_shutdown(ddev);
        }
  
        /* We must cancel and cleanup any pending vblank enable/disable
  
        flush_workqueue(priv->wq);
  
 -      /* clean up event worker threads */
 -      for (i = 0; i < priv->num_crtcs; i++) {
 -              if (priv->event_thread[i].worker)
 -                      kthread_destroy_worker(priv->event_thread[i].worker);
 -      }
 -
        msm_gem_shrinker_cleanup(ddev);
  
 -      drm_kms_helper_poll_fini(ddev);
 -
        msm_perf_debugfs_cleanup(priv);
        msm_rd_debugfs_cleanup(priv);
  
 -      if (kms)
 -              msm_disp_snapshot_destroy(ddev);
 -
 -      drm_mode_config_cleanup(ddev);
 -
 -      for (i = 0; i < priv->num_bridges; i++)
 -              drm_bridge_remove(priv->bridges[i]);
 -      priv->num_bridges = 0;
 -
 -      if (kms) {
 -              pm_runtime_get_sync(dev);
 -              msm_irq_uninstall(ddev);
 -              pm_runtime_put_sync(dev);
 -      }
 -
 -      if (kms && kms->funcs)
 -              kms->funcs->destroy(kms);
 +      if (priv->kms)
 +              msm_drm_kms_uninit(dev);
  
        msm_deinit_vram(ddev);
  
        return 0;
  }
  
 -struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
 -{
 -      struct msm_gem_address_space *aspace;
 -      struct msm_mmu *mmu;
 -      struct device *mdp_dev = dev->dev;
 -      struct device *mdss_dev = mdp_dev->parent;
 -      struct device *iommu_dev;
 -
 -      /*
 -       * IOMMUs can be a part of MDSS device tree binding, or the
 -       * MDP/DPU device.
 -       */
 -      if (device_iommu_mapped(mdp_dev))
 -              iommu_dev = mdp_dev;
 -      else
 -              iommu_dev = mdss_dev;
 -
 -      mmu = msm_iommu_new(iommu_dev, 0);
 -      if (IS_ERR(mmu))
 -              return ERR_CAST(mmu);
 -
 -      if (!mmu) {
 -              drm_info(dev, "no IOMMU, fallback to phys contig buffers for scanout\n");
 -              return NULL;
 -      }
 -
 -      aspace = msm_gem_address_space_create(mmu, "mdp_kms",
 -              0x1000, 0x100000000 - 0x1000);
 -      if (IS_ERR(aspace)) {
 -              dev_err(mdp_dev, "aspace create, error %pe\n", aspace);
 -              mmu->funcs->destroy(mmu);
 -      }
 -
 -      return aspace;
 -}
 -
  bool msm_use_mmu(struct drm_device *dev)
  {
        struct msm_drm_private *priv = dev->dev_private;
@@@ -212,6 -406,8 +212,6 @@@ static int msm_drm_init(struct device *
  {
        struct msm_drm_private *priv = dev_get_drvdata(dev);
        struct drm_device *ddev;
 -      struct msm_kms *kms;
 -      struct drm_crtc *crtc;
        int ret;
  
        if (drm_firmware_drivers_only())
        might_lock(&priv->lru.lock);
        fs_reclaim_release(GFP_KERNEL);
  
 -      drm_mode_config_init(ddev);
 +      if (priv->kms_init) {
 +              ret = drmm_mode_config_init(ddev);
 +              if (ret)
 +                      goto err_destroy_wq;
 +      }
  
        ret = msm_init_vram(ddev);
        if (ret)
 -              goto err_cleanup_mode_config;
 +              goto err_destroy_wq;
  
        dma_set_max_seg_size(dev, UINT_MAX);
  
        if (ret)
                goto err_deinit_vram;
  
-       msm_gem_shrinker_init(ddev);
 -      /* the fw fb could be anywhere in memory */
 -      ret = drm_aperture_remove_framebuffers(drv);
 -      if (ret)
 -              goto err_msm_uninit;
 -
+       ret = msm_gem_shrinker_init(ddev);
+       if (ret)
+               goto err_msm_uninit;
  
        if (priv->kms_init) {
 -              ret = priv->kms_init(ddev);
 -              if (ret) {
 -                      DRM_DEV_ERROR(dev, "failed to load kms\n");
 -                      priv->kms = NULL;
 +              ret = msm_drm_kms_init(dev, drv);
 +              if (ret)
                        goto err_msm_uninit;
 -              }
 -              kms = priv->kms;
        } else {
                /* valid only for the dummy headless case, where of_node=NULL */
                WARN_ON(dev->of_node);
 -              kms = NULL;
 -      }
 -
 -      /* Enable normalization of plane zpos */
 -      ddev->mode_config.normalize_zpos = true;
 -
 -      if (kms) {
 -              kms->dev = ddev;
 -              ret = kms->funcs->hw_init(kms);
 -              if (ret) {
 -                      DRM_DEV_ERROR(dev, "kms hw init failed: %d\n", ret);
 -                      goto err_msm_uninit;
 -              }
 -      }
 -
 -      drm_helper_move_panel_connectors_to_head(ddev);
 -
 -      ddev->mode_config.funcs = &mode_config_funcs;
 -      ddev->mode_config.helper_private = &mode_config_helper_funcs;
 -
 -      drm_for_each_crtc(crtc, ddev) {
 -              struct msm_drm_thread *ev_thread;
 -
 -              /* initialize event thread */
 -              ev_thread = &priv->event_thread[drm_crtc_index(crtc)];
 -              ev_thread->dev = ddev;
 -              ev_thread->worker = kthread_create_worker(0, "crtc_event:%d", crtc->base.id);
 -              if (IS_ERR(ev_thread->worker)) {
 -                      ret = PTR_ERR(ev_thread->worker);
 -                      DRM_DEV_ERROR(dev, "failed to create crtc_event kthread\n");
 -                      ev_thread->worker = NULL;
 -                      goto err_msm_uninit;
 -              }
 -
 -              sched_set_fifo(ev_thread->worker->task);
 -      }
 -
 -      ret = drm_vblank_init(ddev, priv->num_crtcs);
 -      if (ret < 0) {
 -              DRM_DEV_ERROR(dev, "failed to initialize vblank\n");
 -              goto err_msm_uninit;
 -      }
 -
 -      if (kms) {
 -              pm_runtime_get_sync(dev);
 -              ret = msm_irq_install(ddev, kms->irq);
 -              pm_runtime_put_sync(dev);
 -              if (ret < 0) {
 -                      DRM_DEV_ERROR(dev, "failed to install IRQ handler\n");
 -                      goto err_msm_uninit;
 -              }
 +              ddev->driver_features &= ~DRIVER_MODESET;
 +              ddev->driver_features &= ~DRIVER_ATOMIC;
        }
  
        ret = drm_dev_register(ddev, 0);
        if (ret)
                goto err_msm_uninit;
  
 -      if (kms) {
 -              ret = msm_disp_snapshot_init(ddev);
 -              if (ret)
 -                      DRM_DEV_ERROR(dev, "msm_disp_snapshot_init failed ret = %d\n", ret);
 -      }
 -      drm_mode_config_reset(ddev);
 -
        ret = msm_debugfs_late_init(ddev);
        if (ret)
                goto err_msm_uninit;
  
        drm_kms_helper_poll_init(ddev);
  
 -      if (kms)
 +      if (priv->kms_init) {
 +              drm_kms_helper_poll_init(ddev);
                msm_fbdev_setup(ddev);
 +      }
  
        return 0;
  
@@@ -302,7 -559,8 +304,7 @@@ err_msm_uninit
  
  err_deinit_vram:
        msm_deinit_vram(ddev);
 -err_cleanup_mode_config:
 -      drm_mode_config_cleanup(ddev);
 +err_destroy_wq:
        destroy_workqueue(priv->wq);
  err_put_dev:
        drm_dev_put(ddev);
@@@ -382,6 -640,28 +384,6 @@@ static void msm_postclose(struct drm_de
        context_close(ctx);
  }
  
 -int msm_crtc_enable_vblank(struct drm_crtc *crtc)
 -{
 -      struct drm_device *dev = crtc->dev;
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -      if (!kms)
 -              return -ENXIO;
 -      drm_dbg_vbl(dev, "crtc=%u", crtc->base.id);
 -      return vblank_ctrl_queue_work(priv, crtc, true);
 -}
 -
 -void msm_crtc_disable_vblank(struct drm_crtc *crtc)
 -{
 -      struct drm_device *dev = crtc->dev;
 -      struct msm_drm_private *priv = dev->dev_private;
 -      struct msm_kms *kms = priv->kms;
 -      if (!kms)
 -              return;
 -      drm_dbg_vbl(dev, "crtc=%u", crtc->base.id);
 -      vblank_ctrl_queue_work(priv, crtc, false);
 -}
 -
  /*
   * DRM ioctls:
   */
@@@ -824,6 -1104,33 +826,6 @@@ static const struct drm_driver msm_driv
        .patchlevel         = MSM_VERSION_PATCHLEVEL,
  };
  
 -int msm_pm_prepare(struct device *dev)
 -{
 -      struct msm_drm_private *priv = dev_get_drvdata(dev);
 -      struct drm_device *ddev = priv ? priv->dev : NULL;
 -
 -      if (!priv || !priv->kms)
 -              return 0;
 -
 -      return drm_mode_config_helper_suspend(ddev);
 -}
 -
 -void msm_pm_complete(struct device *dev)
 -{
 -      struct msm_drm_private *priv = dev_get_drvdata(dev);
 -      struct drm_device *ddev = priv ? priv->dev : NULL;
 -
 -      if (!priv || !priv->kms)
 -              return;
 -
 -      drm_mode_config_helper_resume(ddev);
 -}
 -
 -static const struct dev_pm_ops msm_pm_ops = {
 -      .prepare = msm_pm_prepare,
 -      .complete = msm_pm_complete,
 -};
 -
  /*
   * Componentized driver support:
   */
@@@ -925,8 -1232,7 +927,8 @@@ const struct component_master_ops msm_d
  };
  
  int msm_drv_probe(struct device *master_dev,
 -      int (*kms_init)(struct drm_device *dev))
 +      int (*kms_init)(struct drm_device *dev),
 +      struct msm_kms *kms)
  {
        struct msm_drm_private *priv;
        struct component_match *match = NULL;
        if (!priv)
                return -ENOMEM;
  
 +      priv->kms = kms;
        priv->kms_init = kms_init;
        dev_set_drvdata(master_dev, priv);
  
  
  static int msm_pdev_probe(struct platform_device *pdev)
  {
 -      return msm_drv_probe(&pdev->dev, NULL);
 +      return msm_drv_probe(&pdev->dev, NULL, NULL);
  }
  
 -static int msm_pdev_remove(struct platform_device *pdev)
 +static void msm_pdev_remove(struct platform_device *pdev)
  {
        component_master_del(&pdev->dev, &msm_drm_ops);
 -
 -      return 0;
 -}
 -
 -void msm_drv_shutdown(struct platform_device *pdev)
 -{
 -      struct msm_drm_private *priv = platform_get_drvdata(pdev);
 -      struct drm_device *drm = priv ? priv->dev : NULL;
 -
 -      /*
 -       * Shutdown the hw if we're far enough along where things might be on.
 -       * If we run this too early, we'll end up panicking in any variety of
 -       * places. Since we don't register the drm device until late in
 -       * msm_drm_init, drm_dev->registered is used as an indicator that the
 -       * shutdown will be successful.
 -       */
 -      if (drm && drm->registered && priv->kms)
 -              drm_atomic_helper_shutdown(drm);
  }
  
  static struct platform_driver msm_platform_driver = {
        .probe      = msm_pdev_probe,
 -      .remove     = msm_pdev_remove,
 -      .shutdown   = msm_drv_shutdown,
 +      .remove_new = msm_pdev_remove,
        .driver     = {
                .name   = "msm",
 -              .pm     = &msm_pm_ops,
        },
  };
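
One detail behind the removals above: drm_mode_config_cleanup() drops out of the error and teardown paths because drmm_mode_config_init() registers its cleanup as a managed action tied to the drm_device lifetime. A minimal sketch of that pattern (the demo_modeset_init() name and the mode limits are illustrative):

	static int demo_modeset_init(struct drm_device *ddev)
	{
		int ret;

		/* Cleanup is attached to ddev; no explicit
		 * drm_mode_config_cleanup() call is needed later. */
		ret = drmm_mode_config_init(ddev);
		if (ret)
			return ret;

		ddev->mode_config.max_width  = 4096;	/* illustrative limits */
		ddev->mode_config.max_height = 4096;
		return 0;
	}
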
  
index 7dbd0f06898bd0ec1852e0275d89e66ea9f39c1f,e2fc56f161b524e6cbdf81a41187bbe65ff2e3f4..cd5bf658df669ee697f9efbd732f77bba8d18a38
@@@ -206,6 -206,9 +206,6 @@@ struct msm_drm_private 
  
        struct msm_drm_thread event_thread[MAX_CRTCS];
  
 -      unsigned int num_bridges;
 -      struct drm_bridge *bridges[MAX_BRIDGES];
 -
        /* VRAM carveout, used when no IOMMU: */
        struct {
                unsigned long size;
        } vram;
  
        struct notifier_block vmap_notifier;
-       struct shrinker shrinker;
+       struct shrinker *shrinker;
  
        struct drm_atomic_state *pm_state;
  
@@@ -280,7 -283,7 +280,7 @@@ int msm_ioctl_gem_submit(struct drm_dev
  unsigned long msm_gem_shrinker_shrink(struct drm_device *dev, unsigned long nr_to_scan);
  #endif
  
- void msm_gem_shrinker_init(struct drm_device *dev);
+ int msm_gem_shrinker_init(struct drm_device *dev);
  void msm_gem_shrinker_cleanup(struct drm_device *dev);
  
  struct sg_table *msm_gem_prime_get_sg_table(struct drm_gem_object *obj);
@@@ -340,7 -343,6 +340,7 @@@ void msm_dsi_snapshot(struct msm_disp_s
  bool msm_dsi_is_cmd_mode(struct msm_dsi *msm_dsi);
  bool msm_dsi_is_bonded_dsi(struct msm_dsi *msm_dsi);
  bool msm_dsi_is_master_dsi(struct msm_dsi *msm_dsi);
 +bool msm_dsi_wide_bus_enabled(struct msm_dsi *msm_dsi);
  struct drm_dsc_config *msm_dsi_get_dsc_config(struct msm_dsi *msm_dsi);
  #else
  static inline void __init msm_dsi_register(void)
@@@ -370,10 -372,6 +370,10 @@@ static inline bool msm_dsi_is_master_ds
  {
        return false;
  }
 +static inline bool msm_dsi_wide_bus_enabled(struct msm_dsi *msm_dsi)
 +{
 +      return false;
 +}
  
  static inline struct drm_dsc_config *msm_dsi_get_dsc_config(struct msm_dsi *msm_dsi)
  {
@@@ -563,13 -561,12 +563,13 @@@ static inline unsigned long timeout_to_
  
  extern const struct component_master_ops msm_drm_ops;
  
 -int msm_pm_prepare(struct device *dev);
 -void msm_pm_complete(struct device *dev);
 +int msm_kms_pm_prepare(struct device *dev);
 +void msm_kms_pm_complete(struct device *dev);
  
  int msm_drv_probe(struct device *dev,
 -      int (*kms_init)(struct drm_device *dev));
 -void msm_drv_shutdown(struct platform_device *pdev);
 +      int (*kms_init)(struct drm_device *dev),
 +      struct msm_kms *kms);
 +void msm_kms_shutdown(struct platform_device *pdev);
  
  
  #endif /* __MSM_DRV_H__ */
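
A rough illustration of how the widened msm_drv_probe() prototype above might be used; example_kms_init() and the kms argument handling are hypothetical stand-ins, not code from the msm tree:

#include <linux/platform_device.h>
#include "msm_drv.h"

/* Hypothetical KMS init callback; a real sub-driver would set up its msm_kms here. */
static int example_kms_init(struct drm_device *dev)
{
	return 0;
}

/* KMS sub-driver binding: pass both the init callback and the pre-allocated kms object. */
static int example_kms_probe(struct platform_device *pdev, struct msm_kms *kms)
{
	return msm_drv_probe(&pdev->dev, example_kms_init, kms);
}

/* Headless (GPU-only) binding keeps passing NULL for both, as msm_pdev_probe() does. */
static int example_headless_probe(struct platform_device *pdev)
{
	return msm_drv_probe(&pdev->dev, NULL, NULL);
}
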
diff --combined drivers/gpu/drm/panfrost/panfrost_device.h
index 1e85656dc2f7fe71e57d35ab64b672aa3ebd6883,e667e56893536709bea72867f135a5c1fdd735dd..1ef38f60d5dc4e96f2878d0e6b0ad4c16d1f382d
@@@ -107,7 -107,6 +107,7 @@@ struct panfrost_device 
        struct list_head scheduled_jobs;
  
        struct panfrost_perfcnt *perfcnt;
 +      atomic_t profile_mode;
  
        struct mutex sched_lock;
  
  
        struct mutex shrinker_lock;
        struct list_head shrinker_list;
-       struct shrinker shrinker;
+       struct shrinker *shrinker;
  
        struct panfrost_devfreq pfdevfreq;
 +
 +      struct {
 +              atomic_t use_count;
 +              spinlock_t lock;
 +      } cycle_counter;
  };
  
  struct panfrost_mmu {
        struct list_head list;
  };
  
 +struct panfrost_engine_usage {
 +      unsigned long long elapsed_ns[NUM_JOB_SLOTS];
 +      unsigned long long cycles[NUM_JOB_SLOTS];
 +};
 +
  struct panfrost_file_priv {
        struct panfrost_device *pfdev;
  
        struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
  
        struct panfrost_mmu *mmu;
 +
 +      struct panfrost_engine_usage engine_usage;
  };
  
  static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
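
The panfrost_engine_usage structure above is per-file, per-job-slot bookkeeping consumed by the new fdinfo code further down. A hypothetical sketch of how a completion path could fold a finished job into it (all names other than panfrost_engine_usage are invented; the real accounting lives in the panfrost job code, not shown in this diff):

#include "panfrost_device.h"

static void example_account_job(struct panfrost_engine_usage *usage, int slot,
				u64 start_ns, u64 end_ns, u64 cycles)
{
	usage->elapsed_ns[slot] += end_ns - start_ns;
	usage->cycles[slot] += cycles;
}
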
diff --combined drivers/gpu/drm/panfrost/panfrost_drv.c
index b834777b409b076d5273eccf1fba20952e9e3360,e1d0e3a23757bc7d49583a629d44e204e3775dd2..7cabf4e3d1f214a04e1afa13dac46e239d872f1e
@@@ -20,7 -20,6 +20,7 @@@
  #include "panfrost_job.h"
  #include "panfrost_gpu.h"
  #include "panfrost_perfcnt.h"
 +#include "panfrost_debugfs.h"
  
  static bool unstable_ioctls;
  module_param_unsafe(unstable_ioctls, bool, 0600);
@@@ -268,7 -267,6 +268,7 @@@ static int panfrost_ioctl_submit(struc
        job->requirements = args->requirements;
        job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
        job->mmu = file_priv->mmu;
 +      job->engine_usage = &file_priv->engine_usage;
  
        slot = panfrost_job_get_slot(job);
  
@@@ -525,58 -523,7 +525,58 @@@ static const struct drm_ioctl_desc panf
        PANFROST_IOCTL(MADVISE,         madvise,        DRM_RENDER_ALLOW),
  };
  
 -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
 +static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
 +                                   struct panfrost_file_priv *panfrost_priv,
 +                                   struct drm_printer *p)
 +{
 +      int i;
 +
 +      /*
 +       * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
 +       * accurate, as they only provide a rough estimation of the number of
 +       * GPU cycles and CPU time spent in a given context. This is due to two
 +       * different factors:
 +       * - Firstly, we must consider the time the CPU and then the kernel
 +       *   takes to process the GPU interrupt, which means additional time and
 +       *   GPU cycles will be added in excess to the real figure.
 +       * - Secondly, the pipelining done by the Job Manager (2 job slots per
 +       *   engine) implies there is no way to know exactly how much time each
 +       *   job spent on the GPU.
 +       */
 +
 +      static const char * const engine_names[] = {
 +              "fragment", "vertex-tiler", "compute-only"
 +      };
 +
 +      BUILD_BUG_ON(ARRAY_SIZE(engine_names) != NUM_JOB_SLOTS);
 +
 +      for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
 +              drm_printf(p, "drm-engine-%s:\t%llu ns\n",
 +                         engine_names[i], panfrost_priv->engine_usage.elapsed_ns[i]);
 +              drm_printf(p, "drm-cycles-%s:\t%llu\n",
 +                         engine_names[i], panfrost_priv->engine_usage.cycles[i]);
 +              drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
 +                         engine_names[i], pfdev->pfdevfreq.fast_rate);
 +              drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
 +                         engine_names[i], pfdev->pfdevfreq.current_frequency);
 +      }
 +}
 +
 +static void panfrost_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 +{
 +      struct drm_device *dev = file->minor->dev;
 +      struct panfrost_device *pfdev = dev->dev_private;
 +
 +      panfrost_gpu_show_fdinfo(pfdev, file->driver_priv, p);
 +
 +      drm_show_memory_stats(p, file);
 +}
 +
 +static const struct file_operations panfrost_drm_driver_fops = {
 +      .owner = THIS_MODULE,
 +      DRM_GEM_FOPS,
 +      .show_fdinfo = drm_show_fdinfo,
 +};
  
  /*
   * Panfrost driver version:
@@@ -588,7 -535,6 +588,7 @@@ static const struct drm_driver panfrost
        .driver_features        = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
        .open                   = panfrost_open,
        .postclose              = panfrost_postclose,
 +      .show_fdinfo            = panfrost_show_fdinfo,
        .ioctls                 = panfrost_drm_driver_ioctls,
        .num_ioctls             = ARRAY_SIZE(panfrost_drm_driver_ioctls),
        .fops                   = &panfrost_drm_driver_fops,
  
        .gem_create_object      = panfrost_gem_create_object,
        .gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table,
 +
 +#ifdef CONFIG_DEBUG_FS
 +      .debugfs_init           = panfrost_debugfs_init,
 +#endif
  };
  
  static int panfrost_probe(struct platform_device *pdev)
        if (err < 0)
                goto err_out1;
  
-       panfrost_gem_shrinker_init(ddev);
+       err = panfrost_gem_shrinker_init(ddev);
+       if (err)
+               goto err_out2;
  
        return 0;
  
+ err_out2:
+       drm_dev_unregister(ddev);
  err_out1:
        pm_runtime_disable(pfdev->dev);
        panfrost_device_fini(pfdev);
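
The fdinfo keys printed above (drm-engine-*, drm-cycles-*, drm-maxfreq-*, drm-curfreq-*) surface through /proc/<pid>/fdinfo/<fd> for an open render node. A minimal userspace sketch that dumps them; the path is illustrative only:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Substitute the pid and fd of an open panfrost render node. */
	FILE *f = fopen("/proc/self/fdinfo/3", "r");
	char line[256];

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "drm-", 4))
			fputs(line, stdout);	/* e.g. "drm-engine-fragment:\t... ns" */
	fclose(f);
	return 0;
}
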
diff --combined drivers/gpu/drm/panfrost/panfrost_gem.h
index 13c0a8149c3ad014b310da90ec758dad8ed825f5,863d2ec8d4f013c4e0fb54fd9254e3a9fed3105c..7516b7ecf7feab416491a0cdeba01ceaded16c54
@@@ -36,11 -36,6 +36,11 @@@ struct panfrost_gem_object 
         */
        atomic_t gpu_usecount;
  
 +      /*
 +       * Object chunk size currently mapped onto physical memory
 +       */
 +      size_t heap_rss_size;
 +
        bool noexec             :1;
        bool is_heap            :1;
  };
@@@ -86,7 -81,7 +86,7 @@@ panfrost_gem_mapping_get(struct panfros
  void panfrost_gem_mapping_put(struct panfrost_gem_mapping *mapping);
  void panfrost_gem_teardown_mappings_locked(struct panfrost_gem_object *bo);
  
- void panfrost_gem_shrinker_init(struct drm_device *dev);
+ int panfrost_gem_shrinker_init(struct drm_device *dev);
  void panfrost_gem_shrinker_cleanup(struct drm_device *dev);
  
  #endif /* __PANFROST_GEM_H__ */
diff --combined drivers/md/bcache/bcache.h
index 313cee6ad00983f12619c34425bfd3f010894f80,c622bc50f81bd830deb4c50615164ace661e2c3e..05be59ae21b29d7de20c7ca32cae293f7ea195e4
  #define pr_fmt(fmt) "bcache: %s() " fmt, __func__
  
  #include <linux/bio.h>
 +#include <linux/closure.h>
  #include <linux/kobject.h>
  #include <linux/list.h>
  #include <linux/mutex.h>
  #include "bcache_ondisk.h"
  #include "bset.h"
  #include "util.h"
 -#include "closure.h"
  
  struct bucket {
        atomic_t        pin;
@@@ -299,7 -299,6 +299,7 @@@ struct cached_dev 
        struct list_head        list;
        struct bcache_device    disk;
        struct block_device     *bdev;
 +      struct bdev_handle      *bdev_handle;
  
        struct cache_sb         sb;
        struct cache_sb_disk    *sb_disk;
@@@ -422,7 -421,6 +422,7 @@@ struct cache 
  
        struct kobject          kobj;
        struct block_device     *bdev;
 +      struct bdev_handle      *bdev_handle;
  
        struct task_struct      *alloc_thread;
  
@@@ -543,7 -541,7 +543,7 @@@ struct cache_set 
        struct bio_set          bio_split;
  
        /* For the btree cache */
-       struct shrinker         shrink;
+       struct shrinker         *shrink;
  
        /* For the btree cache and anything allocation related */
        struct mutex            bucket_lock;
diff --combined drivers/md/dm-cache-metadata.c
index 5a18b80d3666e75fe6a4702de9aab7003a55ff18,9e0c69958587209a0fb3af47c42446f4d2b58910..96751cd3d18113795adb98bd34ee05d984e74d4a
@@@ -597,7 -597,7 +597,7 @@@ static void read_superblock_fields(stru
        cmd->discard_nr_blocks = to_dblock(le64_to_cpu(disk_super->discard_nr_blocks));
        cmd->data_block_size = le32_to_cpu(disk_super->data_block_size);
        cmd->cache_blocks = to_cblock(le32_to_cpu(disk_super->cache_blocks));
 -      strncpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name));
 +      strscpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name));
        cmd->policy_version[0] = le32_to_cpu(disk_super->policy_version[0]);
        cmd->policy_version[1] = le32_to_cpu(disk_super->policy_version[1]);
        cmd->policy_version[2] = le32_to_cpu(disk_super->policy_version[2]);
@@@ -707,7 -707,7 +707,7 @@@ static int __commit_transaction(struct 
        disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size);
        disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks));
        disk_super->cache_blocks = cpu_to_le32(from_cblock(cmd->cache_blocks));
 -      strncpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name));
 +      strscpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name));
        disk_super->policy_version[0] = cpu_to_le32(cmd->policy_version[0]);
        disk_super->policy_version[1] = cpu_to_le32(cmd->policy_version[1]);
        disk_super->policy_version[2] = cpu_to_le32(cmd->policy_version[2]);
@@@ -1726,7 -1726,7 +1726,7 @@@ static int write_hints(struct dm_cache_
            (strlen(policy_name) > sizeof(cmd->policy_name) - 1))
                return -EINVAL;
  
 -      strncpy(cmd->policy_name, policy_name, sizeof(cmd->policy_name));
 +      strscpy(cmd->policy_name, policy_name, sizeof(cmd->policy_name));
        memcpy(cmd->policy_version, policy_version, sizeof(cmd->policy_version));
  
        hint_size = dm_cache_policy_get_hint_size(policy);
@@@ -1828,7 -1828,7 +1828,7 @@@ int dm_cache_metadata_abort(struct dm_c
         * Replacement block manager (new_bm) is created and old_bm destroyed outside of
         * cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
         * shrinker associated with the block manager's bufio client vs cmd root_lock).
-        * - must take shrinker_rwsem without holding cmd->root_lock
+        * - must take shrinker_mutex without holding cmd->root_lock
         */
        new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
                                         CACHE_MAX_CONCURRENT_LOCKS);
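
The strncpy() to strscpy() conversions above swap in a copy helper that always NUL-terminates and reports truncation. A small sketch of the semantics, with an invented 16-byte destination:

#include <linux/printk.h>
#include <linux/string.h>

#define EXAMPLE_NAME_LEN 16

static void example_copy_policy_name(char dst[EXAMPLE_NAME_LEN], const char *src)
{
	/*
	 * strscpy() always NUL-terminates the destination (for a non-zero
	 * size) and returns the number of characters copied, or -E2BIG when
	 * src had to be truncated. strncpy() neither guarantees termination
	 * nor reports truncation.
	 */
	if (strscpy(dst, src, EXAMPLE_NAME_LEN) < 0)
		pr_warn("policy name truncated\n");
}
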
diff --combined drivers/md/raid5.c
index c84ccc97329bbf043ee63557d68657a1a2aef003,c8d2c6e50aa1c6ec3a56b9020e8041437ca3589e..dc031d42f53bc678e775741318985db351d17c01
@@@ -70,8 -70,6 +70,8 @@@ MODULE_PARM_DESC(devices_handle_discard
                 "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
  static struct workqueue_struct *raid5_wq;
  
 +static void raid5_quiesce(struct mddev *mddev, int quiesce);
 +
  static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
  {
        int hash = (sect >> RAID5_STRIPE_SHIFT(conf)) & HASH_MASK;
@@@ -856,13 -854,6 +856,13 @@@ struct stripe_head *raid5_get_active_st
  
                set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
                r5l_wake_reclaim(conf->log, 0);
 +
 +              /* release batch_last before wait to avoid risk of deadlock */
 +              if (ctx && ctx->batch_last) {
 +                      raid5_release_stripe(ctx->batch_last);
 +                      ctx->batch_last = NULL;
 +              }
 +
                wait_event_lock_irq(conf->wait_for_stripe,
                                    is_inactive_blocked(conf, hash),
                                    *(conf->hash_locks + hash));
@@@ -2501,12 -2492,15 +2501,12 @@@ static int resize_chunks(struct r5conf 
        unsigned long cpu;
        int err = 0;
  
 -      /*
 -       * Never shrink. And mddev_suspend() could deadlock if this is called
 -       * from raid5d. In that case, scribble_disks and scribble_sectors
 -       * should equal to new_disks and new_sectors
 -       */
 +      /* Never shrink. */
        if (conf->scribble_disks >= new_disks &&
            conf->scribble_sectors >= new_sectors)
                return 0;
 -      mddev_suspend(conf->mddev);
 +
 +      raid5_quiesce(conf->mddev, true);
        cpus_read_lock();
  
        for_each_present_cpu(cpu) {
        }
  
        cpus_read_unlock();
 -      mddev_resume(conf->mddev);
 +      raid5_quiesce(conf->mddev, false);
 +
        if (!err) {
                conf->scribble_disks = new_disks;
                conf->scribble_sectors = new_sectors;
@@@ -5960,6 -5953,19 +5960,6 @@@ out
        return ret;
  }
  
 -static bool reshape_inprogress(struct mddev *mddev)
 -{
 -      return test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 -             test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 -             !test_bit(MD_RECOVERY_DONE, &mddev->recovery) &&
 -             !test_bit(MD_RECOVERY_INTR, &mddev->recovery);
 -}
 -
 -static bool reshape_disabled(struct mddev *mddev)
 -{
 -      return is_md_suspended(mddev) || !md_is_rdwr(mddev);
 -}
 -
  static enum stripe_result make_stripe_request(struct mddev *mddev,
                struct r5conf *conf, struct stripe_request_ctx *ctx,
                sector_t logical_sector, struct bio *bi)
                        if (ahead_of_reshape(mddev, logical_sector,
                                             conf->reshape_safe)) {
                                spin_unlock_irq(&conf->device_lock);
 -                              ret = STRIPE_SCHEDULE_AND_RETRY;
 -                              goto out;
 +                              return STRIPE_SCHEDULE_AND_RETRY;
                        }
                }
                spin_unlock_irq(&conf->device_lock);
  
  out_release:
        raid5_release_stripe(sh);
 -out:
 -      if (ret == STRIPE_SCHEDULE_AND_RETRY && !reshape_inprogress(mddev) &&
 -          reshape_disabled(mddev)) {
 -              bi->bi_status = BLK_STS_IOERR;
 -              ret = STRIPE_FAIL;
 -              pr_err("md/raid456:%s: io failed across reshape position while reshape can't make progress.\n",
 -                     mdname(mddev));
 -      }
 -
        return ret;
  }
  
@@@ -7009,7 -7025,7 +7009,7 @@@ raid5_store_stripe_size(struct mddev  *
                        new != roundup_pow_of_two(new))
                return -EINVAL;
  
 -      err = mddev_lock(mddev);
 +      err = mddev_suspend_and_lock(mddev);
        if (err)
                return err;
  
                goto out_unlock;
        }
  
 -      mddev_suspend(mddev);
        mutex_lock(&conf->cache_size_mutex);
        size = conf->max_nr_stripes;
  
                err = -ENOMEM;
        }
        mutex_unlock(&conf->cache_size_mutex);
 -      mddev_resume(mddev);
  
  out_unlock:
 -      mddev_unlock(mddev);
 +      mddev_unlock_and_resume(mddev);
        return err ?: len;
  }
  
@@@ -7135,7 -7153,7 +7135,7 @@@ raid5_store_skip_copy(struct mddev *mdd
                return -EINVAL;
        new = !!new;
  
 -      err = mddev_lock(mddev);
 +      err = mddev_suspend_and_lock(mddev);
        if (err)
                return err;
        conf = mddev->private;
        else if (new != conf->skip_copy) {
                struct request_queue *q = mddev->queue;
  
 -              mddev_suspend(mddev);
                conf->skip_copy = new;
                if (new)
                        blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
                else
                        blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
 -              mddev_resume(mddev);
        }
 -      mddev_unlock(mddev);
 +      mddev_unlock_and_resume(mddev);
        return err ?: len;
  }
  
@@@ -7205,13 -7225,15 +7205,13 @@@ raid5_store_group_thread_cnt(struct mdd
        if (new > 8192)
                return -EINVAL;
  
 -      err = mddev_lock(mddev);
 +      err = mddev_suspend_and_lock(mddev);
        if (err)
                return err;
        conf = mddev->private;
        if (!conf)
                err = -ENODEV;
        else if (new != conf->worker_cnt_per_group) {
 -              mddev_suspend(mddev);
 -
                old_groups = conf->worker_groups;
                if (old_groups)
                        flush_workqueue(raid5_wq);
                                kfree(old_groups[0].workers);
                        kfree(old_groups);
                }
 -              mddev_resume(mddev);
        }
 -      mddev_unlock(mddev);
 +      mddev_unlock_and_resume(mddev);
  
        return err ?: len;
  }
@@@ -7378,7 -7401,7 +7378,7 @@@ static void free_conf(struct r5conf *co
  
        log_exit(conf);
  
-       unregister_shrinker(&conf->shrinker);
+       shrinker_free(conf->shrinker);
        free_thread_groups(conf);
        shrink_stripes(conf);
        raid5_free_percpu(conf);
@@@ -7426,7 -7449,7 +7426,7 @@@ static int raid5_alloc_percpu(struct r5
  static unsigned long raid5_cache_scan(struct shrinker *shrink,
                                      struct shrink_control *sc)
  {
-       struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
+       struct r5conf *conf = shrink->private_data;
        unsigned long ret = SHRINK_STOP;
  
        if (mutex_trylock(&conf->cache_size_mutex)) {
  static unsigned long raid5_cache_count(struct shrinker *shrink,
                                       struct shrink_control *sc)
  {
-       struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
+       struct r5conf *conf = shrink->private_data;
  
        if (conf->max_nr_stripes < conf->min_nr_stripes)
                /* unlikely, but not impossible */
@@@ -7682,18 -7705,22 +7682,22 @@@ static struct r5conf *setup_conf(struc
         * it reduces the queue depth and so can hurt throughput.
         * So set it rather large, scaled by number of devices.
         */
-       conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
-       conf->shrinker.scan_objects = raid5_cache_scan;
-       conf->shrinker.count_objects = raid5_cache_count;
-       conf->shrinker.batch = 128;
-       conf->shrinker.flags = 0;
-       ret = register_shrinker(&conf->shrinker, "md-raid5:%s", mdname(mddev));
-       if (ret) {
-               pr_warn("md/raid:%s: couldn't register shrinker.\n",
+       conf->shrinker = shrinker_alloc(0, "md-raid5:%s", mdname(mddev));
+       if (!conf->shrinker) {
+               ret = -ENOMEM;
+               pr_warn("md/raid:%s: couldn't allocate shrinker.\n",
                        mdname(mddev));
                goto abort;
        }
  
+       conf->shrinker->seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
+       conf->shrinker->scan_objects = raid5_cache_scan;
+       conf->shrinker->count_objects = raid5_cache_count;
+       conf->shrinker->batch = 128;
+       conf->shrinker->private_data = conf;
+       shrinker_register(conf->shrinker);
        sprintf(pers_name, "raid%d", mddev->new_level);
        rcu_assign_pointer(conf->thread,
                           md_register_thread(raid5d, mddev, pers_name));
@@@ -7755,6 -7782,9 +7759,6 @@@ static int raid5_run(struct mddev *mdde
        long long min_offset_diff = 0;
        int first = 1;
  
 -      if (mddev_init_writes_pending(mddev) < 0)
 -              return -ENOMEM;
 -
        if (mddev->recovery_cp != MaxSector)
                pr_notice("md/raid:%s: not clean -- starting background reconstruction\n",
                          mdname(mddev));
@@@ -8535,8 -8565,8 +8539,8 @@@ static int raid5_start_reshape(struct m
         * the reshape wasn't running - like Discard or Read - have
         * completed.
         */
 -      mddev_suspend(mddev);
 -      mddev_resume(mddev);
 +      raid5_quiesce(mddev, true);
 +      raid5_quiesce(mddev, false);
  
        /* Add some new drives, as many as will fit.
         * We know there are enough to make the newly sized array work.
@@@ -8951,12 -8981,12 +8955,12 @@@ static int raid5_change_consistency_pol
        struct r5conf *conf;
        int err;
  
 -      err = mddev_lock(mddev);
 +      err = mddev_suspend_and_lock(mddev);
        if (err)
                return err;
        conf = mddev->private;
        if (!conf) {
 -              mddev_unlock(mddev);
 +              mddev_unlock_and_resume(mddev);
                return -ENODEV;
        }
  
                        err = log_init(conf, NULL, true);
                        if (!err) {
                                err = resize_stripes(conf, conf->pool_size);
 -                              if (err) {
 -                                      mddev_suspend(mddev);
 +                              if (err)
                                        log_exit(conf);
 -                                      mddev_resume(mddev);
 -                              }
                        }
                } else
                        err = -EINVAL;
        } else if (strncmp(buf, "resync", 6) == 0) {
                if (raid5_has_ppl(conf)) {
 -                      mddev_suspend(mddev);
                        log_exit(conf);
 -                      mddev_resume(mddev);
                        err = resize_stripes(conf, conf->pool_size);
                } else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
                           r5l_log_disk_error(conf)) {
                                        break;
                                }
  
 -                      if (!journal_dev_exists) {
 -                              mddev_suspend(mddev);
 +                      if (!journal_dev_exists)
                                clear_bit(MD_HAS_JOURNAL, &mddev->flags);
 -                              mddev_resume(mddev);
 -                      } else  /* need remove journal device first */
 +                      else  /* need remove journal device first */
                                err = -EBUSY;
                } else
                        err = -EINVAL;
        if (!err)
                md_update_sb(mddev, 1);
  
 -      mddev_unlock(mddev);
 +      mddev_unlock_and_resume(mddev);
  
        return err;
  }
@@@ -9011,6 -9048,22 +9015,6 @@@ static int raid5_start(struct mddev *md
        return r5l_start(conf->log);
  }
  
 -static void raid5_prepare_suspend(struct mddev *mddev)
 -{
 -      struct r5conf *conf = mddev->private;
 -
 -      wait_event(mddev->sb_wait, !reshape_inprogress(mddev) ||
 -                                  percpu_ref_is_zero(&mddev->active_io));
 -      if (percpu_ref_is_zero(&mddev->active_io))
 -              return;
 -
 -      /*
 -       * Reshape is not in progress, and array is suspended, io that is
 -       * waiting for reshpape can never be done.
 -       */
 -      wake_up(&conf->wait_for_overlap);
 -}
 -
  static struct md_personality raid6_personality =
  {
        .name           = "raid6",
        .check_reshape  = raid6_check_reshape,
        .start_reshape  = raid5_start_reshape,
        .finish_reshape = raid5_finish_reshape,
 -      .prepare_suspend = raid5_prepare_suspend,
        .quiesce        = raid5_quiesce,
        .takeover       = raid6_takeover,
        .change_consistency_policy = raid5_change_consistency_policy,
@@@ -9055,6 -9109,7 +9059,6 @@@ static struct md_personality raid5_pers
        .check_reshape  = raid5_check_reshape,
        .start_reshape  = raid5_start_reshape,
        .finish_reshape = raid5_finish_reshape,
 -      .prepare_suspend = raid5_prepare_suspend,
        .quiesce        = raid5_quiesce,
        .takeover       = raid5_takeover,
        .change_consistency_policy = raid5_change_consistency_policy,
@@@ -9080,6 -9135,7 +9084,6 @@@ static struct md_personality raid4_pers
        .check_reshape  = raid5_check_reshape,
        .start_reshape  = raid5_start_reshape,
        .finish_reshape = raid5_finish_reshape,
 -      .prepare_suspend = raid5_prepare_suspend,
        .quiesce        = raid5_quiesce,
        .takeover       = raid4_takeover,
        .change_consistency_policy = raid5_change_consistency_policy,
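
The raid5 conversion above is the same dynamically-allocated shrinker API used throughout this merge (shrinker_alloc + private_data + shrinker_register, torn down with shrinker_free). A condensed sketch of the lifecycle; everything prefixed example_ is a placeholder:

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/shrinker.h>

struct example_cache {
	struct shrinker *shrinker;
	atomic_long_t nr_cached;	/* illustrative object count */
};

static unsigned long example_count(struct shrinker *shrink, struct shrink_control *sc)
{
	struct example_cache *cache = shrink->private_data;

	return atomic_long_read(&cache->nr_cached);
}

static unsigned long example_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Would fetch shrink->private_data and free up to sc->nr_to_scan objects. */
	return SHRINK_STOP;
}

static int example_cache_init(struct example_cache *cache)
{
	cache->shrinker = shrinker_alloc(0, "example-cache");
	if (!cache->shrinker)
		return -ENOMEM;

	cache->shrinker->count_objects = example_count;
	cache->shrinker->scan_objects = example_scan;
	cache->shrinker->private_data = cache;
	shrinker_register(cache->shrinker);	/* shrinker becomes visible here */
	return 0;
}

static void example_cache_exit(struct example_cache *cache)
{
	shrinker_free(cache->shrinker);		/* replaces unregister_shrinker() */
}
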
diff --combined drivers/virtio/virtio_balloon.c
index 2d5d252ef419727f9685bacc483e4923544b181e,5cdc7b081baa279906f4240b7567147f9e146eb3..44dcb9e7b55ec044fb78bb32e9f0aefde8ecc5a4
@@@ -111,7 -111,7 +111,7 @@@ struct virtio_balloon 
        struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
  
        /* Shrinker to return free pages - VIRTIO_BALLOON_F_FREE_PAGE_HINT */
-       struct shrinker shrinker;
+       struct shrinker *shrinker;
  
        /* OOM notifier to deflate on OOM - VIRTIO_BALLOON_F_DEFLATE_ON_OOM */
        struct notifier_block oom_nb;
@@@ -395,11 -395,7 +395,11 @@@ static inline s64 towards_target(struc
        virtio_cread_le(vb->vdev, struct virtio_balloon_config, num_pages,
                        &num_pages);
  
 -      target = num_pages;
 +      /*
 +       * Align up to the guest page size to avoid inflating and deflating
 +       * the balloon endlessly.
 +       */
 +      target = ALIGN(num_pages, VIRTIO_BALLOON_PAGES_PER_PAGE);
        return target - vb->num_pages;
  }
  
@@@ -820,8 -816,7 +820,7 @@@ static unsigned long shrink_free_pages(
  static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
                                                  struct shrink_control *sc)
  {
-       struct virtio_balloon *vb = container_of(shrinker,
-                                       struct virtio_balloon, shrinker);
+       struct virtio_balloon *vb = shrinker->private_data;
  
        return shrink_free_pages(vb, sc->nr_to_scan);
  }
  static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
                                                   struct shrink_control *sc)
  {
-       struct virtio_balloon *vb = container_of(shrinker,
-                                       struct virtio_balloon, shrinker);
+       struct virtio_balloon *vb = shrinker->private_data;
  
        return vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES;
  }
@@@ -851,16 -845,22 +849,22 @@@ static int virtio_balloon_oom_notify(st
  
  static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
  {
-       unregister_shrinker(&vb->shrinker);
+       shrinker_free(vb->shrinker);
  }
  
  static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
  {
-       vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
-       vb->shrinker.count_objects = virtio_balloon_shrinker_count;
-       vb->shrinker.seeks = DEFAULT_SEEKS;
+       vb->shrinker = shrinker_alloc(0, "virtio-balloon");
+       if (!vb->shrinker)
+               return -ENOMEM;
  
-       return register_shrinker(&vb->shrinker, "virtio-balloon");
+       vb->shrinker->scan_objects = virtio_balloon_shrinker_scan;
+       vb->shrinker->count_objects = virtio_balloon_shrinker_count;
+       vb->shrinker->private_data = vb;
+       shrinker_register(vb->shrinker);
+       return 0;
  }
  
  static int virtballoon_probe(struct virtio_device *vdev)
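
The towards_target() change above rounds the device-requested balloon size up to the guest page granularity. A tiny sketch of what ALIGN() does here; the 64 KiB-page figure is only an example:

#include <linux/align.h>
#include <linux/types.h>

/*
 * ALIGN(x, a) rounds x up to the next multiple of a (a power of two). The
 * balloon counts 4 KiB-sized balloon pages but inflates/deflates one guest
 * page at a time, so on e.g. a 64 KiB-page guest num_pages moves in steps of
 * 16; a target that is not a multiple of that step could never be reached
 * exactly and the balloon would keep oscillating around it.
 */
static inline u32 example_round_target(u32 num_pages, u32 pages_per_guest_page)
{
	return ALIGN(num_pages, pages_per_guest_page);
}
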
diff --combined fs/bcachefs/btree_cache.c
index 82cf243aa28830ef1720c994bd7261eebbb23cba,0000000000000000000000000000000000000000..5e585819190576db1f22ec9ec19b9b3322f1caf1
mode 100644,000000..100644
--- /dev/null
@@@ -1,1202 -1,0 +1,1204 @@@
-       struct bch_fs *c = container_of(shrink, struct bch_fs,
-                                       btree_cache.shrink);
 +// SPDX-License-Identifier: GPL-2.0
 +
 +#include "bcachefs.h"
 +#include "bkey_buf.h"
 +#include "btree_cache.h"
 +#include "btree_io.h"
 +#include "btree_iter.h"
 +#include "btree_locking.h"
 +#include "debug.h"
 +#include "errcode.h"
 +#include "error.h"
 +#include "trace.h"
 +
 +#include <linux/prefetch.h>
 +#include <linux/sched/mm.h>
 +
 +const char * const bch2_btree_node_flags[] = {
 +#define x(f)  #f,
 +      BTREE_FLAGS()
 +#undef x
 +      NULL
 +};
 +
 +void bch2_recalc_btree_reserve(struct bch_fs *c)
 +{
 +      unsigned i, reserve = 16;
 +
 +      if (!c->btree_roots_known[0].b)
 +              reserve += 8;
 +
 +      for (i = 0; i < btree_id_nr_alive(c); i++) {
 +              struct btree_root *r = bch2_btree_id_root(c, i);
 +
 +              if (r->b)
 +                      reserve += min_t(unsigned, 1, r->b->c.level) * 8;
 +      }
 +
 +      c->btree_cache.reserve = reserve;
 +}
 +
 +static inline unsigned btree_cache_can_free(struct btree_cache *bc)
 +{
 +      return max_t(int, 0, bc->used - bc->reserve);
 +}
 +
 +static void btree_node_to_freedlist(struct btree_cache *bc, struct btree *b)
 +{
 +      if (b->c.lock.readers)
 +              list_move(&b->list, &bc->freed_pcpu);
 +      else
 +              list_move(&b->list, &bc->freed_nonpcpu);
 +}
 +
 +static void btree_node_data_free(struct bch_fs *c, struct btree *b)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +
 +      EBUG_ON(btree_node_write_in_flight(b));
 +
 +      clear_btree_node_just_written(b);
 +
 +      kvpfree(b->data, btree_bytes(c));
 +      b->data = NULL;
 +#ifdef __KERNEL__
 +      kvfree(b->aux_data);
 +#else
 +      munmap(b->aux_data, btree_aux_data_bytes(b));
 +#endif
 +      b->aux_data = NULL;
 +
 +      bc->used--;
 +
 +      btree_node_to_freedlist(bc, b);
 +}
 +
 +static int bch2_btree_cache_cmp_fn(struct rhashtable_compare_arg *arg,
 +                                 const void *obj)
 +{
 +      const struct btree *b = obj;
 +      const u64 *v = arg->key;
 +
 +      return b->hash_val == *v ? 0 : 1;
 +}
 +
 +static const struct rhashtable_params bch_btree_cache_params = {
 +      .head_offset    = offsetof(struct btree, hash),
 +      .key_offset     = offsetof(struct btree, hash_val),
 +      .key_len        = sizeof(u64),
 +      .obj_cmpfn      = bch2_btree_cache_cmp_fn,
 +};
 +
 +static int btree_node_data_alloc(struct bch_fs *c, struct btree *b, gfp_t gfp)
 +{
 +      BUG_ON(b->data || b->aux_data);
 +
 +      b->data = kvpmalloc(btree_bytes(c), gfp);
 +      if (!b->data)
 +              return -BCH_ERR_ENOMEM_btree_node_mem_alloc;
 +#ifdef __KERNEL__
 +      b->aux_data = kvmalloc(btree_aux_data_bytes(b), gfp);
 +#else
 +      b->aux_data = mmap(NULL, btree_aux_data_bytes(b),
 +                         PROT_READ|PROT_WRITE|PROT_EXEC,
 +                         MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
 +      if (b->aux_data == MAP_FAILED)
 +              b->aux_data = NULL;
 +#endif
 +      if (!b->aux_data) {
 +              kvpfree(b->data, btree_bytes(c));
 +              b->data = NULL;
 +              return -BCH_ERR_ENOMEM_btree_node_mem_alloc;
 +      }
 +
 +      return 0;
 +}
 +
 +static struct btree *__btree_node_mem_alloc(struct bch_fs *c, gfp_t gfp)
 +{
 +      struct btree *b;
 +
 +      b = kzalloc(sizeof(struct btree), gfp);
 +      if (!b)
 +              return NULL;
 +
 +      bkey_btree_ptr_init(&b->key);
 +      INIT_LIST_HEAD(&b->list);
 +      INIT_LIST_HEAD(&b->write_blocked);
 +      b->byte_order = ilog2(btree_bytes(c));
 +      return b;
 +}
 +
 +struct btree *__bch2_btree_node_mem_alloc(struct bch_fs *c)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +
 +      b = __btree_node_mem_alloc(c, GFP_KERNEL);
 +      if (!b)
 +              return NULL;
 +
 +      if (btree_node_data_alloc(c, b, GFP_KERNEL)) {
 +              kfree(b);
 +              return NULL;
 +      }
 +
 +      bch2_btree_lock_init(&b->c, 0);
 +
 +      bc->used++;
 +      list_add(&b->list, &bc->freeable);
 +      return b;
 +}
 +
 +/* Btree in memory cache - hash table */
 +
 +void bch2_btree_node_hash_remove(struct btree_cache *bc, struct btree *b)
 +{
 +      int ret = rhashtable_remove_fast(&bc->table, &b->hash, bch_btree_cache_params);
 +
 +      BUG_ON(ret);
 +
 +      /* Cause future lookups for this node to fail: */
 +      b->hash_val = 0;
 +}
 +
 +int __bch2_btree_node_hash_insert(struct btree_cache *bc, struct btree *b)
 +{
 +      BUG_ON(b->hash_val);
 +      b->hash_val = btree_ptr_hash_val(&b->key);
 +
 +      return rhashtable_lookup_insert_fast(&bc->table, &b->hash,
 +                                           bch_btree_cache_params);
 +}
 +
 +int bch2_btree_node_hash_insert(struct btree_cache *bc, struct btree *b,
 +                              unsigned level, enum btree_id id)
 +{
 +      int ret;
 +
 +      b->c.level      = level;
 +      b->c.btree_id   = id;
 +
 +      mutex_lock(&bc->lock);
 +      ret = __bch2_btree_node_hash_insert(bc, b);
 +      if (!ret)
 +              list_add_tail(&b->list, &bc->live);
 +      mutex_unlock(&bc->lock);
 +
 +      return ret;
 +}
 +
 +__flatten
 +static inline struct btree *btree_cache_find(struct btree_cache *bc,
 +                                   const struct bkey_i *k)
 +{
 +      u64 v = btree_ptr_hash_val(k);
 +
 +      return rhashtable_lookup_fast(&bc->table, &v, bch_btree_cache_params);
 +}
 +
 +/*
 + * this version is for btree nodes that have already been freed (we're not
 + * reaping a real btree node)
 + */
 +static int __btree_node_reclaim(struct bch_fs *c, struct btree *b, bool flush)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +      int ret = 0;
 +
 +      lockdep_assert_held(&bc->lock);
 +wait_on_io:
 +      if (b->flags & ((1U << BTREE_NODE_dirty)|
 +                      (1U << BTREE_NODE_read_in_flight)|
 +                      (1U << BTREE_NODE_write_in_flight))) {
 +              if (!flush)
 +                      return -BCH_ERR_ENOMEM_btree_node_reclaim;
 +
 +              /* XXX: waiting on IO with btree cache lock held */
 +              bch2_btree_node_wait_on_read(b);
 +              bch2_btree_node_wait_on_write(b);
 +      }
 +
 +      if (!six_trylock_intent(&b->c.lock))
 +              return -BCH_ERR_ENOMEM_btree_node_reclaim;
 +
 +      if (!six_trylock_write(&b->c.lock))
 +              goto out_unlock_intent;
 +
 +      /* recheck under lock */
 +      if (b->flags & ((1U << BTREE_NODE_read_in_flight)|
 +                      (1U << BTREE_NODE_write_in_flight))) {
 +              if (!flush)
 +                      goto out_unlock;
 +              six_unlock_write(&b->c.lock);
 +              six_unlock_intent(&b->c.lock);
 +              goto wait_on_io;
 +      }
 +
 +      if (btree_node_noevict(b) ||
 +          btree_node_write_blocked(b) ||
 +          btree_node_will_make_reachable(b))
 +              goto out_unlock;
 +
 +      if (btree_node_dirty(b)) {
 +              if (!flush)
 +                      goto out_unlock;
 +              /*
 +               * Using the underscore version because we don't want to compact
 +               * bsets after the write, since this node is about to be evicted
 +               * - unless btree verify mode is enabled, since it runs out of
 +               * the post write cleanup:
 +               */
 +              if (bch2_verify_btree_ondisk)
 +                      bch2_btree_node_write(c, b, SIX_LOCK_intent,
 +                                            BTREE_WRITE_cache_reclaim);
 +              else
 +                      __bch2_btree_node_write(c, b,
 +                                              BTREE_WRITE_cache_reclaim);
 +
 +              six_unlock_write(&b->c.lock);
 +              six_unlock_intent(&b->c.lock);
 +              goto wait_on_io;
 +      }
 +out:
 +      if (b->hash_val && !ret)
 +              trace_and_count(c, btree_cache_reap, c, b);
 +      return ret;
 +out_unlock:
 +      six_unlock_write(&b->c.lock);
 +out_unlock_intent:
 +      six_unlock_intent(&b->c.lock);
 +      ret = -BCH_ERR_ENOMEM_btree_node_reclaim;
 +      goto out;
 +}
 +
 +static int btree_node_reclaim(struct bch_fs *c, struct btree *b)
 +{
 +      return __btree_node_reclaim(c, b, false);
 +}
 +
 +static int btree_node_write_and_reclaim(struct bch_fs *c, struct btree *b)
 +{
 +      return __btree_node_reclaim(c, b, true);
 +}
 +
 +static unsigned long bch2_btree_cache_scan(struct shrinker *shrink,
 +                                         struct shrink_control *sc)
 +{
-       struct bch_fs *c = container_of(shrink, struct bch_fs,
-                                       btree_cache.shrink);
++      struct bch_fs *c = shrink->private_data;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b, *t;
 +      unsigned long nr = sc->nr_to_scan;
 +      unsigned long can_free = 0;
 +      unsigned long freed = 0;
 +      unsigned long touched = 0;
 +      unsigned i, flags;
 +      unsigned long ret = SHRINK_STOP;
 +      bool trigger_writes = atomic_read(&bc->dirty) + nr >=
 +              bc->used * 3 / 4;
 +
 +      if (bch2_btree_shrinker_disabled)
 +              return SHRINK_STOP;
 +
 +      mutex_lock(&bc->lock);
 +      flags = memalloc_nofs_save();
 +
 +      /*
 +       * It's _really_ critical that we don't free too many btree nodes - we
 +       * have to always leave ourselves a reserve. The reserve is how we
 +       * guarantee that allocating memory for a new btree node can always
 +       * succeed, so that inserting keys into the btree can always succeed and
 +       * IO can always make forward progress:
 +       */
 +      can_free = btree_cache_can_free(bc);
 +      nr = min_t(unsigned long, nr, can_free);
 +
 +      i = 0;
 +      list_for_each_entry_safe(b, t, &bc->freeable, list) {
 +              /*
 +               * Leave a few nodes on the freeable list, so that a btree split
 +               * won't have to hit the system allocator:
 +               */
 +              if (++i <= 3)
 +                      continue;
 +
 +              touched++;
 +
 +              if (touched >= nr)
 +                      goto out;
 +
 +              if (!btree_node_reclaim(c, b)) {
 +                      btree_node_data_free(c, b);
 +                      six_unlock_write(&b->c.lock);
 +                      six_unlock_intent(&b->c.lock);
 +                      freed++;
 +              }
 +      }
 +restart:
 +      list_for_each_entry_safe(b, t, &bc->live, list) {
 +              touched++;
 +
 +              if (btree_node_accessed(b)) {
 +                      clear_btree_node_accessed(b);
 +              } else if (!btree_node_reclaim(c, b)) {
 +                      freed++;
 +                      btree_node_data_free(c, b);
 +
 +                      bch2_btree_node_hash_remove(bc, b);
 +                      six_unlock_write(&b->c.lock);
 +                      six_unlock_intent(&b->c.lock);
 +
 +                      if (freed == nr)
 +                              goto out_rotate;
 +              } else if (trigger_writes &&
 +                         btree_node_dirty(b) &&
 +                         !btree_node_will_make_reachable(b) &&
 +                         !btree_node_write_blocked(b) &&
 +                         six_trylock_read(&b->c.lock)) {
 +                      list_move(&bc->live, &b->list);
 +                      mutex_unlock(&bc->lock);
 +                      __bch2_btree_node_write(c, b, BTREE_WRITE_cache_reclaim);
 +                      six_unlock_read(&b->c.lock);
 +                      if (touched >= nr)
 +                              goto out_nounlock;
 +                      mutex_lock(&bc->lock);
 +                      goto restart;
 +              }
 +
 +              if (touched >= nr)
 +                      break;
 +      }
 +out_rotate:
 +      if (&t->list != &bc->live)
 +              list_move_tail(&bc->live, &t->list);
 +out:
 +      mutex_unlock(&bc->lock);
 +out_nounlock:
 +      ret = freed;
 +      memalloc_nofs_restore(flags);
 +      trace_and_count(c, btree_cache_scan, sc->nr_to_scan, can_free, ret);
 +      return ret;
 +}
 +
 +static unsigned long bch2_btree_cache_count(struct shrinker *shrink,
 +                                          struct shrink_control *sc)
 +{
-       unregister_shrinker(&bc->shrink);
++      struct bch_fs *c = shrink->private_data;
 +      struct btree_cache *bc = &c->btree_cache;
 +
 +      if (bch2_btree_shrinker_disabled)
 +              return 0;
 +
 +      return btree_cache_can_free(bc);
 +}
 +
 +void bch2_fs_btree_cache_exit(struct bch_fs *c)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +      unsigned i, flags;
 +
-       bc->shrink.count_objects        = bch2_btree_cache_count;
-       bc->shrink.scan_objects         = bch2_btree_cache_scan;
-       bc->shrink.seeks                = 4;
-       ret = register_shrinker(&bc->shrink, "%s/btree_cache", c->name);
-       if (ret)
++      shrinker_free(bc->shrink);
 +
 +      /* vfree() can allocate memory: */
 +      flags = memalloc_nofs_save();
 +      mutex_lock(&bc->lock);
 +
 +      if (c->verify_data)
 +              list_move(&c->verify_data->list, &bc->live);
 +
 +      kvpfree(c->verify_ondisk, btree_bytes(c));
 +
 +      for (i = 0; i < btree_id_nr_alive(c); i++) {
 +              struct btree_root *r = bch2_btree_id_root(c, i);
 +
 +              if (r->b)
 +                      list_add(&r->b->list, &bc->live);
 +      }
 +
 +      list_splice(&bc->freeable, &bc->live);
 +
 +      while (!list_empty(&bc->live)) {
 +              b = list_first_entry(&bc->live, struct btree, list);
 +
 +              BUG_ON(btree_node_read_in_flight(b) ||
 +                     btree_node_write_in_flight(b));
 +
 +              if (btree_node_dirty(b))
 +                      bch2_btree_complete_write(c, b, btree_current_write(b));
 +              clear_btree_node_dirty_acct(c, b);
 +
 +              btree_node_data_free(c, b);
 +      }
 +
 +      BUG_ON(atomic_read(&c->btree_cache.dirty));
 +
 +      list_splice(&bc->freed_pcpu, &bc->freed_nonpcpu);
 +
 +      while (!list_empty(&bc->freed_nonpcpu)) {
 +              b = list_first_entry(&bc->freed_nonpcpu, struct btree, list);
 +              list_del(&b->list);
 +              six_lock_exit(&b->c.lock);
 +              kfree(b);
 +      }
 +
 +      mutex_unlock(&bc->lock);
 +      memalloc_nofs_restore(flags);
 +
 +      if (bc->table_init_done)
 +              rhashtable_destroy(&bc->table);
 +}
 +
 +int bch2_fs_btree_cache_init(struct bch_fs *c)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
++      struct shrinker *shrink;
 +      unsigned i;
 +      int ret = 0;
 +
 +      ret = rhashtable_init(&bc->table, &bch_btree_cache_params);
 +      if (ret)
 +              goto err;
 +
 +      bc->table_init_done = true;
 +
 +      bch2_recalc_btree_reserve(c);
 +
 +      for (i = 0; i < bc->reserve; i++)
 +              if (!__bch2_btree_node_mem_alloc(c))
 +                      goto err;
 +
 +      list_splice_init(&bc->live, &bc->freeable);
 +
 +      mutex_init(&c->verify_lock);
 +
++      shrink = shrinker_alloc(0, "%s/btree_cache", c->name);
++      if (!shrink)
 +              goto err;
++      bc->shrink = shrink;
++      shrink->count_objects   = bch2_btree_cache_count;
++      shrink->scan_objects    = bch2_btree_cache_scan;
++      shrink->seeks           = 4;
++      shrink->private_data    = c;
++      shrinker_register(shrink);
 +
 +      return 0;
 +err:
 +      return -BCH_ERR_ENOMEM_fs_btree_cache_init;
 +}
 +
 +void bch2_fs_btree_cache_init_early(struct btree_cache *bc)
 +{
 +      mutex_init(&bc->lock);
 +      INIT_LIST_HEAD(&bc->live);
 +      INIT_LIST_HEAD(&bc->freeable);
 +      INIT_LIST_HEAD(&bc->freed_pcpu);
 +      INIT_LIST_HEAD(&bc->freed_nonpcpu);
 +}
 +
 +/*
 + * We can only have one thread cannibalizing other cached btree nodes at a time,
 + * or we'll deadlock. We use an open coded mutex to ensure that, which a
 + * cannibalize_bucket() will take. This means every time we unlock the root of
 + * the btree, we need to release this lock if we have it held.
 + */
 +void bch2_btree_cache_cannibalize_unlock(struct bch_fs *c)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +
 +      if (bc->alloc_lock == current) {
 +              trace_and_count(c, btree_cache_cannibalize_unlock, c);
 +              bc->alloc_lock = NULL;
 +              closure_wake_up(&bc->alloc_wait);
 +      }
 +}
 +
 +int bch2_btree_cache_cannibalize_lock(struct bch_fs *c, struct closure *cl)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct task_struct *old;
 +
 +      old = cmpxchg(&bc->alloc_lock, NULL, current);
 +      if (old == NULL || old == current)
 +              goto success;
 +
 +      if (!cl) {
 +              trace_and_count(c, btree_cache_cannibalize_lock_fail, c);
 +              return -BCH_ERR_ENOMEM_btree_cache_cannibalize_lock;
 +      }
 +
 +      closure_wait(&bc->alloc_wait, cl);
 +
 +      /* Try again, after adding ourselves to waitlist */
 +      old = cmpxchg(&bc->alloc_lock, NULL, current);
 +      if (old == NULL || old == current) {
 +              /* We raced */
 +              closure_wake_up(&bc->alloc_wait);
 +              goto success;
 +      }
 +
 +      trace_and_count(c, btree_cache_cannibalize_lock_fail, c);
 +      return -BCH_ERR_btree_cache_cannibalize_lock_blocked;
 +
 +success:
 +      trace_and_count(c, btree_cache_cannibalize_lock, c);
 +      return 0;
 +}
 +
 +static struct btree *btree_node_cannibalize(struct bch_fs *c)
 +{
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +
 +      list_for_each_entry_reverse(b, &bc->live, list)
 +              if (!btree_node_reclaim(c, b))
 +                      return b;
 +
 +      while (1) {
 +              list_for_each_entry_reverse(b, &bc->live, list)
 +                      if (!btree_node_write_and_reclaim(c, b))
 +                              return b;
 +
 +              /*
 +               * Rare case: all nodes were intent-locked.
 +               * Just busy-wait.
 +               */
 +              WARN_ONCE(1, "btree cache cannibalize failed\n");
 +              cond_resched();
 +      }
 +}
 +
 +struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_read_locks)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct list_head *freed = pcpu_read_locks
 +              ? &bc->freed_pcpu
 +              : &bc->freed_nonpcpu;
 +      struct btree *b, *b2;
 +      u64 start_time = local_clock();
 +      unsigned flags;
 +
 +      flags = memalloc_nofs_save();
 +      mutex_lock(&bc->lock);
 +
 +      /*
 +       * We never free struct btree itself, just the memory that holds the on
 +       * disk node. Check the freed list before allocating a new one:
 +       */
 +      list_for_each_entry(b, freed, list)
 +              if (!btree_node_reclaim(c, b)) {
 +                      list_del_init(&b->list);
 +                      goto got_node;
 +              }
 +
 +      b = __btree_node_mem_alloc(c, GFP_NOWAIT|__GFP_NOWARN);
 +      if (!b) {
 +              mutex_unlock(&bc->lock);
 +              bch2_trans_unlock(trans);
 +              b = __btree_node_mem_alloc(c, GFP_KERNEL);
 +              if (!b)
 +                      goto err;
 +              mutex_lock(&bc->lock);
 +      }
 +
 +      bch2_btree_lock_init(&b->c, pcpu_read_locks ? SIX_LOCK_INIT_PCPU : 0);
 +
 +      BUG_ON(!six_trylock_intent(&b->c.lock));
 +      BUG_ON(!six_trylock_write(&b->c.lock));
 +got_node:
 +
 +      /*
 +       * btree_free() doesn't free memory; it sticks the node on the end of
 +       * the list. Check if there's any freed nodes there:
 +       */
 +      list_for_each_entry(b2, &bc->freeable, list)
 +              if (!btree_node_reclaim(c, b2)) {
 +                      swap(b->data, b2->data);
 +                      swap(b->aux_data, b2->aux_data);
 +                      btree_node_to_freedlist(bc, b2);
 +                      six_unlock_write(&b2->c.lock);
 +                      six_unlock_intent(&b2->c.lock);
 +                      goto got_mem;
 +              }
 +
 +      mutex_unlock(&bc->lock);
 +
 +      if (btree_node_data_alloc(c, b, GFP_NOWAIT|__GFP_NOWARN)) {
 +              bch2_trans_unlock(trans);
 +              if (btree_node_data_alloc(c, b, GFP_KERNEL|__GFP_NOWARN))
 +                      goto err;
 +      }
 +
 +      mutex_lock(&bc->lock);
 +      bc->used++;
 +got_mem:
 +      mutex_unlock(&bc->lock);
 +
 +      BUG_ON(btree_node_hashed(b));
 +      BUG_ON(btree_node_dirty(b));
 +      BUG_ON(btree_node_write_in_flight(b));
 +out:
 +      b->flags                = 0;
 +      b->written              = 0;
 +      b->nsets                = 0;
 +      b->sib_u64s[0]          = 0;
 +      b->sib_u64s[1]          = 0;
 +      b->whiteout_u64s        = 0;
 +      bch2_btree_keys_init(b);
 +      set_btree_node_accessed(b);
 +
 +      bch2_time_stats_update(&c->times[BCH_TIME_btree_node_mem_alloc],
 +                             start_time);
 +
 +      memalloc_nofs_restore(flags);
 +      return b;
 +err:
 +      mutex_lock(&bc->lock);
 +
 +      /* Try to cannibalize another cached btree node: */
 +      if (bc->alloc_lock == current) {
 +              b2 = btree_node_cannibalize(c);
 +              clear_btree_node_just_written(b2);
 +              bch2_btree_node_hash_remove(bc, b2);
 +
 +              if (b) {
 +                      swap(b->data, b2->data);
 +                      swap(b->aux_data, b2->aux_data);
 +                      btree_node_to_freedlist(bc, b2);
 +                      six_unlock_write(&b2->c.lock);
 +                      six_unlock_intent(&b2->c.lock);
 +              } else {
 +                      b = b2;
 +                      list_del_init(&b->list);
 +              }
 +
 +              mutex_unlock(&bc->lock);
 +
 +              trace_and_count(c, btree_cache_cannibalize, c);
 +              goto out;
 +      }
 +
 +      mutex_unlock(&bc->lock);
 +      memalloc_nofs_restore(flags);
 +      return ERR_PTR(-BCH_ERR_ENOMEM_btree_node_mem_alloc);
 +}
 +
 +/* Slowpath, don't want it inlined into btree_iter_traverse() */
 +static noinline struct btree *bch2_btree_node_fill(struct btree_trans *trans,
 +                              struct btree_path *path,
 +                              const struct bkey_i *k,
 +                              enum btree_id btree_id,
 +                              unsigned level,
 +                              enum six_lock_type lock_type,
 +                              bool sync)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +      u32 seq;
 +
 +      BUG_ON(level + 1 >= BTREE_MAX_DEPTH);
 +      /*
 +       * Parent node must be locked, else we could read in a btree node that's
 +       * been freed:
 +       */
 +      if (path && !bch2_btree_node_relock(trans, path, level + 1)) {
 +              trace_and_count(c, trans_restart_relock_parent_for_fill, trans, _THIS_IP_, path);
 +              return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_fill_relock));
 +      }
 +
 +      b = bch2_btree_node_mem_alloc(trans, level != 0);
 +
 +      if (bch2_err_matches(PTR_ERR_OR_ZERO(b), ENOMEM)) {
 +              trans->memory_allocation_failure = true;
 +              trace_and_count(c, trans_restart_memory_allocation_failure, trans, _THIS_IP_, path);
 +              return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_fill_mem_alloc_fail));
 +      }
 +
 +      if (IS_ERR(b))
 +              return b;
 +
 +      /*
 +       * Btree nodes read in from disk should not have the accessed bit set
 +       * initially, so that linear scans don't thrash the cache:
 +       */
 +      clear_btree_node_accessed(b);
 +
 +      bkey_copy(&b->key, k);
 +      if (bch2_btree_node_hash_insert(bc, b, level, btree_id)) {
 +              /* raced with another fill: */
 +
 +              /* mark as unhashed... */
 +              b->hash_val = 0;
 +
 +              mutex_lock(&bc->lock);
 +              list_add(&b->list, &bc->freeable);
 +              mutex_unlock(&bc->lock);
 +
 +              six_unlock_write(&b->c.lock);
 +              six_unlock_intent(&b->c.lock);
 +              return NULL;
 +      }
 +
 +      set_btree_node_read_in_flight(b);
 +
 +      six_unlock_write(&b->c.lock);
 +      seq = six_lock_seq(&b->c.lock);
 +      six_unlock_intent(&b->c.lock);
 +
 +      /* Unlock before doing IO: */
 +      if (path && sync)
 +              bch2_trans_unlock_noassert(trans);
 +
 +      bch2_btree_node_read(c, b, sync);
 +
 +      if (!sync)
 +              return NULL;
 +
 +      if (path) {
 +              int ret = bch2_trans_relock(trans) ?:
 +                      bch2_btree_path_relock_intent(trans, path);
 +              if (ret) {
 +                      BUG_ON(!trans->restarted);
 +                      return ERR_PTR(ret);
 +              }
 +      }
 +
 +      if (!six_relock_type(&b->c.lock, lock_type, seq)) {
 +              if (path)
 +                      trace_and_count(c, trans_restart_relock_after_fill, trans, _THIS_IP_, path);
 +              return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_after_fill));
 +      }
 +
 +      return b;
 +}
 +
 +static noinline void btree_bad_header(struct bch_fs *c, struct btree *b)
 +{
 +      struct printbuf buf = PRINTBUF;
 +
 +      if (c->curr_recovery_pass <= BCH_RECOVERY_PASS_check_allocations)
 +              return;
 +
 +      prt_printf(&buf,
 +             "btree node header doesn't match ptr\n"
 +             "btree %s level %u\n"
 +             "ptr: ",
 +             bch2_btree_ids[b->c.btree_id], b->c.level);
 +      bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key));
 +
 +      prt_printf(&buf, "\nheader: btree %s level %llu\n"
 +             "min ",
 +             bch2_btree_ids[BTREE_NODE_ID(b->data)],
 +             BTREE_NODE_LEVEL(b->data));
 +      bch2_bpos_to_text(&buf, b->data->min_key);
 +
 +      prt_printf(&buf, "\nmax ");
 +      bch2_bpos_to_text(&buf, b->data->max_key);
 +
 +      bch2_fs_inconsistent(c, "%s", buf.buf);
 +      printbuf_exit(&buf);
 +}
 +
 +static inline void btree_check_header(struct bch_fs *c, struct btree *b)
 +{
 +      if (b->c.btree_id != BTREE_NODE_ID(b->data) ||
 +          b->c.level != BTREE_NODE_LEVEL(b->data) ||
 +          !bpos_eq(b->data->max_key, b->key.k.p) ||
 +          (b->key.k.type == KEY_TYPE_btree_ptr_v2 &&
 +           !bpos_eq(b->data->min_key,
 +                    bkey_i_to_btree_ptr_v2(&b->key)->v.min_key)))
 +              btree_bad_header(c, b);
 +}
 +
 +static struct btree *__bch2_btree_node_get(struct btree_trans *trans, struct btree_path *path,
 +                                         const struct bkey_i *k, unsigned level,
 +                                         enum six_lock_type lock_type,
 +                                         unsigned long trace_ip)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +      struct bset_tree *t;
 +      bool need_relock = false;
 +      int ret;
 +
 +      EBUG_ON(level >= BTREE_MAX_DEPTH);
 +retry:
 +      b = btree_cache_find(bc, k);
 +      if (unlikely(!b)) {
 +              /*
 +               * We must have the parent locked to call bch2_btree_node_fill(),
 +               * else we could read in a btree node from disk that's been
 +               * freed:
 +               */
 +              b = bch2_btree_node_fill(trans, path, k, path->btree_id,
 +                                       level, lock_type, true);
 +              need_relock = true;
 +
 +              /* We raced and found the btree node in the cache */
 +              if (!b)
 +                      goto retry;
 +
 +              if (IS_ERR(b))
 +                      return b;
 +      } else {
 +              if (btree_node_read_locked(path, level + 1))
 +                      btree_node_unlock(trans, path, level + 1);
 +
 +              ret = btree_node_lock(trans, path, &b->c, level, lock_type, trace_ip);
 +              if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +                      return ERR_PTR(ret);
 +
 +              BUG_ON(ret);
 +
 +              if (unlikely(b->hash_val != btree_ptr_hash_val(k) ||
 +                           b->c.level != level ||
 +                           race_fault())) {
 +                      six_unlock_type(&b->c.lock, lock_type);
 +                      if (bch2_btree_node_relock(trans, path, level + 1))
 +                              goto retry;
 +
 +                      trace_and_count(c, trans_restart_btree_node_reused, trans, trace_ip, path);
 +                      return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_lock_node_reused));
 +              }
 +
 +              /* avoid atomic set bit if it's not needed: */
 +              if (!btree_node_accessed(b))
 +                      set_btree_node_accessed(b);
 +      }
 +
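 +      /*
 +       * If the node is still being read in from disk, drop locks and wait for
 +       * the read to complete before using it; retry from the top if we can't
 +       * relock afterwards:
 +       */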
 +      if (unlikely(btree_node_read_in_flight(b))) {
 +              u32 seq = six_lock_seq(&b->c.lock);
 +
 +              six_unlock_type(&b->c.lock, lock_type);
 +              bch2_trans_unlock(trans);
 +              need_relock = true;
 +
 +              bch2_btree_node_wait_on_read(b);
 +
 +              /*
 +               * should_be_locked is not set on this path yet, so we need to
 +               * relock it specifically:
 +               */
 +              if (!six_relock_type(&b->c.lock, lock_type, seq))
 +                      goto retry;
 +      }
 +
 +      if (unlikely(need_relock)) {
 +              ret = bch2_trans_relock(trans) ?:
 +                      bch2_btree_path_relock_intent(trans, path);
 +              if (ret) {
 +                      six_unlock_type(&b->c.lock, lock_type);
 +                      return ERR_PTR(ret);
 +              }
 +      }
 +
 +      prefetch(b->aux_data);
 +
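 +      /*
 +       * Prefetch the start of each bset's auxiliary search tree, so the first
 +       * lookups in this node don't stall on cache misses:
 +       */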
 +      for_each_bset(b, t) {
 +              void *p = (u64 *) b->aux_data + t->aux_data_offset;
 +
 +              prefetch(p + L1_CACHE_BYTES * 0);
 +              prefetch(p + L1_CACHE_BYTES * 1);
 +              prefetch(p + L1_CACHE_BYTES * 2);
 +      }
 +
 +      if (unlikely(btree_node_read_error(b))) {
 +              six_unlock_type(&b->c.lock, lock_type);
 +              return ERR_PTR(-EIO);
 +      }
 +
 +      EBUG_ON(b->c.btree_id != path->btree_id);
 +      EBUG_ON(BTREE_NODE_LEVEL(b->data) != level);
 +      btree_check_header(c, b);
 +
 +      return b;
 +}
 +
 +/**
 + * bch2_btree_node_get - find a btree node in the cache and lock it, reading it
 + * in from disk if necessary.
 + *
 + * @trans:    btree transaction object
 + * @path:     btree_path being traversed
 + * @k:                bkey pointing to the btree node (generally KEY_TYPE_btree_ptr_v2)
 + * @level:    level of btree node being looked up (0 == leaf node)
 + * @lock_type:        SIX_LOCK_read or SIX_LOCK_intent
 + * @trace_ip: ip of caller of btree iterator code (i.e. caller of bch2_btree_iter_peek())
 + *
 + * The btree node will have either a read or an intent lock held, depending
 + * on @lock_type.
 + *
 + * Returns: btree node or ERR_PTR()
 + */
 +struct btree *bch2_btree_node_get(struct btree_trans *trans, struct btree_path *path,
 +                                const struct bkey_i *k, unsigned level,
 +                                enum six_lock_type lock_type,
 +                                unsigned long trace_ip)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree *b;
 +      struct bset_tree *t;
 +      int ret;
 +
 +      EBUG_ON(level >= BTREE_MAX_DEPTH);
 +
 +      b = btree_node_mem_ptr(k);
 +
 +      /*
 +       * Check b->hash_val _before_ calling btree_node_lock() - this might not
 +       * be the node we want anymore, and trying to lock the wrong node could
 +       * cause an unnecessary transaction restart:
 +       */
 +      if (unlikely(!c->opts.btree_node_mem_ptr_optimization ||
 +                   !b ||
 +                   b->hash_val != btree_ptr_hash_val(k)))
 +              return __bch2_btree_node_get(trans, path, k, level, lock_type, trace_ip);
 +
 +      if (btree_node_read_locked(path, level + 1))
 +              btree_node_unlock(trans, path, level + 1);
 +
 +      ret = btree_node_lock(trans, path, &b->c, level, lock_type, trace_ip);
 +      if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +              return ERR_PTR(ret);
 +
 +      BUG_ON(ret);
 +
 +      if (unlikely(b->hash_val != btree_ptr_hash_val(k) ||
 +                   b->c.level != level ||
 +                   race_fault())) {
 +              six_unlock_type(&b->c.lock, lock_type);
 +              if (bch2_btree_node_relock(trans, path, level + 1))
 +                      return __bch2_btree_node_get(trans, path, k, level, lock_type, trace_ip);
 +
 +              trace_and_count(c, trans_restart_btree_node_reused, trans, trace_ip, path);
 +              return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_lock_node_reused));
 +      }
 +
 +      if (unlikely(btree_node_read_in_flight(b))) {
 +              six_unlock_type(&b->c.lock, lock_type);
 +              return __bch2_btree_node_get(trans, path, k, level, lock_type, trace_ip);
 +      }
 +
 +      prefetch(b->aux_data);
 +
 +      for_each_bset(b, t) {
 +              void *p = (u64 *) b->aux_data + t->aux_data_offset;
 +
 +              prefetch(p + L1_CACHE_BYTES * 0);
 +              prefetch(p + L1_CACHE_BYTES * 1);
 +              prefetch(p + L1_CACHE_BYTES * 2);
 +      }
 +
 +      /* avoid atomic set bit if it's not needed: */
 +      if (!btree_node_accessed(b))
 +              set_btree_node_accessed(b);
 +
 +      if (unlikely(btree_node_read_error(b))) {
 +              six_unlock_type(&b->c.lock, lock_type);
 +              return ERR_PTR(-EIO);
 +      }
 +
 +      EBUG_ON(b->c.btree_id != path->btree_id);
 +      EBUG_ON(BTREE_NODE_LEVEL(b->data) != level);
 +      btree_check_header(c, b);
 +
 +      return b;
 +}
 +
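 +/*
 + * Like bch2_btree_node_get(), but for use when we don't have a btree_path:
 + * takes a read lock on the node directly, via the nopath locking helpers.
 + */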
 +struct btree *bch2_btree_node_get_noiter(struct btree_trans *trans,
 +                                       const struct bkey_i *k,
 +                                       enum btree_id btree_id,
 +                                       unsigned level,
 +                                       bool nofill)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +      struct bset_tree *t;
 +      int ret;
 +
 +      EBUG_ON(level >= BTREE_MAX_DEPTH);
 +
 +      if (c->opts.btree_node_mem_ptr_optimization) {
 +              b = btree_node_mem_ptr(k);
 +              if (b)
 +                      goto lock_node;
 +      }
 +retry:
 +      b = btree_cache_find(bc, k);
 +      if (unlikely(!b)) {
 +              if (nofill)
 +                      goto out;
 +
 +              b = bch2_btree_node_fill(trans, NULL, k, btree_id,
 +                                       level, SIX_LOCK_read, true);
 +
 +              /* We raced and found the btree node in the cache */
 +              if (!b)
 +                      goto retry;
 +
 +              if (IS_ERR(b) &&
 +                  !bch2_btree_cache_cannibalize_lock(c, NULL))
 +                      goto retry;
 +
 +              if (IS_ERR(b))
 +                      goto out;
 +      } else {
 +lock_node:
 +              ret = btree_node_lock_nopath(trans, &b->c, SIX_LOCK_read, _THIS_IP_);
 +              if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +                      return ERR_PTR(ret);
 +
 +              BUG_ON(ret);
 +
 +              if (unlikely(b->hash_val != btree_ptr_hash_val(k) ||
 +                           b->c.btree_id != btree_id ||
 +                           b->c.level != level)) {
 +                      six_unlock_read(&b->c.lock);
 +                      goto retry;
 +              }
 +      }
 +
 +      /* XXX: waiting on IO with btree locks held: */
 +      __bch2_btree_node_wait_on_read(b);
 +
 +      prefetch(b->aux_data);
 +
 +      for_each_bset(b, t) {
 +              void *p = (u64 *) b->aux_data + t->aux_data_offset;
 +
 +              prefetch(p + L1_CACHE_BYTES * 0);
 +              prefetch(p + L1_CACHE_BYTES * 1);
 +              prefetch(p + L1_CACHE_BYTES * 2);
 +      }
 +
 +      /* avoid atomic set bit if it's not needed: */
 +      if (!btree_node_accessed(b))
 +              set_btree_node_accessed(b);
 +
 +      if (unlikely(btree_node_read_error(b))) {
 +              six_unlock_read(&b->c.lock);
 +              b = ERR_PTR(-EIO);
 +              goto out;
 +      }
 +
 +      EBUG_ON(b->c.btree_id != btree_id);
 +      EBUG_ON(BTREE_NODE_LEVEL(b->data) != level);
 +      btree_check_header(c, b);
 +out:
 +      bch2_btree_cache_cannibalize_unlock(c);
 +      return b;
 +}
 +
 +int bch2_btree_node_prefetch(struct btree_trans *trans,
 +                           struct btree_path *path,
 +                           const struct bkey_i *k,
 +                           enum btree_id btree_id, unsigned level)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +
 +      BUG_ON(trans && !btree_node_locked(path, level + 1));
 +      BUG_ON(level >= BTREE_MAX_DEPTH);
 +
 +      b = btree_cache_find(bc, k);
 +      if (b)
 +              return 0;
 +
 +      b = bch2_btree_node_fill(trans, path, k, btree_id,
 +                               level, SIX_LOCK_read, false);
 +      return PTR_ERR_OR_ZERO(b);
 +}
 +
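 +/*
 + * Evict a btree node from the in-memory cache: wait for in flight IO, write
 + * it out if dirty, then free its data and remove it from the hash table.
 + */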
 +void bch2_btree_node_evict(struct btree_trans *trans, const struct bkey_i *k)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_cache *bc = &c->btree_cache;
 +      struct btree *b;
 +
 +      b = btree_cache_find(bc, k);
 +      if (!b)
 +              return;
 +wait_on_io:
 +      /* not allowed to wait on io with btree locks held: */
 +
 +      /*
 +       * XXX we're called from btree_gc which will be holding other btree
 +       * nodes locked
 +       */
 +      __bch2_btree_node_wait_on_read(b);
 +      __bch2_btree_node_wait_on_write(b);
 +
 +      btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_intent);
 +      btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_write);
 +
 +      if (btree_node_dirty(b)) {
 +              __bch2_btree_node_write(c, b, BTREE_WRITE_cache_reclaim);
 +              six_unlock_write(&b->c.lock);
 +              six_unlock_intent(&b->c.lock);
 +              goto wait_on_io;
 +      }
 +
 +      BUG_ON(btree_node_dirty(b));
 +
 +      mutex_lock(&bc->lock);
 +      btree_node_data_free(c, b);
 +      bch2_btree_node_hash_remove(bc, b);
 +      mutex_unlock(&bc->lock);
 +
 +      six_unlock_write(&b->c.lock);
 +      six_unlock_intent(&b->c.lock);
 +}
 +
 +void bch2_btree_node_to_text(struct printbuf *out, struct bch_fs *c,
 +                           const struct btree *b)
 +{
 +      struct bset_stats stats;
 +
 +      memset(&stats, 0, sizeof(stats));
 +
 +      bch2_btree_keys_stats(b, &stats);
 +
 +      prt_printf(out, "l %u ", b->c.level);
 +      bch2_bpos_to_text(out, b->data->min_key);
 +      prt_printf(out, " - ");
 +      bch2_bpos_to_text(out, b->data->max_key);
 +      prt_printf(out, ":\n"
 +             "    ptrs: ");
 +      bch2_val_to_text(out, c, bkey_i_to_s_c(&b->key));
 +      prt_newline(out);
 +
 +      prt_printf(out,
 +             "    format: ");
 +      bch2_bkey_format_to_text(out, &b->format);
 +
 +      prt_printf(out,
 +             "    unpack fn len: %u\n"
 +             "    bytes used %zu/%zu (%zu%% full)\n"
 +             "    sib u64s: %u, %u (merge threshold %u)\n"
 +             "    nr packed keys %u\n"
 +             "    nr unpacked keys %u\n"
 +             "    floats %zu\n"
 +             "    failed unpacked %zu\n",
 +             b->unpack_fn_len,
 +             b->nr.live_u64s * sizeof(u64),
 +             btree_bytes(c) - sizeof(struct btree_node),
 +             b->nr.live_u64s * 100 / btree_max_u64s(c),
 +             b->sib_u64s[0],
 +             b->sib_u64s[1],
 +             c->btree_foreground_merge_threshold,
 +             b->nr.packed_keys,
 +             b->nr.unpacked_keys,
 +             stats.floats,
 +             stats.failed);
 +}
 +
 +void bch2_btree_cache_to_text(struct printbuf *out, const struct bch_fs *c)
 +{
 +      prt_printf(out, "nr nodes:\t\t%u\n", c->btree_cache.used);
 +      prt_printf(out, "nr dirty:\t\t%u\n", atomic_read(&c->btree_cache.dirty));
 +      prt_printf(out, "cannibalize lock:\t%p\n", c->btree_cache.alloc_lock);
 +}
index 29a0b566a4fe9ebdfd07666249c0eed23d427bea,0000000000000000000000000000000000000000..f9a5e38a085bbfb280fbe439ca2a6b1f0ba2f1af
mode 100644,000000..100644
--- /dev/null
@@@ -1,1072 -1,0 +1,1075 @@@
-       struct bch_fs *c = container_of(shrink, struct bch_fs,
-                                       btree_key_cache.shrink);
 +// SPDX-License-Identifier: GPL-2.0
 +
 +#include "bcachefs.h"
 +#include "btree_cache.h"
 +#include "btree_iter.h"
 +#include "btree_key_cache.h"
 +#include "btree_locking.h"
 +#include "btree_update.h"
 +#include "errcode.h"
 +#include "error.h"
 +#include "journal.h"
 +#include "journal_reclaim.h"
 +#include "trace.h"
 +
 +#include <linux/sched/mm.h>
 +
 +static inline bool btree_uses_pcpu_readers(enum btree_id id)
 +{
 +      return id == BTREE_ID_subvolumes;
 +}
 +
 +static struct kmem_cache *bch2_key_cache;
 +
 +static int bch2_btree_key_cache_cmp_fn(struct rhashtable_compare_arg *arg,
 +                                     const void *obj)
 +{
 +      const struct bkey_cached *ck = obj;
 +      const struct bkey_cached_key *key = arg->key;
 +
 +      return ck->key.btree_id != key->btree_id ||
 +              !bpos_eq(ck->key.pos, key->pos);
 +}
 +
 +static const struct rhashtable_params bch2_btree_key_cache_params = {
 +      .head_offset    = offsetof(struct bkey_cached, hash),
 +      .key_offset     = offsetof(struct bkey_cached, key),
 +      .key_len        = sizeof(struct bkey_cached_key),
 +      .obj_cmpfn      = bch2_btree_key_cache_cmp_fn,
 +};
 +
 +__flatten
 +inline struct bkey_cached *
 +bch2_btree_key_cache_find(struct bch_fs *c, enum btree_id btree_id, struct bpos pos)
 +{
 +      struct bkey_cached_key key = {
 +              .btree_id       = btree_id,
 +              .pos            = pos,
 +      };
 +
 +      return rhashtable_lookup_fast(&c->btree_key_cache.table, &key,
 +                                    bch2_btree_key_cache_params);
 +}
 +
 +static bool bkey_cached_lock_for_evict(struct bkey_cached *ck)
 +{
 +      if (!six_trylock_intent(&ck->c.lock))
 +              return false;
 +
 +      if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +              six_unlock_intent(&ck->c.lock);
 +              return false;
 +      }
 +
 +      if (!six_trylock_write(&ck->c.lock)) {
 +              six_unlock_intent(&ck->c.lock);
 +              return false;
 +      }
 +
 +      return true;
 +}
 +
 +static void bkey_cached_evict(struct btree_key_cache *c,
 +                            struct bkey_cached *ck)
 +{
 +      BUG_ON(rhashtable_remove_fast(&c->table, &ck->hash,
 +                                    bch2_btree_key_cache_params));
 +      memset(&ck->key, ~0, sizeof(ck->key));
 +
 +      atomic_long_dec(&c->nr_keys);
 +}
 +
 +static void bkey_cached_free(struct btree_key_cache *bc,
 +                           struct bkey_cached *ck)
 +{
 +      struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
 +
 +      BUG_ON(test_bit(BKEY_CACHED_DIRTY, &ck->flags));
 +
 +      ck->btree_trans_barrier_seq =
 +              start_poll_synchronize_srcu(&c->btree_trans_barrier);
 +
 +      if (ck->c.lock.readers)
 +              list_move_tail(&ck->list, &bc->freed_pcpu);
 +      else
 +              list_move_tail(&ck->list, &bc->freed_nonpcpu);
 +      atomic_long_inc(&bc->nr_freed);
 +
 +      kfree(ck->k);
 +      ck->k           = NULL;
 +      ck->u64s        = 0;
 +
 +      six_unlock_write(&ck->c.lock);
 +      six_unlock_intent(&ck->c.lock);
 +}
 +
 +#ifdef __KERNEL__
 +static void __bkey_cached_move_to_freelist_ordered(struct btree_key_cache *bc,
 +                                                 struct bkey_cached *ck)
 +{
 +      struct bkey_cached *pos;
 +
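 +      /*
 +       * Keep the freelist ordered by SRCU barrier sequence, so the shrinker
 +       * can stop scanning at the first entry that's still too new to free:
 +       */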
 +      list_for_each_entry_reverse(pos, &bc->freed_nonpcpu, list) {
 +              if (ULONG_CMP_GE(ck->btree_trans_barrier_seq,
 +                               pos->btree_trans_barrier_seq)) {
 +                      list_move(&ck->list, &pos->list);
 +                      return;
 +              }
 +      }
 +
 +      list_move(&ck->list, &bc->freed_nonpcpu);
 +}
 +#endif
 +
 +static void bkey_cached_move_to_freelist(struct btree_key_cache *bc,
 +                                       struct bkey_cached *ck)
 +{
 +      BUG_ON(test_bit(BKEY_CACHED_DIRTY, &ck->flags));
 +
 +      if (!ck->c.lock.readers) {
 +#ifdef __KERNEL__
 +              struct btree_key_cache_freelist *f;
 +              bool freed = false;
 +
 +              preempt_disable();
 +              f = this_cpu_ptr(bc->pcpu_freed);
 +
 +              if (f->nr < ARRAY_SIZE(f->objs)) {
 +                      f->objs[f->nr++] = ck;
 +                      freed = true;
 +              }
 +              preempt_enable();
 +
 +              if (!freed) {
 +                      mutex_lock(&bc->lock);
 +                      preempt_disable();
 +                      f = this_cpu_ptr(bc->pcpu_freed);
 +
 +                      while (f->nr > ARRAY_SIZE(f->objs) / 2) {
 +                              struct bkey_cached *ck2 = f->objs[--f->nr];
 +
 +                              __bkey_cached_move_to_freelist_ordered(bc, ck2);
 +                      }
 +                      preempt_enable();
 +
 +                      __bkey_cached_move_to_freelist_ordered(bc, ck);
 +                      mutex_unlock(&bc->lock);
 +              }
 +#else
 +              mutex_lock(&bc->lock);
 +              list_move_tail(&ck->list, &bc->freed_nonpcpu);
 +              mutex_unlock(&bc->lock);
 +#endif
 +      } else {
 +              mutex_lock(&bc->lock);
 +              list_move_tail(&ck->list, &bc->freed_pcpu);
 +              mutex_unlock(&bc->lock);
 +      }
 +}
 +
 +static void bkey_cached_free_fast(struct btree_key_cache *bc,
 +                                struct bkey_cached *ck)
 +{
 +      struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
 +
 +      ck->btree_trans_barrier_seq =
 +              start_poll_synchronize_srcu(&c->btree_trans_barrier);
 +
 +      list_del_init(&ck->list);
 +      atomic_long_inc(&bc->nr_freed);
 +
 +      kfree(ck->k);
 +      ck->k           = NULL;
 +      ck->u64s        = 0;
 +
 +      bkey_cached_move_to_freelist(bc, ck);
 +
 +      six_unlock_write(&ck->c.lock);
 +      six_unlock_intent(&ck->c.lock);
 +}
 +
 +static struct bkey_cached *
 +bkey_cached_alloc(struct btree_trans *trans, struct btree_path *path,
 +                bool *was_new)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_key_cache *bc = &c->btree_key_cache;
 +      struct bkey_cached *ck = NULL;
 +      bool pcpu_readers = btree_uses_pcpu_readers(path->btree_id);
 +      int ret;
 +
 +      if (!pcpu_readers) {
 +#ifdef __KERNEL__
 +              struct btree_key_cache_freelist *f;
 +
 +              preempt_disable();
 +              f = this_cpu_ptr(bc->pcpu_freed);
 +              if (f->nr)
 +                      ck = f->objs[--f->nr];
 +              preempt_enable();
 +
 +              if (!ck) {
 +                      mutex_lock(&bc->lock);
 +                      preempt_disable();
 +                      f = this_cpu_ptr(bc->pcpu_freed);
 +
 +                      while (!list_empty(&bc->freed_nonpcpu) &&
 +                             f->nr < ARRAY_SIZE(f->objs) / 2) {
 +                              ck = list_last_entry(&bc->freed_nonpcpu, struct bkey_cached, list);
 +                              list_del_init(&ck->list);
 +                              f->objs[f->nr++] = ck;
 +                      }
 +
 +                      ck = f->nr ? f->objs[--f->nr] : NULL;
 +                      preempt_enable();
 +                      mutex_unlock(&bc->lock);
 +              }
 +#else
 +              mutex_lock(&bc->lock);
 +              if (!list_empty(&bc->freed_nonpcpu)) {
 +                      ck = list_last_entry(&bc->freed_nonpcpu, struct bkey_cached, list);
 +                      list_del_init(&ck->list);
 +              }
 +              mutex_unlock(&bc->lock);
 +#endif
 +      } else {
 +              mutex_lock(&bc->lock);
 +              if (!list_empty(&bc->freed_pcpu)) {
 +                      ck = list_last_entry(&bc->freed_pcpu, struct bkey_cached, list);
 +                      list_del_init(&ck->list);
 +              }
 +              mutex_unlock(&bc->lock);
 +      }
 +
 +      if (ck) {
 +              ret = btree_node_lock_nopath(trans, &ck->c, SIX_LOCK_intent, _THIS_IP_);
 +              if (unlikely(ret)) {
 +                      bkey_cached_move_to_freelist(bc, ck);
 +                      return ERR_PTR(ret);
 +              }
 +
 +              path->l[0].b = (void *) ck;
 +              path->l[0].lock_seq = six_lock_seq(&ck->c.lock);
 +              mark_btree_node_locked(trans, path, 0, BTREE_NODE_INTENT_LOCKED);
 +
 +              ret = bch2_btree_node_lock_write(trans, path, &ck->c);
 +              if (unlikely(ret)) {
 +                      btree_node_unlock(trans, path, 0);
 +                      bkey_cached_move_to_freelist(bc, ck);
 +                      return ERR_PTR(ret);
 +              }
 +
 +              return ck;
 +      }
 +
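 +      /*
 +       * Nothing on the freelists: allocate a new bkey_cached, dropping btree
 +       * locks if the allocation needs to block:
 +       */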
 +      ck = allocate_dropping_locks(trans, ret,
 +                      kmem_cache_zalloc(bch2_key_cache, _gfp));
 +      if (ret) {
 +              kmem_cache_free(bch2_key_cache, ck);
 +              return ERR_PTR(ret);
 +      }
 +
 +      if (!ck)
 +              return NULL;
 +
 +      INIT_LIST_HEAD(&ck->list);
 +      bch2_btree_lock_init(&ck->c, pcpu_readers ? SIX_LOCK_INIT_PCPU : 0);
 +
 +      ck->c.cached = true;
 +      BUG_ON(!six_trylock_intent(&ck->c.lock));
 +      BUG_ON(!six_trylock_write(&ck->c.lock));
 +      *was_new = true;
 +      return ck;
 +}
 +
 +static struct bkey_cached *
 +bkey_cached_reuse(struct btree_key_cache *c)
 +{
 +      struct bucket_table *tbl;
 +      struct rhash_head *pos;
 +      struct bkey_cached *ck;
 +      unsigned i;
 +
 +      mutex_lock(&c->lock);
 +      rcu_read_lock();
 +      tbl = rht_dereference_rcu(c->table.tbl, &c->table);
 +      for (i = 0; i < tbl->size; i++)
 +              rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {
 +                      if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags) &&
 +                          bkey_cached_lock_for_evict(ck)) {
 +                              bkey_cached_evict(c, ck);
 +                              goto out;
 +                      }
 +              }
 +      ck = NULL;
 +out:
 +      rcu_read_unlock();
 +      mutex_unlock(&c->lock);
 +      return ck;
 +}
 +
 +static struct bkey_cached *
 +btree_key_cache_create(struct btree_trans *trans, struct btree_path *path)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct btree_key_cache *bc = &c->btree_key_cache;
 +      struct bkey_cached *ck;
 +      bool was_new = false;
 +
 +      ck = bkey_cached_alloc(trans, path, &was_new);
 +      if (IS_ERR(ck))
 +              return ck;
 +
 +      if (unlikely(!ck)) {
 +              ck = bkey_cached_reuse(bc);
 +              if (unlikely(!ck)) {
 +                      bch_err(c, "error allocating memory for key cache item, btree %s",
 +                              bch2_btree_ids[path->btree_id]);
 +                      return ERR_PTR(-BCH_ERR_ENOMEM_btree_key_cache_create);
 +              }
 +
 +              mark_btree_node_locked(trans, path, 0, BTREE_NODE_INTENT_LOCKED);
 +      }
 +
 +      ck->c.level             = 0;
 +      ck->c.btree_id          = path->btree_id;
 +      ck->key.btree_id        = path->btree_id;
 +      ck->key.pos             = path->pos;
 +      ck->valid               = false;
 +      ck->flags               = 1U << BKEY_CACHED_ACCESSED;
 +
 +      if (unlikely(rhashtable_lookup_insert_fast(&bc->table,
 +                                        &ck->hash,
 +                                        bch2_btree_key_cache_params))) {
 +              /* We raced with another fill: */
 +
 +              if (likely(was_new)) {
 +                      six_unlock_write(&ck->c.lock);
 +                      six_unlock_intent(&ck->c.lock);
 +                      kfree(ck);
 +              } else {
 +                      bkey_cached_free_fast(bc, ck);
 +              }
 +
 +              mark_btree_node_locked(trans, path, 0, BTREE_NODE_UNLOCKED);
 +              return NULL;
 +      }
 +
 +      atomic_long_inc(&bc->nr_keys);
 +
 +      six_unlock_write(&ck->c.lock);
 +
 +      return ck;
 +}
 +
 +static int btree_key_cache_fill(struct btree_trans *trans,
 +                              struct btree_path *ck_path,
 +                              struct bkey_cached *ck)
 +{
 +      struct btree_iter iter;
 +      struct bkey_s_c k;
 +      unsigned new_u64s = 0;
 +      struct bkey_i *new_k = NULL;
 +      int ret;
 +
 +      k = bch2_bkey_get_iter(trans, &iter, ck->key.btree_id, ck->key.pos,
 +                             BTREE_ITER_KEY_CACHE_FILL|
 +                             BTREE_ITER_CACHED_NOFILL);
 +      ret = bkey_err(k);
 +      if (ret)
 +              goto err;
 +
 +      if (!bch2_btree_node_relock(trans, ck_path, 0)) {
 +              trace_and_count(trans->c, trans_restart_relock_key_cache_fill, trans, _THIS_IP_, ck_path);
 +              ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_fill);
 +              goto err;
 +      }
 +
 +      /*
 +       * bch2_varint_decode can read past the end of the buffer by at
 +       * most 7 bytes (it won't be used):
 +       */
 +      new_u64s = k.k->u64s + 1;
 +
 +      /*
 +       * Allocate some extra space so that the transaction commit path is less
 +       * likely to have to reallocate, since that requires a transaction
 +       * restart:
 +       */
 +      new_u64s = min(256U, (new_u64s * 3) / 2);
 +
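 +      /*
 +       * If the existing buffer is too small, reallocate: first with
 +       * GFP_NOWAIT so we can keep btree locks held, then with GFP_KERNEL
 +       * after unlocking the transaction:
 +       */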
 +      if (new_u64s > ck->u64s) {
 +              new_u64s = roundup_pow_of_two(new_u64s);
 +              new_k = kmalloc(new_u64s * sizeof(u64), GFP_NOWAIT|__GFP_NOWARN);
 +              if (!new_k) {
 +                      bch2_trans_unlock(trans);
 +
 +                      new_k = kmalloc(new_u64s * sizeof(u64), GFP_KERNEL);
 +                      if (!new_k) {
 +                              bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u",
 +                                      bch2_btree_ids[ck->key.btree_id], new_u64s);
 +                              ret = -BCH_ERR_ENOMEM_btree_key_cache_fill;
 +                              goto err;
 +                      }
 +
 +                      if (!bch2_btree_node_relock(trans, ck_path, 0)) {
 +                              kfree(new_k);
 +                              trace_and_count(trans->c, trans_restart_relock_key_cache_fill, trans, _THIS_IP_, ck_path);
 +                              ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_fill);
 +                              goto err;
 +                      }
 +
 +                      ret = bch2_trans_relock(trans);
 +                      if (ret) {
 +                              kfree(new_k);
 +                              goto err;
 +                      }
 +              }
 +      }
 +
 +      ret = bch2_btree_node_lock_write(trans, ck_path, &ck_path->l[0].b->c);
 +      if (ret) {
 +              kfree(new_k);
 +              goto err;
 +      }
 +
 +      if (new_k) {
 +              kfree(ck->k);
 +              ck->u64s = new_u64s;
 +              ck->k = new_k;
 +      }
 +
 +      bkey_reassemble(ck->k, k);
 +      ck->valid = true;
 +      bch2_btree_node_unlock_write(trans, ck_path, ck_path->l[0].b);
 +
 +      /* We're not likely to need this iterator again: */
 +      set_btree_iter_dontneed(&iter);
 +err:
 +      bch2_trans_iter_exit(trans, &iter);
 +      return ret;
 +}
 +
 +static noinline int
 +bch2_btree_path_traverse_cached_slowpath(struct btree_trans *trans, struct btree_path *path,
 +                                       unsigned flags)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct bkey_cached *ck;
 +      int ret = 0;
 +
 +      BUG_ON(path->level);
 +
 +      path->l[1].b = NULL;
 +
 +      if (bch2_btree_node_relock_notrace(trans, path, 0)) {
 +              ck = (void *) path->l[0].b;
 +              goto fill;
 +      }
 +retry:
 +      ck = bch2_btree_key_cache_find(c, path->btree_id, path->pos);
 +      if (!ck) {
 +              ck = btree_key_cache_create(trans, path);
 +              ret = PTR_ERR_OR_ZERO(ck);
 +              if (ret)
 +                      goto err;
 +              if (!ck)
 +                      goto retry;
 +
 +              mark_btree_node_locked(trans, path, 0, BTREE_NODE_INTENT_LOCKED);
 +              path->locks_want = 1;
 +      } else {
 +              enum six_lock_type lock_want = __btree_lock_want(path, 0);
 +
 +              ret = btree_node_lock(trans, path, (void *) ck, 0,
 +                                    lock_want, _THIS_IP_);
 +              if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +                      goto err;
 +
 +              BUG_ON(ret);
 +
 +              if (ck->key.btree_id != path->btree_id ||
 +                  !bpos_eq(ck->key.pos, path->pos)) {
 +                      six_unlock_type(&ck->c.lock, lock_want);
 +                      goto retry;
 +              }
 +
 +              mark_btree_node_locked(trans, path, 0,
 +                                     (enum btree_node_locked_type) lock_want);
 +      }
 +
 +      path->l[0].lock_seq     = six_lock_seq(&ck->c.lock);
 +      path->l[0].b            = (void *) ck;
 +fill:
 +      path->uptodate = BTREE_ITER_UPTODATE;
 +
 +      if (!ck->valid && !(flags & BTREE_ITER_CACHED_NOFILL)) {
 +              /*
 +               * Using the underscore version because we haven't set
 +               * path->uptodate yet:
 +               */
 +              if (!path->locks_want &&
 +                  !__bch2_btree_path_upgrade(trans, path, 1)) {
 +                      trace_and_count(trans->c, trans_restart_key_cache_upgrade, trans, _THIS_IP_);
 +                      ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_upgrade);
 +                      goto err;
 +              }
 +
 +              ret = btree_key_cache_fill(trans, path, ck);
 +              if (ret)
 +                      goto err;
 +
 +              ret = bch2_btree_path_relock(trans, path, _THIS_IP_);
 +              if (ret)
 +                      goto err;
 +
 +              path->uptodate = BTREE_ITER_UPTODATE;
 +      }
 +
 +      if (!test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
 +              set_bit(BKEY_CACHED_ACCESSED, &ck->flags);
 +
 +      BUG_ON(btree_node_locked_type(path, 0) != btree_lock_want(path, 0));
 +      BUG_ON(path->uptodate);
 +
 +      return ret;
 +err:
 +      path->uptodate = BTREE_ITER_NEED_TRAVERSE;
 +      if (!bch2_err_matches(ret, BCH_ERR_transaction_restart)) {
 +              btree_node_unlock(trans, path, 0);
 +              path->l[0].b = ERR_PTR(ret);
 +      }
 +      return ret;
 +}
 +
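 +/*
 + * Fast path for traversing to a key cache entry that already exists and is
 + * valid: anything that needs to allocate or fill an entry goes through the
 + * slowpath above.
 + */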
 +int bch2_btree_path_traverse_cached(struct btree_trans *trans, struct btree_path *path,
 +                                  unsigned flags)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct bkey_cached *ck;
 +      int ret = 0;
 +
 +      EBUG_ON(path->level);
 +
 +      path->l[1].b = NULL;
 +
 +      if (bch2_btree_node_relock_notrace(trans, path, 0)) {
 +              ck = (void *) path->l[0].b;
 +              goto fill;
 +      }
 +retry:
 +      ck = bch2_btree_key_cache_find(c, path->btree_id, path->pos);
 +      if (!ck) {
 +              return bch2_btree_path_traverse_cached_slowpath(trans, path, flags);
 +      } else {
 +              enum six_lock_type lock_want = __btree_lock_want(path, 0);
 +
 +              ret = btree_node_lock(trans, path, (void *) ck, 0,
 +                                    lock_want, _THIS_IP_);
 +              EBUG_ON(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart));
 +
 +              if (ret)
 +                      return ret;
 +
 +              if (ck->key.btree_id != path->btree_id ||
 +                  !bpos_eq(ck->key.pos, path->pos)) {
 +                      six_unlock_type(&ck->c.lock, lock_want);
 +                      goto retry;
 +              }
 +
 +              mark_btree_node_locked(trans, path, 0,
 +                                     (enum btree_node_locked_type) lock_want);
 +      }
 +
 +      path->l[0].lock_seq     = six_lock_seq(&ck->c.lock);
 +      path->l[0].b            = (void *) ck;
 +fill:
 +      if (!ck->valid)
 +              return bch2_btree_path_traverse_cached_slowpath(trans, path, flags);
 +
 +      if (!test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
 +              set_bit(BKEY_CACHED_ACCESSED, &ck->flags);
 +
 +      path->uptodate = BTREE_ITER_UPTODATE;
 +      EBUG_ON(!ck->valid);
 +      EBUG_ON(btree_node_locked_type(path, 0) != btree_lock_want(path, 0));
 +
 +      return ret;
 +}
 +
 +static int btree_key_cache_flush_pos(struct btree_trans *trans,
 +                                   struct bkey_cached_key key,
 +                                   u64 journal_seq,
 +                                   unsigned commit_flags,
 +                                   bool evict)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct journal *j = &c->journal;
 +      struct btree_iter c_iter, b_iter;
 +      struct bkey_cached *ck = NULL;
 +      int ret;
 +
 +      bch2_trans_iter_init(trans, &b_iter, key.btree_id, key.pos,
 +                           BTREE_ITER_SLOTS|
 +                           BTREE_ITER_INTENT|
 +                           BTREE_ITER_ALL_SNAPSHOTS);
 +      bch2_trans_iter_init(trans, &c_iter, key.btree_id, key.pos,
 +                           BTREE_ITER_CACHED|
 +                           BTREE_ITER_INTENT);
 +      b_iter.flags &= ~BTREE_ITER_WITH_KEY_CACHE;
 +
 +      ret = bch2_btree_iter_traverse(&c_iter);
 +      if (ret)
 +              goto out;
 +
 +      ck = (void *) c_iter.path->l[0].b;
 +      if (!ck)
 +              goto out;
 +
 +      if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +              if (evict)
 +                      goto evict;
 +              goto out;
 +      }
 +
 +      BUG_ON(!ck->valid);
 +
 +      if (journal_seq && ck->journal.seq != journal_seq)
 +              goto out;
 +
 +      /*
 +       * Since journal reclaim depends on us making progress here, and the
 +       * allocator/copygc depend on journal reclaim making progress, we need
 +       * to be using alloc reserves:
 +       */
 +      ret   = bch2_btree_iter_traverse(&b_iter) ?:
 +              bch2_trans_update(trans, &b_iter, ck->k,
 +                                BTREE_UPDATE_KEY_CACHE_RECLAIM|
 +                                BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE|
 +                                BTREE_TRIGGER_NORUN) ?:
 +              bch2_trans_commit(trans, NULL, NULL,
 +                                BTREE_INSERT_NOCHECK_RW|
 +                                BTREE_INSERT_NOFAIL|
 +                                (ck->journal.seq == journal_last_seq(j)
 +                                 ? BCH_WATERMARK_reclaim
 +                                 : 0)|
 +                                commit_flags);
 +
 +      bch2_fs_fatal_err_on(ret &&
 +                           !bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
 +                           !bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
 +                           !bch2_journal_error(j), c,
 +                           "error flushing key cache: %s", bch2_err_str(ret));
 +      if (ret)
 +              goto out;
 +
 +      bch2_journal_pin_drop(j, &ck->journal);
 +      bch2_journal_preres_put(j, &ck->res);
 +
 +      BUG_ON(!btree_node_locked(c_iter.path, 0));
 +
 +      if (!evict) {
 +              if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +                      clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
 +                      atomic_long_dec(&c->btree_key_cache.nr_dirty);
 +              }
 +      } else {
 +              struct btree_path *path2;
 +evict:
 +              trans_for_each_path(trans, path2)
 +                      if (path2 != c_iter.path)
 +                              __bch2_btree_path_unlock(trans, path2);
 +
 +              bch2_btree_node_lock_write_nofail(trans, c_iter.path, &ck->c);
 +
 +              if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +                      clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
 +                      atomic_long_dec(&c->btree_key_cache.nr_dirty);
 +              }
 +
 +              mark_btree_node_locked_noreset(c_iter.path, 0, BTREE_NODE_UNLOCKED);
 +              bkey_cached_evict(&c->btree_key_cache, ck);
 +              bkey_cached_free_fast(&c->btree_key_cache, ck);
 +      }
 +out:
 +      bch2_trans_iter_exit(trans, &b_iter);
 +      bch2_trans_iter_exit(trans, &c_iter);
 +      return ret;
 +}
 +
 +int bch2_btree_key_cache_journal_flush(struct journal *j,
 +                              struct journal_entry_pin *pin, u64 seq)
 +{
 +      struct bch_fs *c = container_of(j, struct bch_fs, journal);
 +      struct bkey_cached *ck =
 +              container_of(pin, struct bkey_cached, journal);
 +      struct bkey_cached_key key;
 +      struct btree_trans *trans = bch2_trans_get(c);
 +      int srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
 +      int ret = 0;
 +
 +      btree_node_lock_nopath_nofail(trans, &ck->c, SIX_LOCK_read);
 +      key = ck->key;
 +
 +      if (ck->journal.seq != seq ||
 +          !test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +              six_unlock_read(&ck->c.lock);
 +              goto unlock;
 +      }
 +
 +      if (ck->seq != seq) {
 +              bch2_journal_pin_update(&c->journal, ck->seq, &ck->journal,
 +                                      bch2_btree_key_cache_journal_flush);
 +              six_unlock_read(&ck->c.lock);
 +              goto unlock;
 +      }
 +      six_unlock_read(&ck->c.lock);
 +
 +      ret = commit_do(trans, NULL, NULL, 0,
 +              btree_key_cache_flush_pos(trans, key, seq,
 +                              BTREE_INSERT_JOURNAL_RECLAIM, false));
 +unlock:
 +      srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
 +
 +      bch2_trans_put(trans);
 +      return ret;
 +}
 +
 +/*
 + * Flush and evict a key from the key cache:
 + */
 +int bch2_btree_key_cache_flush(struct btree_trans *trans,
 +                             enum btree_id id, struct bpos pos)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct bkey_cached_key key = { id, pos };
 +
 +      /* Fastpath - assume it won't be found: */
 +      if (!bch2_btree_key_cache_find(c, id, pos))
 +              return 0;
 +
 +      return btree_key_cache_flush_pos(trans, key, 0, 0, true);
 +}
 +
 +bool bch2_btree_insert_key_cached(struct btree_trans *trans,
 +                                unsigned flags,
 +                                struct btree_insert_entry *insert_entry)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
 +      struct bkey_i *insert = insert_entry->k;
 +      bool kick_reclaim = false;
 +
 +      BUG_ON(insert->k.u64s > ck->u64s);
 +
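 +      /*
 +       * Transfer journal pre-reservation from the transaction to the cached
 +       * key, sized to the difference between what this key needs and what
 +       * the cached key already holds:
 +       */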
 +      if (likely(!(flags & BTREE_INSERT_JOURNAL_REPLAY))) {
 +              int difference;
 +
 +              BUG_ON(jset_u64s(insert->k.u64s) > trans->journal_preres.u64s);
 +
 +              difference = jset_u64s(insert->k.u64s) - ck->res.u64s;
 +              if (difference > 0) {
 +                      trans->journal_preres.u64s      -= difference;
 +                      ck->res.u64s                    += difference;
 +              }
 +      }
 +
 +      bkey_copy(ck->k, insert);
 +      ck->valid = true;
 +
 +      if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +              EBUG_ON(test_bit(BCH_FS_CLEAN_SHUTDOWN, &c->flags));
 +              set_bit(BKEY_CACHED_DIRTY, &ck->flags);
 +              atomic_long_inc(&c->btree_key_cache.nr_dirty);
 +
 +              if (bch2_nr_btree_keys_need_flush(c))
 +                      kick_reclaim = true;
 +      }
 +
 +      /*
 +       * To minimize lock contention, we only add the journal pin here and
 +       * defer pin updates to the flush callback via ->seq. Be careful not to
 +       * update ->seq on nojournal commits because we don't want to update the
 +       * pin to a seq that doesn't include journal updates on disk. Otherwise
 +       * we risk losing the update after a crash.
 +       *
 +       * The only exception is if the pin is not active in the first place. We
 +       * have to add the pin because journal reclaim drives key cache
 +       * flushing. The flush callback will not proceed unless ->seq matches
 +       * the latest pin, so make sure it starts with a consistent value.
 +       */
 +      if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
 +          !journal_pin_active(&ck->journal)) {
 +              ck->seq = trans->journal_res.seq;
 +      }
 +      bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
 +                           &ck->journal, bch2_btree_key_cache_journal_flush);
 +
 +      if (kick_reclaim)
 +              journal_reclaim_kick(&c->journal);
 +      return true;
 +}
 +
 +void bch2_btree_key_cache_drop(struct btree_trans *trans,
 +                             struct btree_path *path)
 +{
 +      struct bch_fs *c = trans->c;
 +      struct bkey_cached *ck = (void *) path->l[0].b;
 +
 +      BUG_ON(!ck->valid);
 +
 +      /*
 +       * We just did an update to the btree, bypassing the key cache: the key
 +       * cache key is now stale and must be dropped, even if dirty:
 +       */
 +      if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
 +              clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
 +              atomic_long_dec(&c->btree_key_cache.nr_dirty);
 +              bch2_journal_pin_drop(&c->journal, &ck->journal);
 +      }
 +
 +      ck->valid = false;
 +}
 +
 +static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
 +                                         struct shrink_control *sc)
 +{
-       struct bch_fs *c = container_of(shrink, struct bch_fs,
-                                       btree_key_cache.shrink);
++      struct bch_fs *c = shrink->private_data;
 +      struct btree_key_cache *bc = &c->btree_key_cache;
 +      struct bucket_table *tbl;
 +      struct bkey_cached *ck, *t;
 +      size_t scanned = 0, freed = 0, nr = sc->nr_to_scan;
 +      unsigned start, flags;
 +      int srcu_idx;
 +
 +      mutex_lock(&bc->lock);
 +      srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
 +      flags = memalloc_nofs_save();
 +
 +      /*
 +       * Newest freed entries are at the end of the list - once we hit one
 +       * that's too new to be freed, we can bail out:
 +       */
 +      list_for_each_entry_safe(ck, t, &bc->freed_nonpcpu, list) {
 +              if (!poll_state_synchronize_srcu(&c->btree_trans_barrier,
 +                                               ck->btree_trans_barrier_seq))
 +                      break;
 +
 +              list_del(&ck->list);
 +              six_lock_exit(&ck->c.lock);
 +              kmem_cache_free(bch2_key_cache, ck);
 +              atomic_long_dec(&bc->nr_freed);
 +              scanned++;
 +              freed++;
 +      }
 +
 +      if (scanned >= nr)
 +              goto out;
 +
 +      list_for_each_entry_safe(ck, t, &bc->freed_pcpu, list) {
 +              if (!poll_state_synchronize_srcu(&c->btree_trans_barrier,
 +                                               ck->btree_trans_barrier_seq))
 +                      break;
 +
 +              list_del(&ck->list);
 +              six_lock_exit(&ck->c.lock);
 +              kmem_cache_free(bch2_key_cache, ck);
 +              atomic_long_dec(&bc->nr_freed);
 +              scanned++;
 +              freed++;
 +      }
 +
 +      if (scanned >= nr)
 +              goto out;
 +
 +      rcu_read_lock();
 +      tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
 +      if (bc->shrink_iter >= tbl->size)
 +              bc->shrink_iter = 0;
 +      start = bc->shrink_iter;
 +
 +      do {
 +              struct rhash_head *pos, *next;
 +
 +              pos = rht_ptr_rcu(rht_bucket(tbl, bc->shrink_iter));
 +
 +              while (!rht_is_a_nulls(pos)) {
 +                      next = rht_dereference_bucket_rcu(pos->next, tbl, bc->shrink_iter);
 +                      ck = container_of(pos, struct bkey_cached, hash);
 +
 +                      if (test_bit(BKEY_CACHED_DIRTY, &ck->flags))
 +                              goto next;
 +
 +                      if (test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
 +                              clear_bit(BKEY_CACHED_ACCESSED, &ck->flags);
 +                      else if (bkey_cached_lock_for_evict(ck)) {
 +                              bkey_cached_evict(bc, ck);
 +                              bkey_cached_free(bc, ck);
 +                      }
 +
 +                      scanned++;
 +                      if (scanned >= nr)
 +                              break;
 +next:
 +                      pos = next;
 +              }
 +
 +              bc->shrink_iter++;
 +              if (bc->shrink_iter >= tbl->size)
 +                      bc->shrink_iter = 0;
 +      } while (scanned < nr && bc->shrink_iter != start);
 +
 +      rcu_read_unlock();
 +out:
 +      memalloc_nofs_restore(flags);
 +      srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
 +      mutex_unlock(&bc->lock);
 +
 +      return freed;
 +}
 +
 +static unsigned long bch2_btree_key_cache_count(struct shrinker *shrink,
 +                                          struct shrink_control *sc)
 +{
-       unregister_shrinker(&bc->shrink);
++      struct bch_fs *c = shrink->private_data;
 +      struct btree_key_cache *bc = &c->btree_key_cache;
 +      long nr = atomic_long_read(&bc->nr_keys) -
 +              atomic_long_read(&bc->nr_dirty);
 +
 +      return max(0L, nr);
 +}
 +
 +void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
 +{
 +      struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
 +      struct bucket_table *tbl;
 +      struct bkey_cached *ck, *n;
 +      struct rhash_head *pos;
 +      LIST_HEAD(items);
 +      unsigned i;
 +#ifdef __KERNEL__
 +      int cpu;
 +#endif
 +
-       bc->shrink.seeks                = 0;
-       bc->shrink.count_objects        = bch2_btree_key_cache_count;
-       bc->shrink.scan_objects         = bch2_btree_key_cache_scan;
-       if (register_shrinker(&bc->shrink, "%s/btree_key_cache", c->name))
++      shrinker_free(bc->shrink);
 +
 +      mutex_lock(&bc->lock);
 +
 +      /*
 +       * The loop is needed to guard against racing with rehash:
 +       */
 +      while (atomic_long_read(&bc->nr_keys)) {
 +              rcu_read_lock();
 +              tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
 +              if (tbl)
 +                      for (i = 0; i < tbl->size; i++)
 +                              rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {
 +                                      bkey_cached_evict(bc, ck);
 +                                      list_add(&ck->list, &items);
 +                              }
 +              rcu_read_unlock();
 +      }
 +
 +#ifdef __KERNEL__
 +      for_each_possible_cpu(cpu) {
 +              struct btree_key_cache_freelist *f =
 +                      per_cpu_ptr(bc->pcpu_freed, cpu);
 +
 +              for (i = 0; i < f->nr; i++) {
 +                      ck = f->objs[i];
 +                      list_add(&ck->list, &items);
 +              }
 +      }
 +#endif
 +
 +      list_splice(&bc->freed_pcpu,    &items);
 +      list_splice(&bc->freed_nonpcpu, &items);
 +
 +      mutex_unlock(&bc->lock);
 +
 +      list_for_each_entry_safe(ck, n, &items, list) {
 +              cond_resched();
 +
 +              bch2_journal_pin_drop(&c->journal, &ck->journal);
 +              bch2_journal_preres_put(&c->journal, &ck->res);
 +
 +              list_del(&ck->list);
 +              kfree(ck->k);
 +              six_lock_exit(&ck->c.lock);
 +              kmem_cache_free(bch2_key_cache, ck);
 +      }
 +
 +      if (atomic_long_read(&bc->nr_dirty) &&
 +          !bch2_journal_error(&c->journal) &&
 +          test_bit(BCH_FS_WAS_RW, &c->flags))
 +              panic("btree key cache shutdown error: nr_dirty nonzero (%li)\n",
 +                    atomic_long_read(&bc->nr_dirty));
 +
 +      if (atomic_long_read(&bc->nr_keys))
 +              panic("btree key cache shutdown error: nr_keys nonzero (%li)\n",
 +                    atomic_long_read(&bc->nr_keys));
 +
 +      if (bc->table_init_done)
 +              rhashtable_destroy(&bc->table);
 +
 +      free_percpu(bc->pcpu_freed);
 +}
 +
 +void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
 +{
 +      mutex_init(&c->lock);
 +      INIT_LIST_HEAD(&c->freed_pcpu);
 +      INIT_LIST_HEAD(&c->freed_nonpcpu);
 +}
 +
 +int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
 +{
 +      struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
++      struct shrinker *shrink;
 +
 +#ifdef __KERNEL__
 +      bc->pcpu_freed = alloc_percpu(struct btree_key_cache_freelist);
 +      if (!bc->pcpu_freed)
 +              return -BCH_ERR_ENOMEM_fs_btree_cache_init;
 +#endif
 +
 +      if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
 +              return -BCH_ERR_ENOMEM_fs_btree_cache_init;
 +
 +      bc->table_init_done = true;
 +
++      shrink = shrinker_alloc(0, "%s/btree_key_cache", c->name);
++      if (!shrink)
 +              return -BCH_ERR_ENOMEM_fs_btree_cache_init;
++      bc->shrink = shrink;
++      shrink->seeks           = 0;
++      shrink->count_objects   = bch2_btree_key_cache_count;
++      shrink->scan_objects    = bch2_btree_key_cache_scan;
++      shrink->private_data    = c;
++      shrinker_register(shrink);
 +      return 0;
 +}
 +
 +void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *c)
 +{
 +      prt_printf(out, "nr_freed:\t%lu",       atomic_long_read(&c->nr_freed));
 +      prt_newline(out);
 +      prt_printf(out, "nr_keys:\t%lu",        atomic_long_read(&c->nr_keys));
 +      prt_newline(out);
 +      prt_printf(out, "nr_dirty:\t%lu",       atomic_long_read(&c->nr_dirty));
 +      prt_newline(out);
 +}
 +
 +void bch2_btree_key_cache_exit(void)
 +{
 +      kmem_cache_destroy(bch2_key_cache);
 +}
 +
 +int __init bch2_btree_key_cache_init(void)
 +{
 +      bch2_key_cache = KMEM_CACHE(bkey_cached, SLAB_RECLAIM_ACCOUNT);
 +      if (!bch2_key_cache)
 +              return -ENOMEM;
 +
 +      return 0;
 +}
index c9a38e254949ec2b4aa400d253a52e099c1b0292,0000000000000000000000000000000000000000..bc6714d88925f3183ed9d817f8a79f106e55da7c
mode 100644,000000..100644
--- /dev/null
@@@ -1,739 -1,0 +1,739 @@@
-       struct shrinker         shrink;
 +/* SPDX-License-Identifier: GPL-2.0 */
 +#ifndef _BCACHEFS_BTREE_TYPES_H
 +#define _BCACHEFS_BTREE_TYPES_H
 +
 +#include <linux/list.h>
 +#include <linux/rhashtable.h>
 +
 +//#include "bkey_methods.h"
 +#include "buckets_types.h"
 +#include "darray.h"
 +#include "errcode.h"
 +#include "journal_types.h"
 +#include "replicas_types.h"
 +#include "six.h"
 +
 +struct open_bucket;
 +struct btree_update;
 +struct btree_trans;
 +
 +#define MAX_BSETS             3U
 +
 +struct btree_nr_keys {
 +
 +      /*
 +       * Amount of live metadata (i.e. size of node after a compaction) in
 +       * units of u64s
 +       */
 +      u16                     live_u64s;
 +      u16                     bset_u64s[MAX_BSETS];
 +
 +      /* live keys only: */
 +      u16                     packed_keys;
 +      u16                     unpacked_keys;
 +};
 +
 +struct bset_tree {
 +      /*
 +       * We construct a binary tree in an array as if the array
 +       * started at 1, so that things line up on the same cachelines
 +       * better: see comments in bset.c at cacheline_to_bkey() for
 +       * details
 +       */
 +
 +      /* size of the binary tree and prev array */
 +      u16                     size;
 +
 +      /* function of size - precalculated for to_inorder() */
 +      u16                     extra;
 +
 +      u16                     data_offset;
 +      u16                     aux_data_offset;
 +      u16                     end_offset;
 +};
 +
 +struct btree_write {
 +      struct journal_entry_pin        journal;
 +};
 +
 +struct btree_alloc {
 +      struct open_buckets     ob;
 +      __BKEY_PADDED(k, BKEY_BTREE_PTR_VAL_U64s_MAX);
 +};
 +
 +struct btree_bkey_cached_common {
 +      struct six_lock         lock;
 +      u8                      level;
 +      u8                      btree_id;
 +      bool                    cached;
 +};
 +
 +struct btree {
 +      struct btree_bkey_cached_common c;
 +
 +      struct rhash_head       hash;
 +      u64                     hash_val;
 +
 +      unsigned long           flags;
 +      u16                     written;
 +      u8                      nsets;
 +      u8                      nr_key_bits;
 +      u16                     version_ondisk;
 +
 +      struct bkey_format      format;
 +
 +      struct btree_node       *data;
 +      void                    *aux_data;
 +
 +      /*
 +       * Sets of sorted keys - the real btree node - plus a binary search tree
 +       *
 +       * set[0] is special; set[0]->tree, set[0]->prev and set[0]->data point
 +       * to the memory we have allocated for this btree node. Additionally,
 +       * set[0]->data points to the entire btree node as it exists on disk.
 +       */
 +      struct bset_tree        set[MAX_BSETS];
 +
 +      struct btree_nr_keys    nr;
 +      u16                     sib_u64s[2];
 +      u16                     whiteout_u64s;
 +      u8                      byte_order;
 +      u8                      unpack_fn_len;
 +
 +      struct btree_write      writes[2];
 +
 +      /* Key/pointer for this btree node */
 +      __BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
 +
 +      /*
 +       * XXX: add a delete sequence number, so when bch2_btree_node_relock()
 +       * fails because the lock sequence number has changed - i.e. the
 +       * contents were modified - we can still relock the node if it's still
 +       * the one we want, without redoing the traversal
 +       */
 +
 +      /*
 +       * For asynchronous splits/interior node updates:
 +       * When we do a split, we allocate new child nodes and update the parent
 +       * node to point to them: we update the parent in memory immediately,
 +       * but then we must wait until the children have been written out before
 +       * the update to the parent can be written - this is a list of the
 +       * btree_updates that are blocking this node from being
 +       * written:
 +       */
 +      struct list_head        write_blocked;
 +
 +      /*
 +       * Also for asynchronous splits/interior node updates:
 +       * If a btree node isn't reachable yet, we don't want to kick off
 +       * another write - because that write also won't yet be reachable and
 +       * marking it as completed before it's reachable would be incorrect:
 +       */
 +      unsigned long           will_make_reachable;
 +
 +      struct open_buckets     ob;
 +
 +      /* lru list */
 +      struct list_head        list;
 +};
 +
 +struct btree_cache {
 +      struct rhashtable       table;
 +      bool                    table_init_done;
 +      /*
 +       * We never free a struct btree, except on shutdown - we just put it on
 +       * the btree_cache_freed list and reuse it later. This simplifies the
 +       * code, and it doesn't cost us much memory as the memory usage is
 +       * dominated by buffers that hold the actual btree node data and those
 +       * can be freed - and the number of struct btrees allocated is
 +       * effectively bounded.
 +       *
 +       * btree_cache_freeable effectively is a small cache - we use it because
 +       * high order page allocations can be rather expensive, and it's quite
 +       * common to delete and allocate btree nodes in quick succession. It
 +       * should never grow past ~2-3 nodes in practice.
 +       */
 +      struct mutex            lock;
 +      struct list_head        live;
 +      struct list_head        freeable;
 +      struct list_head        freed_pcpu;
 +      struct list_head        freed_nonpcpu;
 +
 +      /* Number of elements in live + freeable lists */
 +      unsigned                used;
 +      unsigned                reserve;
 +      atomic_t                dirty;
-       struct shrinker         shrink;
++      struct shrinker         *shrink;
 +
 +      /*
 +       * If we need to allocate memory for a new btree node and that
 +       * allocation fails, we can cannibalize another node in the btree cache
 +       * to satisfy the allocation - lock to guarantee only one thread does
 +       * this at a time:
 +       */
 +      struct task_struct      *alloc_lock;
 +      struct closure_waitlist alloc_wait;
 +};
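 +
 +/*
 + * Illustrative use of alloc_lock (a sketch, not code from this patch): the
 + * cannibalize lock is taken around the allocation attempt, e.g.
 + *
 + *	if (!bch2_btree_cache_cannibalize_lock(c, cl)) {
 + *		b = btree_node_alloc_or_cannibalize(c);	// hypothetical helper
 + *		bch2_btree_cache_cannibalize_unlock(c);
 + *	}
 + */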
 +
 +struct btree_node_iter {
 +      struct btree_node_iter_set {
 +              u16     k, end;
 +      } data[MAX_BSETS];
 +};
 +
 +/*
 + * Iterate over all possible positions, synthesizing deleted keys for holes:
 + */
 +static const __maybe_unused u16 BTREE_ITER_SLOTS              = 1 << 0;
 +static const __maybe_unused u16 BTREE_ITER_ALL_LEVELS         = 1 << 1;
 +/*
 + * Indicates that intent locks should be taken on leaf nodes, because we expect
 + * to be doing updates:
 + */
 +static const __maybe_unused u16 BTREE_ITER_INTENT             = 1 << 2;
 +/*
 + * Causes the btree iterator code to prefetch additional btree nodes from disk:
 + */
 +static const __maybe_unused u16 BTREE_ITER_PREFETCH           = 1 << 3;
 +/*
 + * Used in bch2_btree_iter_traverse(), to indicate whether we're searching for
 + * @pos or the first key strictly greater than @pos
 + */
 +static const __maybe_unused u16 BTREE_ITER_IS_EXTENTS         = 1 << 4;
 +static const __maybe_unused u16 BTREE_ITER_NOT_EXTENTS        = 1 << 5;
 +static const __maybe_unused u16 BTREE_ITER_CACHED             = 1 << 6;
 +static const __maybe_unused u16 BTREE_ITER_WITH_KEY_CACHE     = 1 << 7;
 +static const __maybe_unused u16 BTREE_ITER_WITH_UPDATES       = 1 << 8;
 +static const __maybe_unused u16 BTREE_ITER_WITH_JOURNAL       = 1 << 9;
 +static const __maybe_unused u16 __BTREE_ITER_ALL_SNAPSHOTS    = 1 << 10;
 +static const __maybe_unused u16 BTREE_ITER_ALL_SNAPSHOTS      = 1 << 11;
 +static const __maybe_unused u16 BTREE_ITER_FILTER_SNAPSHOTS   = 1 << 12;
 +static const __maybe_unused u16 BTREE_ITER_NOPRESERVE         = 1 << 13;
 +static const __maybe_unused u16 BTREE_ITER_CACHED_NOFILL      = 1 << 14;
 +static const __maybe_unused u16 BTREE_ITER_KEY_CACHE_FILL     = 1 << 15;
 +#define __BTREE_ITER_FLAGS_END                                               16
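 +
 +/*
 + * These flags are OR'd together and passed when initializing an iterator; a
 + * sketch of a caller that walks extents with intent locks and readahead:
 + *
 + *	bch2_trans_iter_init(trans, &iter, BTREE_ID_extents, pos,
 + *			     BTREE_ITER_INTENT|BTREE_ITER_PREFETCH);
 + */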
 +
 +enum btree_path_uptodate {
 +      BTREE_ITER_UPTODATE             = 0,
 +      BTREE_ITER_NEED_RELOCK          = 1,
 +      BTREE_ITER_NEED_TRAVERSE        = 2,
 +};
 +
 +#if defined(CONFIG_BCACHEFS_LOCK_TIME_STATS) || defined(CONFIG_BCACHEFS_DEBUG)
 +#define TRACK_PATH_ALLOCATED
 +#endif
 +
 +struct btree_path {
 +      u8                      idx;
 +      u8                      sorted_idx;
 +      u8                      ref;
 +      u8                      intent_ref;
 +
 +      /* btree_iter_copy starts here: */
 +      struct bpos             pos;
 +
 +      enum btree_id           btree_id:5;
 +      bool                    cached:1;
 +      bool                    preserve:1;
 +      enum btree_path_uptodate uptodate:2;
 +      /*
 +       * When true, failing to relock this path will cause the transaction to
 +       * restart:
 +       */
 +      bool                    should_be_locked:1;
 +      unsigned                level:3,
 +                              locks_want:3;
 +      u8                      nodes_locked;
 +
 +      struct btree_path_level {
 +              struct btree    *b;
 +              struct btree_node_iter iter;
 +              u32             lock_seq;
 +#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
 +              u64             lock_taken_time;
 +#endif
 +      }                       l[BTREE_MAX_DEPTH];
 +#ifdef TRACK_PATH_ALLOCATED
 +      unsigned long           ip_allocated;
 +#endif
 +};
 +
 +static inline struct btree_path_level *path_l(struct btree_path *path)
 +{
 +      return path->l + path->level;
 +}
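 +
 +/*
 + * e.g. the btree node and node iterator at the path's current level are
 + * reached as path_l(path)->b and &path_l(path)->iter.
 + */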
 +
 +static inline unsigned long btree_path_ip_allocated(struct btree_path *path)
 +{
 +#ifdef TRACK_PATH_ALLOCATED
 +      return path->ip_allocated;
 +#else
 +      return _THIS_IP_;
 +#endif
 +}
 +
 +/*
 + * @pos                       - iterator's current position
 + * @level             - current btree depth
 + * @locks_want                - btree level below which we start taking intent locks
 + * @nodes_locked      - bitmask indicating which nodes in @nodes are locked
 + * @nodes_intent_locked       - bitmask indicating which locks are intent locks
 + */
 +struct btree_iter {
 +      struct btree_trans      *trans;
 +      struct btree_path       *path;
 +      struct btree_path       *update_path;
 +      struct btree_path       *key_cache_path;
 +
 +      enum btree_id           btree_id:8;
 +      unsigned                min_depth:3;
 +      unsigned                advanced:1;
 +
 +      /* btree_iter_copy starts here: */
 +      u16                     flags;
 +
 +      /* When we're filtering by snapshot, the snapshot ID we're looking for: */
 +      unsigned                snapshot;
 +
 +      struct bpos             pos;
 +      /*
 +       * Current unpacked key - so that bch2_btree_iter_next()/
 +       * bch2_btree_iter_next_slot() can correctly advance pos.
 +       */
 +      struct bkey             k;
 +
 +      /* BTREE_ITER_WITH_JOURNAL: */
 +      size_t                  journal_idx;
 +      struct bpos             journal_pos;
 +#ifdef TRACK_PATH_ALLOCATED
 +      unsigned long           ip_allocated;
 +#endif
 +};
 +
 +struct btree_key_cache_freelist {
 +      struct bkey_cached      *objs[16];
 +      unsigned                nr;
 +};
 +
 +struct btree_key_cache {
 +      struct mutex            lock;
 +      struct rhashtable       table;
 +      bool                    table_init_done;
 +      struct list_head        freed_pcpu;
 +      struct list_head        freed_nonpcpu;
++      struct shrinker         *shrink;
 +      unsigned                shrink_iter;
 +      struct btree_key_cache_freelist __percpu *pcpu_freed;
 +
 +      atomic_long_t           nr_freed;
 +      atomic_long_t           nr_keys;
 +      atomic_long_t           nr_dirty;
 +};
 +
 +struct bkey_cached_key {
 +      u32                     btree_id;
 +      struct bpos             pos;
 +} __packed __aligned(4);
 +
 +#define BKEY_CACHED_ACCESSED          0
 +#define BKEY_CACHED_DIRTY             1
 +
 +struct bkey_cached {
 +      struct btree_bkey_cached_common c;
 +
 +      unsigned long           flags;
 +      u16                     u64s;
 +      bool                    valid;
 +      u32                     btree_trans_barrier_seq;
 +      struct bkey_cached_key  key;
 +
 +      struct rhash_head       hash;
 +      struct list_head        list;
 +
 +      struct journal_preres   res;
 +      struct journal_entry_pin journal;
 +      u64                     seq;
 +
 +      struct bkey_i           *k;
 +};
 +
 +static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
 +{
 +      return !b->cached
 +              ? container_of(b, struct btree, c)->key.k.p
 +              : container_of(b, struct bkey_cached, c)->key.pos;
 +}
 +
 +struct btree_insert_entry {
 +      unsigned                flags;
 +      u8                      bkey_type;
 +      enum btree_id           btree_id:8;
 +      u8                      level:4;
 +      bool                    cached:1;
 +      bool                    insert_trigger_run:1;
 +      bool                    overwrite_trigger_run:1;
 +      bool                    key_cache_already_flushed:1;
 +      /*
 +       * @old_k may be a key from the journal; @old_btree_u64s always refers
 +       * to the size of the key being overwritten in the btree:
 +       */
 +      u8                      old_btree_u64s;
 +      struct bkey_i           *k;
 +      struct btree_path       *path;
 +      u64                     seq;
 +      /* key being overwritten: */
 +      struct bkey             old_k;
 +      const struct bch_val    *old_v;
 +      unsigned long           ip_allocated;
 +};
 +
 +#ifndef CONFIG_LOCKDEP
 +#define BTREE_ITER_MAX                64
 +#else
 +#define BTREE_ITER_MAX                32
 +#endif
 +
 +struct btree_trans_commit_hook;
 +typedef int (btree_trans_commit_hook_fn)(struct btree_trans *, struct btree_trans_commit_hook *);
 +
 +struct btree_trans_commit_hook {
 +      btree_trans_commit_hook_fn      *fn;
 +      struct btree_trans_commit_hook  *next;
 +};
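 +
 +/*
 + * Sketch of how a commit hook is used (illustrative): embed the hook in a
 + * caller-owned structure, point ->fn at a btree_trans_commit_hook_fn, and
 + * register it with bch2_trans_commit_hook(); the callback recovers its
 + * container with container_of() when it runs during transaction commit.
 + */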
 +
 +#define BTREE_TRANS_MEM_MAX   (1U << 16)
 +
 +#define BTREE_TRANS_MAX_LOCK_HOLD_TIME_NS     10000
 +
 +struct btree_trans {
 +      struct bch_fs           *c;
 +      const char              *fn;
 +      struct closure          ref;
 +      struct list_head        list;
 +      u64                     last_begin_time;
 +
 +      u8                      lock_may_not_fail;
 +      u8                      lock_must_abort;
 +      struct btree_bkey_cached_common *locking;
 +      struct six_lock_waiter  locking_wait;
 +
 +      int                     srcu_idx;
 +
 +      u8                      fn_idx;
 +      u8                      nr_sorted;
 +      u8                      nr_updates;
 +      u8                      nr_wb_updates;
 +      u8                      wb_updates_size;
 +      bool                    used_mempool:1;
 +      bool                    in_traverse_all:1;
 +      bool                    paths_sorted:1;
 +      bool                    memory_allocation_failure:1;
 +      bool                    journal_transaction_names:1;
 +      bool                    journal_replay_not_finished:1;
 +      bool                    notrace_relock_fail:1;
 +      enum bch_errcode        restarted:16;
 +      u32                     restart_count;
 +      unsigned long           last_begin_ip;
 +      unsigned long           last_restarted_ip;
 +      unsigned long           srcu_lock_time;
 +
 +      /*
 +       * For when bch2_trans_update notices we'll be splitting a compressed
 +       * extent:
 +       */
 +      unsigned                extra_journal_res;
 +      unsigned                nr_max_paths;
 +
 +      u64                     paths_allocated;
 +
 +      unsigned                mem_top;
 +      unsigned                mem_max;
 +      unsigned                mem_bytes;
 +      void                    *mem;
 +
 +      u8                      sorted[BTREE_ITER_MAX + 8];
 +      struct btree_path       paths[BTREE_ITER_MAX];
 +      struct btree_insert_entry updates[BTREE_ITER_MAX];
 +      struct btree_write_buffered_key *wb_updates;
 +
 +      /* update path: */
 +      struct btree_trans_commit_hook *hooks;
 +      darray_u64              extra_journal_entries;
 +      struct journal_entry_pin *journal_pin;
 +
 +      struct journal_res      journal_res;
 +      struct journal_preres   journal_preres;
 +      u64                     *journal_seq;
 +      struct disk_reservation *disk_res;
 +      unsigned                journal_u64s;
 +      unsigned                journal_preres_u64s;
 +      struct replicas_delta_list *fs_usage_deltas;
 +};
 +
 +#define BCH_BTREE_WRITE_TYPES()                                               \
 +      x(initial,              0)                                      \
 +      x(init_next_bset,       1)                                      \
 +      x(cache_reclaim,        2)                                      \
 +      x(journal_reclaim,      3)                                      \
 +      x(interior,             4)
 +
 +enum btree_write_type {
 +#define x(t, n) BTREE_WRITE_##t,
 +      BCH_BTREE_WRITE_TYPES()
 +#undef x
 +      BTREE_WRITE_TYPE_NR,
 +};
 +
 +#define BTREE_WRITE_TYPE_MASK (roundup_pow_of_two(BTREE_WRITE_TYPE_NR) - 1)
 +#define BTREE_WRITE_TYPE_BITS ilog2(roundup_pow_of_two(BTREE_WRITE_TYPE_NR))
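 +
 +/*
 + * The x-macro above expands to BTREE_WRITE_initial = 0, BTREE_WRITE_init_next_bset,
 + * BTREE_WRITE_cache_reclaim, BTREE_WRITE_journal_reclaim and BTREE_WRITE_interior,
 + * so BTREE_WRITE_TYPE_NR is 5, roundup_pow_of_two(5) is 8, giving
 + * BTREE_WRITE_TYPE_MASK == 0x7 and BTREE_WRITE_TYPE_BITS == 3; the write type
 + * lives in the low bits of the node's flags word (see BTREE_FLAGS() below).
 + */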
 +
 +#define BTREE_FLAGS()                                                 \
 +      x(read_in_flight)                                               \
 +      x(read_error)                                                   \
 +      x(dirty)                                                        \
 +      x(need_write)                                                   \
 +      x(write_blocked)                                                \
 +      x(will_make_reachable)                                          \
 +      x(noevict)                                                      \
 +      x(write_idx)                                                    \
 +      x(accessed)                                                     \
 +      x(write_in_flight)                                              \
 +      x(write_in_flight_inner)                                        \
 +      x(just_written)                                                 \
 +      x(dying)                                                        \
 +      x(fake)                                                         \
 +      x(need_rewrite)                                                 \
 +      x(never_write)
 +
 +enum btree_flags {
 +      /* First bits for btree node write type */
 +      BTREE_NODE_FLAGS_START = BTREE_WRITE_TYPE_BITS - 1,
 +#define x(flag)       BTREE_NODE_##flag,
 +      BTREE_FLAGS()
 +#undef x
 +};
 +
 +#define x(flag)                                                               \
 +static inline bool btree_node_ ## flag(struct btree *b)                       \
 +{     return test_bit(BTREE_NODE_ ## flag, &b->flags); }              \
 +                                                                      \
 +static inline void set_btree_node_ ## flag(struct btree *b)           \
 +{     set_bit(BTREE_NODE_ ## flag, &b->flags); }                      \
 +                                                                      \
 +static inline void clear_btree_node_ ## flag(struct btree *b)         \
 +{     clear_bit(BTREE_NODE_ ## flag, &b->flags); }
 +
 +BTREE_FLAGS()
 +#undef x
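 +
 +/*
 + * The above generates, for each flag, e.g. for "dirty":
 + *
 + *	static inline bool btree_node_dirty(struct btree *b);
 + *	static inline void set_btree_node_dirty(struct btree *b);
 + *	static inline void clear_btree_node_dirty(struct btree *b);
 + *
 + * which are thin wrappers around test_bit()/set_bit()/clear_bit() on b->flags.
 + */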
 +
 +static inline struct btree_write *btree_current_write(struct btree *b)
 +{
 +      return b->writes + btree_node_write_idx(b);
 +}
 +
 +static inline struct btree_write *btree_prev_write(struct btree *b)
 +{
 +      return b->writes + (btree_node_write_idx(b) ^ 1);
 +}
 +
 +static inline struct bset_tree *bset_tree_last(struct btree *b)
 +{
 +      EBUG_ON(!b->nsets);
 +      return b->set + b->nsets - 1;
 +}
 +
 +static inline void *
 +__btree_node_offset_to_ptr(const struct btree *b, u16 offset)
 +{
 +      return (void *) ((u64 *) b->data + 1 + offset);
 +}
 +
 +static inline u16
 +__btree_node_ptr_to_offset(const struct btree *b, const void *p)
 +{
 +      u16 ret = (u64 *) p - 1 - (u64 *) b->data;
 +
 +      EBUG_ON(__btree_node_offset_to_ptr(b, ret) != p);
 +      return ret;
 +}
 +
 +static inline struct bset *bset(const struct btree *b,
 +                              const struct bset_tree *t)
 +{
 +      return __btree_node_offset_to_ptr(b, t->data_offset);
 +}
 +
 +static inline void set_btree_bset_end(struct btree *b, struct bset_tree *t)
 +{
 +      t->end_offset =
 +              __btree_node_ptr_to_offset(b, vstruct_last(bset(b, t)));
 +}
 +
 +static inline void set_btree_bset(struct btree *b, struct bset_tree *t,
 +                                const struct bset *i)
 +{
 +      t->data_offset = __btree_node_ptr_to_offset(b, i);
 +      set_btree_bset_end(b, t);
 +}
 +
 +static inline struct bset *btree_bset_first(struct btree *b)
 +{
 +      return bset(b, b->set);
 +}
 +
 +static inline struct bset *btree_bset_last(struct btree *b)
 +{
 +      return bset(b, bset_tree_last(b));
 +}
 +
 +static inline u16
 +__btree_node_key_to_offset(const struct btree *b, const struct bkey_packed *k)
 +{
 +      return __btree_node_ptr_to_offset(b, k);
 +}
 +
 +static inline struct bkey_packed *
 +__btree_node_offset_to_key(const struct btree *b, u16 k)
 +{
 +      return __btree_node_offset_to_ptr(b, k);
 +}
 +
 +static inline unsigned btree_bkey_first_offset(const struct bset_tree *t)
 +{
 +      return t->data_offset + offsetof(struct bset, _data) / sizeof(u64);
 +}
 +
 +#define btree_bkey_first(_b, _t)                                      \
 +({                                                                    \
 +      EBUG_ON(bset(_b, _t)->start !=                                  \
 +              __btree_node_offset_to_key(_b, btree_bkey_first_offset(_t)));\
 +                                                                      \
 +      bset(_b, _t)->start;                                            \
 +})
 +
 +#define btree_bkey_last(_b, _t)                                               \
 +({                                                                    \
 +      EBUG_ON(__btree_node_offset_to_key(_b, (_t)->end_offset) !=     \
 +              vstruct_last(bset(_b, _t)));                            \
 +                                                                      \
 +      __btree_node_offset_to_key(_b, (_t)->end_offset);               \
 +})
 +
 +static inline unsigned bset_u64s(struct bset_tree *t)
 +{
 +      return t->end_offset - t->data_offset -
 +              sizeof(struct bset) / sizeof(u64);
 +}
 +
 +static inline unsigned bset_dead_u64s(struct btree *b, struct bset_tree *t)
 +{
 +      return bset_u64s(t) - b->nr.bset_u64s[t - b->set];
 +}
 +
 +static inline unsigned bset_byte_offset(struct btree *b, void *i)
 +{
 +      return i - (void *) b->data;
 +}
 +
 +enum btree_node_type {
 +#define x(kwd, val, ...) BKEY_TYPE_##kwd = val,
 +      BCH_BTREE_IDS()
 +#undef x
 +      BKEY_TYPE_btree,
 +};
 +
 +/* Type of a key in btree @id at level @level: */
 +static inline enum btree_node_type __btree_node_type(unsigned level, enum btree_id id)
 +{
 +      return level ? BKEY_TYPE_btree : (enum btree_node_type) id;
 +}
 +
 +/* Type of keys @b contains: */
 +static inline enum btree_node_type btree_node_type(struct btree *b)
 +{
 +      return __btree_node_type(b->c.level, b->c.btree_id);
 +}
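 +
 +/*
 + * e.g. keys in a leaf (level 0) node of the extents btree have type
 + * BKEY_TYPE_extents, while keys in its interior nodes have type
 + * BKEY_TYPE_btree, since interior nodes store pointers to child nodes.
 + */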
 +
 +#define BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS            \
 +      (BIT(BKEY_TYPE_extents)|                        \
 +       BIT(BKEY_TYPE_alloc)|                          \
 +       BIT(BKEY_TYPE_inodes)|                         \
 +       BIT(BKEY_TYPE_stripes)|                        \
 +       BIT(BKEY_TYPE_reflink)|                        \
 +       BIT(BKEY_TYPE_btree))
 +
 +#define BTREE_NODE_TYPE_HAS_MEM_TRIGGERS              \
 +      (BIT(BKEY_TYPE_alloc)|                          \
 +       BIT(BKEY_TYPE_inodes)|                         \
 +       BIT(BKEY_TYPE_stripes)|                        \
 +       BIT(BKEY_TYPE_snapshots))
 +
 +#define BTREE_NODE_TYPE_HAS_TRIGGERS                  \
 +      (BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS|            \
 +       BTREE_NODE_TYPE_HAS_MEM_TRIGGERS)
 +
 +static inline bool btree_node_type_needs_gc(enum btree_node_type type)
 +{
 +      return BTREE_NODE_TYPE_HAS_TRIGGERS & (1U << type);
 +}
 +
 +static inline bool btree_node_type_is_extents(enum btree_node_type type)
 +{
 +      const unsigned mask = 0
 +#define x(name, nr, flags, ...)       |((!!((flags) & BTREE_ID_EXTENTS)) << nr)
 +      BCH_BTREE_IDS()
 +#undef x
 +      ;
 +
 +      return (1U << type) & mask;
 +}
 +
 +static inline bool btree_id_is_extents(enum btree_id btree)
 +{
 +      return btree_node_type_is_extents((enum btree_node_type) btree);
 +}
 +
 +static inline bool btree_type_has_snapshots(enum btree_id id)
 +{
 +      const unsigned mask = 0
 +#define x(name, nr, flags, ...)       |((!!((flags) & BTREE_ID_SNAPSHOTS)) << nr)
 +      BCH_BTREE_IDS()
 +#undef x
 +      ;
 +
 +      return (1U << id) & mask;
 +}
 +
 +static inline bool btree_type_has_ptrs(enum btree_id id)
 +{
 +      const unsigned mask = 0
 +#define x(name, nr, flags, ...)       |((!!((flags) & BTREE_ID_DATA)) << nr)
 +      BCH_BTREE_IDS()
 +#undef x
 +      ;
 +
 +      return (1U << id) & mask;
 +}
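 +
 +/*
 + * The helpers above build a compile-time bitmask from the per-btree flags in
 + * BCH_BTREE_IDS() and test the given btree ID against it - e.g. (illustrative)
 + * btree_type_has_snapshots(iter->btree_id) reports whether keys in that btree
 + * carry a snapshot field in their position.
 + */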
 +
 +struct btree_root {
 +      struct btree            *b;
 +
 +      /* On disk root - see async splits: */
 +      __BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
 +      u8                      level;
 +      u8                      alive;
 +      s8                      error;
 +};
 +
 +enum btree_gc_coalesce_fail_reason {
 +      BTREE_GC_COALESCE_FAIL_RESERVE_GET,
 +      BTREE_GC_COALESCE_FAIL_KEYLIST_REALLOC,
 +      BTREE_GC_COALESCE_FAIL_FORMAT_FITS,
 +};
 +
 +enum btree_node_sibling {
 +      btree_prev_sib,
 +      btree_next_sib,
 +};
 +
 +#endif /* _BCACHEFS_BTREE_TYPES_H */
diff --combined fs/bcachefs/fs.c
index 6642b88c41a0e27379d189372d9a369d282c338d,0000000000000000000000000000000000000000..a2a5133fb6b5aec8a3023dacf24c717d416c745f
mode 100644,000000..100644
--- /dev/null
@@@ -1,1980 -1,0 +1,1980 @@@
-       sb->s_shrink.seeks = 0;
 +// SPDX-License-Identifier: GPL-2.0
 +#ifndef NO_BCACHEFS_FS
 +
 +#include "bcachefs.h"
 +#include "acl.h"
 +#include "bkey_buf.h"
 +#include "btree_update.h"
 +#include "buckets.h"
 +#include "chardev.h"
 +#include "dirent.h"
 +#include "errcode.h"
 +#include "extents.h"
 +#include "fs.h"
 +#include "fs-common.h"
 +#include "fs-io.h"
 +#include "fs-ioctl.h"
 +#include "fs-io-buffered.h"
 +#include "fs-io-direct.h"
 +#include "fs-io-pagecache.h"
 +#include "fsck.h"
 +#include "inode.h"
 +#include "io_read.h"
 +#include "journal.h"
 +#include "keylist.h"
 +#include "quota.h"
 +#include "snapshot.h"
 +#include "super.h"
 +#include "xattr.h"
 +
 +#include <linux/aio.h>
 +#include <linux/backing-dev.h>
 +#include <linux/exportfs.h>
 +#include <linux/fiemap.h>
 +#include <linux/module.h>
 +#include <linux/pagemap.h>
 +#include <linux/posix_acl.h>
 +#include <linux/random.h>
 +#include <linux/seq_file.h>
 +#include <linux/statfs.h>
 +#include <linux/string.h>
 +#include <linux/xattr.h>
 +
 +static struct kmem_cache *bch2_inode_cache;
 +
 +static void bch2_vfs_inode_init(struct btree_trans *, subvol_inum,
 +                              struct bch_inode_info *,
 +                              struct bch_inode_unpacked *,
 +                              struct bch_subvolume *);
 +
 +void bch2_inode_update_after_write(struct btree_trans *trans,
 +                                 struct bch_inode_info *inode,
 +                                 struct bch_inode_unpacked *bi,
 +                                 unsigned fields)
 +{
 +      struct bch_fs *c = trans->c;
 +
 +      BUG_ON(bi->bi_inum != inode->v.i_ino);
 +
 +      bch2_assert_pos_locked(trans, BTREE_ID_inodes,
 +                             POS(0, bi->bi_inum),
 +                             c->opts.inodes_use_key_cache);
 +
 +      set_nlink(&inode->v, bch2_inode_nlink_get(bi));
 +      i_uid_write(&inode->v, bi->bi_uid);
 +      i_gid_write(&inode->v, bi->bi_gid);
 +      inode->v.i_mode = bi->bi_mode;
 +
 +      if (fields & ATTR_ATIME)
 +              inode_set_atime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_atime));
 +      if (fields & ATTR_MTIME)
 +              inode_set_mtime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_mtime));
 +      if (fields & ATTR_CTIME)
 +              inode_set_ctime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_ctime));
 +
 +      inode->ei_inode         = *bi;
 +
 +      bch2_inode_flags_to_vfs(inode);
 +}
 +
 +int __must_check bch2_write_inode(struct bch_fs *c,
 +                                struct bch_inode_info *inode,
 +                                inode_set_fn set,
 +                                void *p, unsigned fields)
 +{
 +      struct btree_trans *trans = bch2_trans_get(c);
 +      struct btree_iter iter = { NULL };
 +      struct bch_inode_unpacked inode_u;
 +      int ret;
 +retry:
 +      bch2_trans_begin(trans);
 +
 +      ret   = bch2_inode_peek(trans, &iter, &inode_u, inode_inum(inode),
 +                              BTREE_ITER_INTENT) ?:
 +              (set ? set(trans, inode, &inode_u, p) : 0) ?:
 +              bch2_inode_write(trans, &iter, &inode_u) ?:
 +              bch2_trans_commit(trans, NULL, NULL, BTREE_INSERT_NOFAIL);
 +
 +      /*
 +       * the btree node lock protects inode->ei_inode, not ei_update_lock;
 +       * this is important for inode updates via bchfs_write_index_update
 +       */
 +      if (!ret)
 +              bch2_inode_update_after_write(trans, inode, &inode_u, fields);
 +
 +      bch2_trans_iter_exit(trans, &iter);
 +
 +      if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +              goto retry;
 +
 +      bch2_fs_fatal_err_on(bch2_err_matches(ret, ENOENT), c,
 +                           "inode %u:%llu not found when updating",
 +                           inode_inum(inode).subvol,
 +                           inode_inum(inode).inum);
 +
 +      bch2_trans_put(trans);
 +      return ret < 0 ? ret : 0;
 +}
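 +
 +/*
 + * Sketch of a typical caller (illustrative, not from this file): pass a small
 + * inode_set_fn that edits the unpacked inode under the btree transaction, e.g.
 + *
 + *	static int set_flags_fn(struct btree_trans *trans,
 + *				struct bch_inode_info *inode,
 + *				struct bch_inode_unpacked *bi, void *p)
 + *	{
 + *		bi->bi_flags = *(unsigned *) p;	// hypothetical update
 + *		return 0;
 + *	}
 + *
 + *	bch2_write_inode(c, inode, set_flags_fn, &new_flags, 0);
 + */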
 +
 +int bch2_fs_quota_transfer(struct bch_fs *c,
 +                         struct bch_inode_info *inode,
 +                         struct bch_qid new_qid,
 +                         unsigned qtypes,
 +                         enum quota_acct_mode mode)
 +{
 +      unsigned i;
 +      int ret;
 +
 +      qtypes &= enabled_qtypes(c);
 +
 +      for (i = 0; i < QTYP_NR; i++)
 +              if (new_qid.q[i] == inode->ei_qid.q[i])
 +                      qtypes &= ~(1U << i);
 +
 +      if (!qtypes)
 +              return 0;
 +
 +      mutex_lock(&inode->ei_quota_lock);
 +
 +      ret = bch2_quota_transfer(c, qtypes, new_qid,
 +                                inode->ei_qid,
 +                                inode->v.i_blocks +
 +                                inode->ei_quota_reserved,
 +                                mode);
 +      if (!ret)
 +              for (i = 0; i < QTYP_NR; i++)
 +                      if (qtypes & (1 << i))
 +                              inode->ei_qid.q[i] = new_qid.q[i];
 +
 +      mutex_unlock(&inode->ei_quota_lock);
 +
 +      return ret;
 +}
 +
 +static int bch2_iget5_test(struct inode *vinode, void *p)
 +{
 +      struct bch_inode_info *inode = to_bch_ei(vinode);
 +      subvol_inum *inum = p;
 +
 +      return inode->ei_subvol == inum->subvol &&
 +              inode->ei_inode.bi_inum == inum->inum;
 +}
 +
 +static int bch2_iget5_set(struct inode *vinode, void *p)
 +{
 +      struct bch_inode_info *inode = to_bch_ei(vinode);
 +      subvol_inum *inum = p;
 +
 +      inode->v.i_ino          = inum->inum;
 +      inode->ei_subvol        = inum->subvol;
 +      inode->ei_inode.bi_inum = inum->inum;
 +      return 0;
 +}
 +
 +static unsigned bch2_inode_hash(subvol_inum inum)
 +{
 +      return jhash_3words(inum.subvol, inum.inum >> 32, inum.inum, JHASH_INITVAL);
 +}
 +
 +struct inode *bch2_vfs_inode_get(struct bch_fs *c, subvol_inum inum)
 +{
 +      struct bch_inode_unpacked inode_u;
 +      struct bch_inode_info *inode;
 +      struct btree_trans *trans;
 +      struct bch_subvolume subvol;
 +      int ret;
 +
 +      inode = to_bch_ei(iget5_locked(c->vfs_sb,
 +                                     bch2_inode_hash(inum),
 +                                     bch2_iget5_test,
 +                                     bch2_iget5_set,
 +                                     &inum));
 +      if (unlikely(!inode))
 +              return ERR_PTR(-ENOMEM);
 +      if (!(inode->v.i_state & I_NEW))
 +              return &inode->v;
 +
 +      trans = bch2_trans_get(c);
 +      ret = lockrestart_do(trans,
 +              bch2_subvolume_get(trans, inum.subvol, true, 0, &subvol) ?:
 +              bch2_inode_find_by_inum_trans(trans, inum, &inode_u));
 +
 +      if (!ret)
 +              bch2_vfs_inode_init(trans, inum, inode, &inode_u, &subvol);
 +      bch2_trans_put(trans);
 +
 +      if (ret) {
 +              iget_failed(&inode->v);
 +              return ERR_PTR(bch2_err_class(ret));
 +      }
 +
 +      mutex_lock(&c->vfs_inodes_lock);
 +      list_add(&inode->ei_vfs_inode_list, &c->vfs_inodes_list);
 +      mutex_unlock(&c->vfs_inodes_lock);
 +
 +      unlock_new_inode(&inode->v);
 +
 +      return &inode->v;
 +}
 +
 +struct bch_inode_info *
 +__bch2_create(struct mnt_idmap *idmap,
 +            struct bch_inode_info *dir, struct dentry *dentry,
 +            umode_t mode, dev_t rdev, subvol_inum snapshot_src,
 +            unsigned flags)
 +{
 +      struct bch_fs *c = dir->v.i_sb->s_fs_info;
 +      struct btree_trans *trans;
 +      struct bch_inode_unpacked dir_u;
 +      struct bch_inode_info *inode, *old;
 +      struct bch_inode_unpacked inode_u;
 +      struct posix_acl *default_acl = NULL, *acl = NULL;
 +      subvol_inum inum;
 +      struct bch_subvolume subvol;
 +      u64 journal_seq = 0;
 +      int ret;
 +
 +      /*
 +       * preallocate acls + vfs inode before btree transaction, so that
 +       * nothing can fail after the transaction succeeds:
 +       */
 +#ifdef CONFIG_BCACHEFS_POSIX_ACL
 +      ret = posix_acl_create(&dir->v, &mode, &default_acl, &acl);
 +      if (ret)
 +              return ERR_PTR(ret);
 +#endif
 +      inode = to_bch_ei(new_inode(c->vfs_sb));
 +      if (unlikely(!inode)) {
 +              inode = ERR_PTR(-ENOMEM);
 +              goto err;
 +      }
 +
 +      bch2_inode_init_early(c, &inode_u);
 +
 +      if (!(flags & BCH_CREATE_TMPFILE))
 +              mutex_lock(&dir->ei_update_lock);
 +
 +      trans = bch2_trans_get(c);
 +retry:
 +      bch2_trans_begin(trans);
 +
 +      ret   = bch2_create_trans(trans,
 +                                inode_inum(dir), &dir_u, &inode_u,
 +                                !(flags & BCH_CREATE_TMPFILE)
 +                                ? &dentry->d_name : NULL,
 +                                from_kuid(i_user_ns(&dir->v), current_fsuid()),
 +                                from_kgid(i_user_ns(&dir->v), current_fsgid()),
 +                                mode, rdev,
 +                                default_acl, acl, snapshot_src, flags) ?:
 +              bch2_quota_acct(c, bch_qid(&inode_u), Q_INO, 1,
 +                              KEY_TYPE_QUOTA_PREALLOC);
 +      if (unlikely(ret))
 +              goto err_before_quota;
 +
 +      inum.subvol = inode_u.bi_subvol ?: dir->ei_subvol;
 +      inum.inum = inode_u.bi_inum;
 +
 +      ret   = bch2_subvolume_get(trans, inum.subvol, true,
 +                                 BTREE_ITER_WITH_UPDATES, &subvol) ?:
 +              bch2_trans_commit(trans, NULL, &journal_seq, 0);
 +      if (unlikely(ret)) {
 +              bch2_quota_acct(c, bch_qid(&inode_u), Q_INO, -1,
 +                              KEY_TYPE_QUOTA_WARN);
 +err_before_quota:
 +              if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +                      goto retry;
 +              goto err_trans;
 +      }
 +
 +      if (!(flags & BCH_CREATE_TMPFILE)) {
 +              bch2_inode_update_after_write(trans, dir, &dir_u,
 +                                            ATTR_MTIME|ATTR_CTIME);
 +              mutex_unlock(&dir->ei_update_lock);
 +      }
 +
 +      bch2_iget5_set(&inode->v, &inum);
 +      bch2_vfs_inode_init(trans, inum, inode, &inode_u, &subvol);
 +
 +      set_cached_acl(&inode->v, ACL_TYPE_ACCESS, acl);
 +      set_cached_acl(&inode->v, ACL_TYPE_DEFAULT, default_acl);
 +
 +      /*
 +       * we must insert the new inode into the inode cache before calling
 +       * bch2_trans_exit() and dropping locks, else we could race with another
 +       * thread pulling the inode in and modifying it:
 +       */
 +
 +      inode->v.i_state |= I_CREATING;
 +
 +      old = to_bch_ei(inode_insert5(&inode->v,
 +                                    bch2_inode_hash(inum),
 +                                    bch2_iget5_test,
 +                                    bch2_iget5_set,
 +                                    &inum));
 +      BUG_ON(!old);
 +
 +      if (unlikely(old != inode)) {
 +              /*
 +               * We raced, another process pulled the new inode into cache
 +               * before us:
 +               */
 +              make_bad_inode(&inode->v);
 +              iput(&inode->v);
 +
 +              inode = old;
 +      } else {
 +              mutex_lock(&c->vfs_inodes_lock);
 +              list_add(&inode->ei_vfs_inode_list, &c->vfs_inodes_list);
 +              mutex_unlock(&c->vfs_inodes_lock);
 +              /*
 +               * we really don't want insert_inode_locked2() to be setting
 +               * I_NEW...
 +               */
 +              unlock_new_inode(&inode->v);
 +      }
 +
 +      bch2_trans_put(trans);
 +err:
 +      posix_acl_release(default_acl);
 +      posix_acl_release(acl);
 +      return inode;
 +err_trans:
 +      if (!(flags & BCH_CREATE_TMPFILE))
 +              mutex_unlock(&dir->ei_update_lock);
 +
 +      bch2_trans_put(trans);
 +      make_bad_inode(&inode->v);
 +      iput(&inode->v);
 +      inode = ERR_PTR(ret);
 +      goto err;
 +}
 +
 +/* methods */
 +
 +static struct dentry *bch2_lookup(struct inode *vdir, struct dentry *dentry,
 +                                unsigned int flags)
 +{
 +      struct bch_fs *c = vdir->i_sb->s_fs_info;
 +      struct bch_inode_info *dir = to_bch_ei(vdir);
 +      struct bch_hash_info hash = bch2_hash_info_init(c, &dir->ei_inode);
 +      struct inode *vinode = NULL;
 +      subvol_inum inum = { .subvol = 1 };
 +      int ret;
 +
 +      ret = bch2_dirent_lookup(c, inode_inum(dir), &hash,
 +                               &dentry->d_name, &inum);
 +
 +      if (!ret)
 +              vinode = bch2_vfs_inode_get(c, inum);
 +
 +      return d_splice_alias(vinode, dentry);
 +}
 +
 +static int bch2_mknod(struct mnt_idmap *idmap,
 +                    struct inode *vdir, struct dentry *dentry,
 +                    umode_t mode, dev_t rdev)
 +{
 +      struct bch_inode_info *inode =
 +              __bch2_create(idmap, to_bch_ei(vdir), dentry, mode, rdev,
 +                            (subvol_inum) { 0 }, 0);
 +
 +      if (IS_ERR(inode))
 +              return bch2_err_class(PTR_ERR(inode));
 +
 +      d_instantiate(dentry, &inode->v);
 +      return 0;
 +}
 +
 +static int bch2_create(struct mnt_idmap *idmap,
 +                     struct inode *vdir, struct dentry *dentry,
 +                     umode_t mode, bool excl)
 +{
 +      return bch2_mknod(idmap, vdir, dentry, mode|S_IFREG, 0);
 +}
 +
 +static int __bch2_link(struct bch_fs *c,
 +                     struct bch_inode_info *inode,
 +                     struct bch_inode_info *dir,
 +                     struct dentry *dentry)
 +{
 +      struct btree_trans *trans = bch2_trans_get(c);
 +      struct bch_inode_unpacked dir_u, inode_u;
 +      int ret;
 +
 +      mutex_lock(&inode->ei_update_lock);
 +
 +      ret = commit_do(trans, NULL, NULL, 0,
 +                      bch2_link_trans(trans,
 +                                      inode_inum(dir),   &dir_u,
 +                                      inode_inum(inode), &inode_u,
 +                                      &dentry->d_name));
 +
 +      if (likely(!ret)) {
 +              bch2_inode_update_after_write(trans, dir, &dir_u,
 +                                            ATTR_MTIME|ATTR_CTIME);
 +              bch2_inode_update_after_write(trans, inode, &inode_u, ATTR_CTIME);
 +      }
 +
 +      bch2_trans_put(trans);
 +      mutex_unlock(&inode->ei_update_lock);
 +      return ret;
 +}
 +
 +static int bch2_link(struct dentry *old_dentry, struct inode *vdir,
 +                   struct dentry *dentry)
 +{
 +      struct bch_fs *c = vdir->i_sb->s_fs_info;
 +      struct bch_inode_info *dir = to_bch_ei(vdir);
 +      struct bch_inode_info *inode = to_bch_ei(old_dentry->d_inode);
 +      int ret;
 +
 +      lockdep_assert_held(&inode->v.i_rwsem);
 +
 +      ret = __bch2_link(c, inode, dir, dentry);
 +      if (unlikely(ret))
 +              return ret;
 +
 +      ihold(&inode->v);
 +      d_instantiate(dentry, &inode->v);
 +      return 0;
 +}
 +
 +int __bch2_unlink(struct inode *vdir, struct dentry *dentry,
 +                bool deleting_snapshot)
 +{
 +      struct bch_fs *c = vdir->i_sb->s_fs_info;
 +      struct bch_inode_info *dir = to_bch_ei(vdir);
 +      struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
 +      struct bch_inode_unpacked dir_u, inode_u;
 +      struct btree_trans *trans = bch2_trans_get(c);
 +      int ret;
 +
 +      bch2_lock_inodes(INODE_UPDATE_LOCK, dir, inode);
 +
 +      ret = commit_do(trans, NULL, NULL,
 +                      BTREE_INSERT_NOFAIL,
 +              bch2_unlink_trans(trans,
 +                                inode_inum(dir), &dir_u,
 +                                &inode_u, &dentry->d_name,
 +                                deleting_snapshot));
 +      if (unlikely(ret))
 +              goto err;
 +
 +      bch2_inode_update_after_write(trans, dir, &dir_u,
 +                                    ATTR_MTIME|ATTR_CTIME);
 +      bch2_inode_update_after_write(trans, inode, &inode_u,
 +                                    ATTR_MTIME);
 +
 +      if (inode_u.bi_subvol) {
 +              /*
 +               * Subvolume deletion is asynchronous, but we still want to tell
 +               * the VFS that it's been deleted here:
 +               */
 +              set_nlink(&inode->v, 0);
 +      }
 +err:
 +      bch2_unlock_inodes(INODE_UPDATE_LOCK, dir, inode);
 +      bch2_trans_put(trans);
 +
 +      return ret;
 +}
 +
 +static int bch2_unlink(struct inode *vdir, struct dentry *dentry)
 +{
 +      return __bch2_unlink(vdir, dentry, false);
 +}
 +
 +static int bch2_symlink(struct mnt_idmap *idmap,
 +                      struct inode *vdir, struct dentry *dentry,
 +                      const char *symname)
 +{
 +      struct bch_fs *c = vdir->i_sb->s_fs_info;
 +      struct bch_inode_info *dir = to_bch_ei(vdir), *inode;
 +      int ret;
 +
 +      inode = __bch2_create(idmap, dir, dentry, S_IFLNK|S_IRWXUGO, 0,
 +                            (subvol_inum) { 0 }, BCH_CREATE_TMPFILE);
 +      if (IS_ERR(inode))
 +              return bch2_err_class(PTR_ERR(inode));
 +
 +      inode_lock(&inode->v);
 +      ret = page_symlink(&inode->v, symname, strlen(symname) + 1);
 +      inode_unlock(&inode->v);
 +
 +      if (unlikely(ret))
 +              goto err;
 +
 +      ret = filemap_write_and_wait_range(inode->v.i_mapping, 0, LLONG_MAX);
 +      if (unlikely(ret))
 +              goto err;
 +
 +      ret = __bch2_link(c, inode, dir, dentry);
 +      if (unlikely(ret))
 +              goto err;
 +
 +      d_instantiate(dentry, &inode->v);
 +      return 0;
 +err:
 +      iput(&inode->v);
 +      return ret;
 +}
 +
 +static int bch2_mkdir(struct mnt_idmap *idmap,
 +                    struct inode *vdir, struct dentry *dentry, umode_t mode)
 +{
 +      return bch2_mknod(idmap, vdir, dentry, mode|S_IFDIR, 0);
 +}
 +
 +static int bch2_rename2(struct mnt_idmap *idmap,
 +                      struct inode *src_vdir, struct dentry *src_dentry,
 +                      struct inode *dst_vdir, struct dentry *dst_dentry,
 +                      unsigned flags)
 +{
 +      struct bch_fs *c = src_vdir->i_sb->s_fs_info;
 +      struct bch_inode_info *src_dir = to_bch_ei(src_vdir);
 +      struct bch_inode_info *dst_dir = to_bch_ei(dst_vdir);
 +      struct bch_inode_info *src_inode = to_bch_ei(src_dentry->d_inode);
 +      struct bch_inode_info *dst_inode = to_bch_ei(dst_dentry->d_inode);
 +      struct bch_inode_unpacked dst_dir_u, src_dir_u;
 +      struct bch_inode_unpacked src_inode_u, dst_inode_u;
 +      struct btree_trans *trans;
 +      enum bch_rename_mode mode = flags & RENAME_EXCHANGE
 +              ? BCH_RENAME_EXCHANGE
 +              : dst_dentry->d_inode
 +              ? BCH_RENAME_OVERWRITE : BCH_RENAME;
 +      int ret;
 +
 +      if (flags & ~(RENAME_NOREPLACE|RENAME_EXCHANGE))
 +              return -EINVAL;
 +
 +      if (mode == BCH_RENAME_OVERWRITE) {
 +              ret = filemap_write_and_wait_range(src_inode->v.i_mapping,
 +                                                 0, LLONG_MAX);
 +              if (ret)
 +                      return ret;
 +      }
 +
 +      trans = bch2_trans_get(c);
 +
 +      bch2_lock_inodes(INODE_UPDATE_LOCK,
 +                       src_dir,
 +                       dst_dir,
 +                       src_inode,
 +                       dst_inode);
 +
 +      if (inode_attr_changing(dst_dir, src_inode, Inode_opt_project)) {
 +              ret = bch2_fs_quota_transfer(c, src_inode,
 +                                           dst_dir->ei_qid,
 +                                           1 << QTYP_PRJ,
 +                                           KEY_TYPE_QUOTA_PREALLOC);
 +              if (ret)
 +                      goto err;
 +      }
 +
 +      if (mode == BCH_RENAME_EXCHANGE &&
 +          inode_attr_changing(src_dir, dst_inode, Inode_opt_project)) {
 +              ret = bch2_fs_quota_transfer(c, dst_inode,
 +                                           src_dir->ei_qid,
 +                                           1 << QTYP_PRJ,
 +                                           KEY_TYPE_QUOTA_PREALLOC);
 +              if (ret)
 +                      goto err;
 +      }
 +
 +      ret = commit_do(trans, NULL, NULL, 0,
 +                      bch2_rename_trans(trans,
 +                                        inode_inum(src_dir), &src_dir_u,
 +                                        inode_inum(dst_dir), &dst_dir_u,
 +                                        &src_inode_u,
 +                                        &dst_inode_u,
 +                                        &src_dentry->d_name,
 +                                        &dst_dentry->d_name,
 +                                        mode));
 +      if (unlikely(ret))
 +              goto err;
 +
 +      BUG_ON(src_inode->v.i_ino != src_inode_u.bi_inum);
 +      BUG_ON(dst_inode &&
 +             dst_inode->v.i_ino != dst_inode_u.bi_inum);
 +
 +      bch2_inode_update_after_write(trans, src_dir, &src_dir_u,
 +                                    ATTR_MTIME|ATTR_CTIME);
 +
 +      if (src_dir != dst_dir)
 +              bch2_inode_update_after_write(trans, dst_dir, &dst_dir_u,
 +                                            ATTR_MTIME|ATTR_CTIME);
 +
 +      bch2_inode_update_after_write(trans, src_inode, &src_inode_u,
 +                                    ATTR_CTIME);
 +
 +      if (dst_inode)
 +              bch2_inode_update_after_write(trans, dst_inode, &dst_inode_u,
 +                                            ATTR_CTIME);
 +err:
 +      bch2_trans_put(trans);
 +
 +      bch2_fs_quota_transfer(c, src_inode,
 +                             bch_qid(&src_inode->ei_inode),
 +                             1 << QTYP_PRJ,
 +                             KEY_TYPE_QUOTA_NOCHECK);
 +      if (dst_inode)
 +              bch2_fs_quota_transfer(c, dst_inode,
 +                                     bch_qid(&dst_inode->ei_inode),
 +                                     1 << QTYP_PRJ,
 +                                     KEY_TYPE_QUOTA_NOCHECK);
 +
 +      bch2_unlock_inodes(INODE_UPDATE_LOCK,
 +                         src_dir,
 +                         dst_dir,
 +                         src_inode,
 +                         dst_inode);
 +
 +      return ret;
 +}
 +
 +static void bch2_setattr_copy(struct mnt_idmap *idmap,
 +                            struct bch_inode_info *inode,
 +                            struct bch_inode_unpacked *bi,
 +                            struct iattr *attr)
 +{
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
 +      unsigned int ia_valid = attr->ia_valid;
 +
 +      if (ia_valid & ATTR_UID)
 +              bi->bi_uid = from_kuid(i_user_ns(&inode->v), attr->ia_uid);
 +      if (ia_valid & ATTR_GID)
 +              bi->bi_gid = from_kgid(i_user_ns(&inode->v), attr->ia_gid);
 +
 +      if (ia_valid & ATTR_SIZE)
 +              bi->bi_size = attr->ia_size;
 +
 +      if (ia_valid & ATTR_ATIME)
 +              bi->bi_atime = timespec_to_bch2_time(c, attr->ia_atime);
 +      if (ia_valid & ATTR_MTIME)
 +              bi->bi_mtime = timespec_to_bch2_time(c, attr->ia_mtime);
 +      if (ia_valid & ATTR_CTIME)
 +              bi->bi_ctime = timespec_to_bch2_time(c, attr->ia_ctime);
 +
 +      if (ia_valid & ATTR_MODE) {
 +              umode_t mode = attr->ia_mode;
 +              kgid_t gid = ia_valid & ATTR_GID
 +                      ? attr->ia_gid
 +                      : inode->v.i_gid;
 +
 +              if (!in_group_p(gid) &&
 +                  !capable_wrt_inode_uidgid(idmap, &inode->v, CAP_FSETID))
 +                      mode &= ~S_ISGID;
 +              bi->bi_mode = mode;
 +      }
 +}
 +
 +int bch2_setattr_nonsize(struct mnt_idmap *idmap,
 +                       struct bch_inode_info *inode,
 +                       struct iattr *attr)
 +{
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
 +      struct bch_qid qid;
 +      struct btree_trans *trans;
 +      struct btree_iter inode_iter = { NULL };
 +      struct bch_inode_unpacked inode_u;
 +      struct posix_acl *acl = NULL;
 +      int ret;
 +
 +      mutex_lock(&inode->ei_update_lock);
 +
 +      qid = inode->ei_qid;
 +
 +      if (attr->ia_valid & ATTR_UID)
 +              qid.q[QTYP_USR] = from_kuid(i_user_ns(&inode->v), attr->ia_uid);
 +
 +      if (attr->ia_valid & ATTR_GID)
 +              qid.q[QTYP_GRP] = from_kgid(i_user_ns(&inode->v), attr->ia_gid);
 +
 +      ret = bch2_fs_quota_transfer(c, inode, qid, ~0,
 +                                   KEY_TYPE_QUOTA_PREALLOC);
 +      if (ret)
 +              goto err;
 +
 +      trans = bch2_trans_get(c);
 +retry:
 +      bch2_trans_begin(trans);
 +      kfree(acl);
 +      acl = NULL;
 +
 +      ret = bch2_inode_peek(trans, &inode_iter, &inode_u, inode_inum(inode),
 +                            BTREE_ITER_INTENT);
 +      if (ret)
 +              goto btree_err;
 +
 +      bch2_setattr_copy(idmap, inode, &inode_u, attr);
 +
 +      if (attr->ia_valid & ATTR_MODE) {
 +              ret = bch2_acl_chmod(trans, inode_inum(inode), &inode_u,
 +                                   inode_u.bi_mode, &acl);
 +              if (ret)
 +                      goto btree_err;
 +      }
 +
 +      ret =   bch2_inode_write(trans, &inode_iter, &inode_u) ?:
 +              bch2_trans_commit(trans, NULL, NULL,
 +                                BTREE_INSERT_NOFAIL);
 +btree_err:
 +      bch2_trans_iter_exit(trans, &inode_iter);
 +
 +      if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +              goto retry;
 +      if (unlikely(ret))
 +              goto err_trans;
 +
 +      bch2_inode_update_after_write(trans, inode, &inode_u, attr->ia_valid);
 +
 +      if (acl)
 +              set_cached_acl(&inode->v, ACL_TYPE_ACCESS, acl);
 +err_trans:
 +      bch2_trans_put(trans);
 +err:
 +      mutex_unlock(&inode->ei_update_lock);
 +
 +      return bch2_err_class(ret);
 +}
 +
 +static int bch2_getattr(struct mnt_idmap *idmap,
 +                      const struct path *path, struct kstat *stat,
 +                      u32 request_mask, unsigned query_flags)
 +{
 +      struct bch_inode_info *inode = to_bch_ei(d_inode(path->dentry));
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
 +
 +      stat->dev       = inode->v.i_sb->s_dev;
 +      stat->ino       = inode->v.i_ino;
 +      stat->mode      = inode->v.i_mode;
 +      stat->nlink     = inode->v.i_nlink;
 +      stat->uid       = inode->v.i_uid;
 +      stat->gid       = inode->v.i_gid;
 +      stat->rdev      = inode->v.i_rdev;
 +      stat->size      = i_size_read(&inode->v);
 +      stat->atime     = inode_get_atime(&inode->v);
 +      stat->mtime     = inode_get_mtime(&inode->v);
 +      stat->ctime     = inode_get_ctime(&inode->v);
 +      stat->blksize   = block_bytes(c);
 +      stat->blocks    = inode->v.i_blocks;
 +
 +      if (request_mask & STATX_BTIME) {
 +              stat->result_mask |= STATX_BTIME;
 +              stat->btime = bch2_time_to_timespec(c, inode->ei_inode.bi_otime);
 +      }
 +
 +      if (inode->ei_inode.bi_flags & BCH_INODE_IMMUTABLE)
 +              stat->attributes |= STATX_ATTR_IMMUTABLE;
 +      stat->attributes_mask    |= STATX_ATTR_IMMUTABLE;
 +
 +      if (inode->ei_inode.bi_flags & BCH_INODE_APPEND)
 +              stat->attributes |= STATX_ATTR_APPEND;
 +      stat->attributes_mask    |= STATX_ATTR_APPEND;
 +
 +      if (inode->ei_inode.bi_flags & BCH_INODE_NODUMP)
 +              stat->attributes |= STATX_ATTR_NODUMP;
 +      stat->attributes_mask    |= STATX_ATTR_NODUMP;
 +
 +      return 0;
 +}
 +
 +static int bch2_setattr(struct mnt_idmap *idmap,
 +                      struct dentry *dentry, struct iattr *iattr)
 +{
 +      struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
 +      int ret;
 +
 +      lockdep_assert_held(&inode->v.i_rwsem);
 +
 +      ret = setattr_prepare(idmap, dentry, iattr);
 +      if (ret)
 +              return ret;
 +
 +      return iattr->ia_valid & ATTR_SIZE
 +              ? bchfs_truncate(idmap, inode, iattr)
 +              : bch2_setattr_nonsize(idmap, inode, iattr);
 +}
 +
 +static int bch2_tmpfile(struct mnt_idmap *idmap,
 +                      struct inode *vdir, struct file *file, umode_t mode)
 +{
 +      struct bch_inode_info *inode =
 +              __bch2_create(idmap, to_bch_ei(vdir),
 +                            file->f_path.dentry, mode, 0,
 +                            (subvol_inum) { 0 }, BCH_CREATE_TMPFILE);
 +
 +      if (IS_ERR(inode))
 +              return bch2_err_class(PTR_ERR(inode));
 +
 +      d_mark_tmpfile(file, &inode->v);
 +      d_instantiate(file->f_path.dentry, &inode->v);
 +      return finish_open_simple(file, 0);
 +}
 +
 +static int bch2_fill_extent(struct bch_fs *c,
 +                          struct fiemap_extent_info *info,
 +                          struct bkey_s_c k, unsigned flags)
 +{
 +      if (bkey_extent_is_direct_data(k.k)) {
 +              struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 +              const union bch_extent_entry *entry;
 +              struct extent_ptr_decoded p;
 +              int ret;
 +
 +              if (k.k->type == KEY_TYPE_reflink_v)
 +                      flags |= FIEMAP_EXTENT_SHARED;
 +
 +              bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
 +                      int flags2 = 0;
 +                      u64 offset = p.ptr.offset;
 +
 +                      if (p.ptr.unwritten)
 +                              flags2 |= FIEMAP_EXTENT_UNWRITTEN;
 +
 +                      if (p.crc.compression_type)
 +                              flags2 |= FIEMAP_EXTENT_ENCODED;
 +                      else
 +                              offset += p.crc.offset;
 +
 +                      if ((offset & (block_sectors(c) - 1)) ||
 +                          (k.k->size & (block_sectors(c) - 1)))
 +                              flags2 |= FIEMAP_EXTENT_NOT_ALIGNED;
 +
 +                      ret = fiemap_fill_next_extent(info,
 +                                              bkey_start_offset(k.k) << 9,
 +                                              offset << 9,
 +                                              k.k->size << 9, flags|flags2);
 +                      if (ret)
 +                              return ret;
 +              }
 +
 +              return 0;
 +      } else if (bkey_extent_is_inline_data(k.k)) {
 +              return fiemap_fill_next_extent(info,
 +                                             bkey_start_offset(k.k) << 9,
 +                                             0, k.k->size << 9,
 +                                             flags|
 +                                             FIEMAP_EXTENT_DATA_INLINE);
 +      } else if (k.k->type == KEY_TYPE_reservation) {
 +              return fiemap_fill_next_extent(info,
 +                                             bkey_start_offset(k.k) << 9,
 +                                             0, k.k->size << 9,
 +                                             flags|
 +                                             FIEMAP_EXTENT_DELALLOC|
 +                                             FIEMAP_EXTENT_UNWRITTEN);
 +      } else {
 +              BUG();
 +      }
 +}
 +
 +static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 +                     u64 start, u64 len)
 +{
 +      struct bch_fs *c = vinode->i_sb->s_fs_info;
 +      struct bch_inode_info *ei = to_bch_ei(vinode);
 +      struct btree_trans *trans;
 +      struct btree_iter iter;
 +      struct bkey_s_c k;
 +      struct bkey_buf cur, prev;
 +      struct bpos end = POS(ei->v.i_ino, (start + len) >> 9);
 +      unsigned offset_into_extent, sectors;
 +      bool have_extent = false;
 +      u32 snapshot;
 +      int ret = 0;
 +
 +      ret = fiemap_prep(&ei->v, info, start, &len, FIEMAP_FLAG_SYNC);
 +      if (ret)
 +              return ret;
 +
 +      if (start + len < start)
 +              return -EINVAL;
 +
 +      start >>= 9;
 +
 +      bch2_bkey_buf_init(&cur);
 +      bch2_bkey_buf_init(&prev);
 +      trans = bch2_trans_get(c);
 +retry:
 +      bch2_trans_begin(trans);
 +
 +      ret = bch2_subvolume_get_snapshot(trans, ei->ei_subvol, &snapshot);
 +      if (ret)
 +              goto err;
 +
 +      bch2_trans_iter_init(trans, &iter, BTREE_ID_extents,
 +                           SPOS(ei->v.i_ino, start, snapshot), 0);
 +
 +      while (!(ret = btree_trans_too_many_iters(trans)) &&
 +             (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
 +             !(ret = bkey_err(k))) {
 +              enum btree_id data_btree = BTREE_ID_extents;
 +
 +              if (!bkey_extent_is_data(k.k) &&
 +                  k.k->type != KEY_TYPE_reservation) {
 +                      bch2_btree_iter_advance(&iter);
 +                      continue;
 +              }
 +
 +              offset_into_extent      = iter.pos.offset -
 +                      bkey_start_offset(k.k);
 +              sectors                 = k.k->size - offset_into_extent;
 +
 +              bch2_bkey_buf_reassemble(&cur, c, k);
 +
 +              ret = bch2_read_indirect_extent(trans, &data_btree,
 +                                      &offset_into_extent, &cur);
 +              if (ret)
 +                      break;
 +
 +              k = bkey_i_to_s_c(cur.k);
 +              bch2_bkey_buf_realloc(&prev, c, k.k->u64s);
 +
 +              sectors = min(sectors, k.k->size - offset_into_extent);
 +
 +              bch2_cut_front(POS(k.k->p.inode,
 +                                 bkey_start_offset(k.k) +
 +                                 offset_into_extent),
 +                             cur.k);
 +              bch2_key_resize(&cur.k->k, sectors);
 +              cur.k->k.p = iter.pos;
 +              cur.k->k.p.offset += cur.k->k.size;
 +
 +              if (have_extent) {
 +                      bch2_trans_unlock(trans);
 +                      ret = bch2_fill_extent(c, info,
 +                                      bkey_i_to_s_c(prev.k), 0);
 +                      if (ret)
 +                              break;
 +              }
 +
 +              bkey_copy(prev.k, cur.k);
 +              have_extent = true;
 +
 +              bch2_btree_iter_set_pos(&iter,
 +                      POS(iter.pos.inode, iter.pos.offset + sectors));
 +      }
 +      start = iter.pos.offset;
 +      bch2_trans_iter_exit(trans, &iter);
 +err:
 +      if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +              goto retry;
 +
 +      if (!ret && have_extent) {
 +              bch2_trans_unlock(trans);
 +              ret = bch2_fill_extent(c, info, bkey_i_to_s_c(prev.k),
 +                                     FIEMAP_EXTENT_LAST);
 +      }
 +
 +      bch2_trans_put(trans);
 +      bch2_bkey_buf_exit(&cur, c);
 +      bch2_bkey_buf_exit(&prev, c);
 +      return ret < 0 ? ret : 0;
 +}
 +
 +static const struct vm_operations_struct bch_vm_ops = {
 +      .fault          = bch2_page_fault,
 +      .map_pages      = filemap_map_pages,
 +      .page_mkwrite   = bch2_page_mkwrite,
 +};
 +
 +static int bch2_mmap(struct file *file, struct vm_area_struct *vma)
 +{
 +      file_accessed(file);
 +
 +      vma->vm_ops = &bch_vm_ops;
 +      return 0;
 +}
 +
 +/* Directories: */
 +
 +static loff_t bch2_dir_llseek(struct file *file, loff_t offset, int whence)
 +{
 +      return generic_file_llseek_size(file, offset, whence,
 +                                      S64_MAX, S64_MAX);
 +}
 +
 +static int bch2_vfs_readdir(struct file *file, struct dir_context *ctx)
 +{
 +      struct bch_inode_info *inode = file_bch_inode(file);
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
 +      int ret;
 +
 +      if (!dir_emit_dots(file, ctx))
 +              return 0;
 +
 +      ret = bch2_readdir(c, inode_inum(inode), ctx);
 +      if (ret)
 +              bch_err_fn(c, ret);
 +
 +      return bch2_err_class(ret);
 +}
 +
 +static const struct file_operations bch_file_operations = {
 +      .llseek         = bch2_llseek,
 +      .read_iter      = bch2_read_iter,
 +      .write_iter     = bch2_write_iter,
 +      .mmap           = bch2_mmap,
 +      .open           = generic_file_open,
 +      .fsync          = bch2_fsync,
 +      .splice_read    = filemap_splice_read,
 +      .splice_write   = iter_file_splice_write,
 +      .fallocate      = bch2_fallocate_dispatch,
 +      .unlocked_ioctl = bch2_fs_file_ioctl,
 +#ifdef CONFIG_COMPAT
 +      .compat_ioctl   = bch2_compat_fs_ioctl,
 +#endif
 +      .remap_file_range = bch2_remap_file_range,
 +};
 +
 +static const struct inode_operations bch_file_inode_operations = {
 +      .getattr        = bch2_getattr,
 +      .setattr        = bch2_setattr,
 +      .fiemap         = bch2_fiemap,
 +      .listxattr      = bch2_xattr_list,
 +#ifdef CONFIG_BCACHEFS_POSIX_ACL
 +      .get_acl        = bch2_get_acl,
 +      .set_acl        = bch2_set_acl,
 +#endif
 +};
 +
 +static const struct inode_operations bch_dir_inode_operations = {
 +      .lookup         = bch2_lookup,
 +      .create         = bch2_create,
 +      .link           = bch2_link,
 +      .unlink         = bch2_unlink,
 +      .symlink        = bch2_symlink,
 +      .mkdir          = bch2_mkdir,
 +      .rmdir          = bch2_unlink,
 +      .mknod          = bch2_mknod,
 +      .rename         = bch2_rename2,
 +      .getattr        = bch2_getattr,
 +      .setattr        = bch2_setattr,
 +      .tmpfile        = bch2_tmpfile,
 +      .listxattr      = bch2_xattr_list,
 +#ifdef CONFIG_BCACHEFS_POSIX_ACL
 +      .get_acl        = bch2_get_acl,
 +      .set_acl        = bch2_set_acl,
 +#endif
 +};
 +
 +static const struct file_operations bch_dir_file_operations = {
 +      .llseek         = bch2_dir_llseek,
 +      .read           = generic_read_dir,
 +      .iterate_shared = bch2_vfs_readdir,
 +      .fsync          = bch2_fsync,
 +      .unlocked_ioctl = bch2_fs_file_ioctl,
 +#ifdef CONFIG_COMPAT
 +      .compat_ioctl   = bch2_compat_fs_ioctl,
 +#endif
 +};
 +
 +static const struct inode_operations bch_symlink_inode_operations = {
 +      .get_link       = page_get_link,
 +      .getattr        = bch2_getattr,
 +      .setattr        = bch2_setattr,
 +      .listxattr      = bch2_xattr_list,
 +#ifdef CONFIG_BCACHEFS_POSIX_ACL
 +      .get_acl        = bch2_get_acl,
 +      .set_acl        = bch2_set_acl,
 +#endif
 +};
 +
 +static const struct inode_operations bch_special_inode_operations = {
 +      .getattr        = bch2_getattr,
 +      .setattr        = bch2_setattr,
 +      .listxattr      = bch2_xattr_list,
 +#ifdef CONFIG_BCACHEFS_POSIX_ACL
 +      .get_acl        = bch2_get_acl,
 +      .set_acl        = bch2_set_acl,
 +#endif
 +};
 +
 +static const struct address_space_operations bch_address_space_operations = {
 +      .read_folio     = bch2_read_folio,
 +      .writepages     = bch2_writepages,
 +      .readahead      = bch2_readahead,
 +      .dirty_folio    = filemap_dirty_folio,
 +      .write_begin    = bch2_write_begin,
 +      .write_end      = bch2_write_end,
 +      .invalidate_folio = bch2_invalidate_folio,
 +      .release_folio  = bch2_release_folio,
 +      .direct_IO      = noop_direct_IO,
 +#ifdef CONFIG_MIGRATION
 +      .migrate_folio  = filemap_migrate_folio,
 +#endif
 +      .error_remove_page = generic_error_remove_page,
 +};
 +
 +struct bcachefs_fid {
 +      u64             inum;
 +      u32             subvol;
 +      u32             gen;
 +} __packed;
 +
 +struct bcachefs_fid_with_parent {
 +      struct bcachefs_fid     fid;
 +      struct bcachefs_fid     dir;
 +} __packed;
 +
 +static int bcachefs_fid_valid(int fh_len, int fh_type)
 +{
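 +      /* exportfs file handle lengths are counted in 32-bit words: */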
 +      switch (fh_type) {
 +      case FILEID_BCACHEFS_WITHOUT_PARENT:
 +              return fh_len == sizeof(struct bcachefs_fid) / sizeof(u32);
 +      case FILEID_BCACHEFS_WITH_PARENT:
 +              return fh_len == sizeof(struct bcachefs_fid_with_parent) / sizeof(u32);
 +      default:
 +              return false;
 +      }
 +}
 +
 +static struct bcachefs_fid bch2_inode_to_fid(struct bch_inode_info *inode)
 +{
 +      return (struct bcachefs_fid) {
 +              .inum   = inode->ei_inode.bi_inum,
 +              .subvol = inode->ei_subvol,
 +              .gen    = inode->ei_inode.bi_generation,
 +      };
 +}
 +
 +static int bch2_encode_fh(struct inode *vinode, u32 *fh, int *len,
 +                        struct inode *vdir)
 +{
 +      struct bch_inode_info *inode    = to_bch_ei(vinode);
 +      struct bch_inode_info *dir      = to_bch_ei(vdir);
 +
 +      if (*len < sizeof(struct bcachefs_fid_with_parent) / sizeof(u32))
 +              return FILEID_INVALID;
 +
 +      if (!S_ISDIR(inode->v.i_mode) && dir) {
 +              struct bcachefs_fid_with_parent *fid = (void *) fh;
 +
 +              fid->fid = bch2_inode_to_fid(inode);
 +              fid->dir = bch2_inode_to_fid(dir);
 +
 +              *len = sizeof(*fid) / sizeof(u32);
 +              return FILEID_BCACHEFS_WITH_PARENT;
 +      } else {
 +              struct bcachefs_fid *fid = (void *) fh;
 +
 +              *fid = bch2_inode_to_fid(inode);
 +
 +              *len = sizeof(*fid) / sizeof(u32);
 +              return FILEID_BCACHEFS_WITHOUT_PARENT;
 +      }
 +}
 +
 +static struct inode *bch2_nfs_get_inode(struct super_block *sb,
 +                                      struct bcachefs_fid fid)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +      struct inode *vinode = bch2_vfs_inode_get(c, (subvol_inum) {
 +                                  .subvol = fid.subvol,
 +                                  .inum = fid.inum,
 +      });
 +      if (!IS_ERR(vinode) && vinode->i_generation != fid.gen) {
 +              iput(vinode);
 +              vinode = ERR_PTR(-ESTALE);
 +      }
 +      return vinode;
 +}
 +
 +static struct dentry *bch2_fh_to_dentry(struct super_block *sb, struct fid *_fid,
 +              int fh_len, int fh_type)
 +{
 +      struct bcachefs_fid *fid = (void *) _fid;
 +
 +      if (!bcachefs_fid_valid(fh_len, fh_type))
 +              return NULL;
 +
 +      return d_obtain_alias(bch2_nfs_get_inode(sb, *fid));
 +}
 +
 +static struct dentry *bch2_fh_to_parent(struct super_block *sb, struct fid *_fid,
 +              int fh_len, int fh_type)
 +{
 +      struct bcachefs_fid_with_parent *fid = (void *) _fid;
 +
 +      if (!bcachefs_fid_valid(fh_len, fh_type) ||
 +          fh_type != FILEID_BCACHEFS_WITH_PARENT)
 +              return NULL;
 +
 +      return d_obtain_alias(bch2_nfs_get_inode(sb, fid->dir));
 +}
 +
 +static struct dentry *bch2_get_parent(struct dentry *child)
 +{
 +      struct bch_inode_info *inode = to_bch_ei(child->d_inode);
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
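 +      /* the inode's dirent backref fields record its parent directory: */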
 +      subvol_inum parent_inum = {
 +              .subvol = inode->ei_inode.bi_parent_subvol ?:
 +                      inode->ei_subvol,
 +              .inum = inode->ei_inode.bi_dir,
 +      };
 +
 +      if (!parent_inum.inum)
 +              return NULL;
 +
 +      return d_obtain_alias(bch2_vfs_inode_get(c, parent_inum));
 +}
 +
 +static int bch2_get_name(struct dentry *parent, char *name, struct dentry *child)
 +{
 +      struct bch_inode_info *inode    = to_bch_ei(child->d_inode);
 +      struct bch_inode_info *dir      = to_bch_ei(parent->d_inode);
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
 +      struct btree_trans *trans;
 +      struct btree_iter iter1;
 +      struct btree_iter iter2;
 +      struct bkey_s_c k;
 +      struct bkey_s_c_dirent d;
 +      struct bch_inode_unpacked inode_u;
 +      subvol_inum target;
 +      u32 snapshot;
 +      struct qstr dirent_name;
 +      unsigned name_len = 0;
 +      int ret;
 +
 +      if (!S_ISDIR(dir->v.i_mode))
 +              return -EINVAL;
 +
 +      trans = bch2_trans_get(c);
 +
 +      bch2_trans_iter_init(trans, &iter1, BTREE_ID_dirents,
 +                           POS(dir->ei_inode.bi_inum, 0), 0);
 +      bch2_trans_iter_init(trans, &iter2, BTREE_ID_dirents,
 +                           POS(dir->ei_inode.bi_inum, 0), 0);
 +retry:
 +      bch2_trans_begin(trans);
 +
 +      ret = bch2_subvolume_get_snapshot(trans, dir->ei_subvol, &snapshot);
 +      if (ret)
 +              goto err;
 +
 +      bch2_btree_iter_set_snapshot(&iter1, snapshot);
 +      bch2_btree_iter_set_snapshot(&iter2, snapshot);
 +
 +      ret = bch2_inode_find_by_inum_trans(trans, inode_inum(inode), &inode_u);
 +      if (ret)
 +              goto err;
 +
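 +      /*
 +       * Fast path: the inode's dirent backref points at this directory, so
 +       * look the dirent up directly at (bi_dir, bi_dir_offset):
 +       */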
 +      if (inode_u.bi_dir == dir->ei_inode.bi_inum) {
 +              bch2_btree_iter_set_pos(&iter1, POS(inode_u.bi_dir, inode_u.bi_dir_offset));
 +
 +              k = bch2_btree_iter_peek_slot(&iter1);
 +              ret = bkey_err(k);
 +              if (ret)
 +                      goto err;
 +
 +              if (k.k->type != KEY_TYPE_dirent) {
 +                      ret = -BCH_ERR_ENOENT_dirent_doesnt_match_inode;
 +                      goto err;
 +              }
 +
 +              d = bkey_s_c_to_dirent(k);
 +              ret = bch2_dirent_read_target(trans, inode_inum(dir), d, &target);
 +              if (ret > 0)
 +                      ret = -BCH_ERR_ENOENT_dirent_doesnt_match_inode;
 +              if (ret)
 +                      goto err;
 +
 +              if (target.subvol       == inode->ei_subvol &&
 +                  target.inum         == inode->ei_inode.bi_inum)
 +                      goto found;
 +      } else {
 +              /*
 +               * File with multiple hardlinks and our backref is to the wrong
 +               * directory - linear search:
 +               */
 +              for_each_btree_key_continue_norestart(iter2, 0, k, ret) {
 +                      if (k.k->p.inode > dir->ei_inode.bi_inum)
 +                              break;
 +
 +                      if (k.k->type != KEY_TYPE_dirent)
 +                              continue;
 +
 +                      d = bkey_s_c_to_dirent(k);
 +                      ret = bch2_dirent_read_target(trans, inode_inum(dir), d, &target);
 +                      if (ret < 0)
 +                              break;
 +                      if (ret)
 +                              continue;
 +
 +                      if (target.subvol       == inode->ei_subvol &&
 +                          target.inum         == inode->ei_inode.bi_inum)
 +                              goto found;
 +              }
 +      }
 +
 +      ret = -ENOENT;
 +      goto err;
 +found:
 +      dirent_name = bch2_dirent_get_name(d);
 +
 +      name_len = min_t(unsigned, dirent_name.len, NAME_MAX);
 +      memcpy(name, dirent_name.name, name_len);
 +      name[name_len] = '\0';
 +err:
 +      if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 +              goto retry;
 +
 +      bch2_trans_iter_exit(trans, &iter1);
 +      bch2_trans_iter_exit(trans, &iter2);
 +      bch2_trans_put(trans);
 +
 +      return ret;
 +}
 +
 +static const struct export_operations bch_export_ops = {
 +      .encode_fh      = bch2_encode_fh,
 +      .fh_to_dentry   = bch2_fh_to_dentry,
 +      .fh_to_parent   = bch2_fh_to_parent,
 +      .get_parent     = bch2_get_parent,
 +      .get_name       = bch2_get_name,
 +};
 +
 +static void bch2_vfs_inode_init(struct btree_trans *trans, subvol_inum inum,
 +                              struct bch_inode_info *inode,
 +                              struct bch_inode_unpacked *bi,
 +                              struct bch_subvolume *subvol)
 +{
 +      bch2_inode_update_after_write(trans, inode, bi, ~0);
 +
 +      if (BCH_SUBVOLUME_SNAP(subvol))
 +              set_bit(EI_INODE_SNAPSHOT, &inode->ei_flags);
 +      else
 +              clear_bit(EI_INODE_SNAPSHOT, &inode->ei_flags);
 +
 +      inode->v.i_blocks       = bi->bi_sectors;
 +      inode->v.i_ino          = bi->bi_inum;
 +      inode->v.i_rdev         = bi->bi_dev;
 +      inode->v.i_generation   = bi->bi_generation;
 +      inode->v.i_size         = bi->bi_size;
 +
 +      inode->ei_flags         = 0;
 +      inode->ei_quota_reserved = 0;
 +      inode->ei_qid           = bch_qid(bi);
 +      inode->ei_subvol        = inum.subvol;
 +
 +      inode->v.i_mapping->a_ops = &bch_address_space_operations;
 +
 +      switch (inode->v.i_mode & S_IFMT) {
 +      case S_IFREG:
 +              inode->v.i_op   = &bch_file_inode_operations;
 +              inode->v.i_fop  = &bch_file_operations;
 +              break;
 +      case S_IFDIR:
 +              inode->v.i_op   = &bch_dir_inode_operations;
 +              inode->v.i_fop  = &bch_dir_file_operations;
 +              break;
 +      case S_IFLNK:
 +              inode_nohighmem(&inode->v);
 +              inode->v.i_op   = &bch_symlink_inode_operations;
 +              break;
 +      default:
 +              init_special_inode(&inode->v, inode->v.i_mode, inode->v.i_rdev);
 +              inode->v.i_op   = &bch_special_inode_operations;
 +              break;
 +      }
 +
 +      mapping_set_large_folios(inode->v.i_mapping);
 +}
 +
 +static struct inode *bch2_alloc_inode(struct super_block *sb)
 +{
 +      struct bch_inode_info *inode;
 +
 +      inode = kmem_cache_alloc(bch2_inode_cache, GFP_NOFS);
 +      if (!inode)
 +              return NULL;
 +
 +      inode_init_once(&inode->v);
 +      mutex_init(&inode->ei_update_lock);
 +      two_state_lock_init(&inode->ei_pagecache_lock);
 +      INIT_LIST_HEAD(&inode->ei_vfs_inode_list);
 +      mutex_init(&inode->ei_quota_lock);
 +
 +      return &inode->v;
 +}
 +
 +static void bch2_i_callback(struct rcu_head *head)
 +{
 +      struct inode *vinode = container_of(head, struct inode, i_rcu);
 +      struct bch_inode_info *inode = to_bch_ei(vinode);
 +
 +      kmem_cache_free(bch2_inode_cache, inode);
 +}
 +
 +static void bch2_destroy_inode(struct inode *vinode)
 +{
 +      call_rcu(&vinode->i_rcu, bch2_i_callback);
 +}
 +
 +static int inode_update_times_fn(struct btree_trans *trans,
 +                               struct bch_inode_info *inode,
 +                               struct bch_inode_unpacked *bi,
 +                               void *p)
 +{
 +      struct bch_fs *c = inode->v.i_sb->s_fs_info;
 +
 +      bi->bi_atime    = timespec_to_bch2_time(c, inode_get_atime(&inode->v));
 +      bi->bi_mtime    = timespec_to_bch2_time(c, inode_get_mtime(&inode->v));
 +      bi->bi_ctime    = timespec_to_bch2_time(c, inode_get_ctime(&inode->v));
 +
 +      return 0;
 +}
 +
 +static int bch2_vfs_write_inode(struct inode *vinode,
 +                              struct writeback_control *wbc)
 +{
 +      struct bch_fs *c = vinode->i_sb->s_fs_info;
 +      struct bch_inode_info *inode = to_bch_ei(vinode);
 +      int ret;
 +
 +      mutex_lock(&inode->ei_update_lock);
 +      ret = bch2_write_inode(c, inode, inode_update_times_fn, NULL,
 +                             ATTR_ATIME|ATTR_MTIME|ATTR_CTIME);
 +      mutex_unlock(&inode->ei_update_lock);
 +
 +      return bch2_err_class(ret);
 +}
 +
 +static void bch2_evict_inode(struct inode *vinode)
 +{
 +      struct bch_fs *c = vinode->i_sb->s_fs_info;
 +      struct bch_inode_info *inode = to_bch_ei(vinode);
 +
 +      truncate_inode_pages_final(&inode->v.i_data);
 +
 +      clear_inode(&inode->v);
 +
 +      BUG_ON(!is_bad_inode(&inode->v) && inode->ei_quota_reserved);
 +
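 +      /* no links left: release quota charges and delete the on-disk inode */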
 +      if (!inode->v.i_nlink && !is_bad_inode(&inode->v)) {
 +              bch2_quota_acct(c, inode->ei_qid, Q_SPC, -((s64) inode->v.i_blocks),
 +                              KEY_TYPE_QUOTA_WARN);
 +              bch2_quota_acct(c, inode->ei_qid, Q_INO, -1,
 +                              KEY_TYPE_QUOTA_WARN);
 +              bch2_inode_rm(c, inode_inum(inode));
 +      }
 +
 +      mutex_lock(&c->vfs_inodes_lock);
 +      list_del_init(&inode->ei_vfs_inode_list);
 +      mutex_unlock(&c->vfs_inodes_lock);
 +}
 +
 +void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
 +{
 +      struct bch_inode_info *inode, **i;
 +      DARRAY(struct bch_inode_info *) grabbed;
 +      bool clean_pass = false, this_pass_clean;
 +
 +      /*
 +       * Initially, we scan for inodes without I_DONTCACHE, then mark them to
 +       * be pruned with d_mark_dontcache().
 +       *
 +       * Once we've had a clean pass where we didn't find any inodes without
 +       * I_DONTCACHE, we wait for them to be freed:
 +       */
 +
 +      darray_init(&grabbed);
 +      darray_make_room(&grabbed, 1024);
 +again:
 +      cond_resched();
 +      this_pass_clean = true;
 +
 +      mutex_lock(&c->vfs_inodes_lock);
 +      list_for_each_entry(inode, &c->vfs_inodes_list, ei_vfs_inode_list) {
 +              if (!snapshot_list_has_id(s, inode->ei_subvol))
 +                      continue;
 +
 +              if (!(inode->v.i_state & I_DONTCACHE) &&
 +                  !(inode->v.i_state & I_FREEING) &&
 +                  igrab(&inode->v)) {
 +                      this_pass_clean = false;
 +
 +                      if (darray_push_gfp(&grabbed, inode, GFP_ATOMIC|__GFP_NOWARN)) {
 +                              iput(&inode->v);
 +                              break;
 +                      }
 +              } else if (clean_pass && this_pass_clean) {
 +                      wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
 +                      DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
 +
 +                      prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 +                      mutex_unlock(&c->vfs_inodes_lock);
 +
 +                      schedule();
 +                      finish_wait(wq, &wait.wq_entry);
 +                      goto again;
 +              }
 +      }
 +      mutex_unlock(&c->vfs_inodes_lock);
 +
 +      darray_for_each(grabbed, i) {
 +              inode = *i;
 +              d_mark_dontcache(&inode->v);
 +              d_prune_aliases(&inode->v);
 +              iput(&inode->v);
 +      }
 +      grabbed.nr = 0;
 +
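 +      /* keep looping until two consecutive passes find nothing to grab */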
 +      if (!clean_pass || !this_pass_clean) {
 +              clean_pass = this_pass_clean;
 +              goto again;
 +      }
 +
 +      darray_exit(&grabbed);
 +}
 +
 +static int bch2_statfs(struct dentry *dentry, struct kstatfs *buf)
 +{
 +      struct super_block *sb = dentry->d_sb;
 +      struct bch_fs *c = sb->s_fs_info;
 +      struct bch_fs_usage_short usage = bch2_fs_usage_read_short(c);
 +      unsigned shift = sb->s_blocksize_bits - 9;
 +      /*
 +       * this assumes inodes take up 64 bytes, which is a decent average
 +       * number; free space here is in 512-byte sectors, so each free
 +       * sector is counted as 8 potential inodes:
 +       */
 +      u64 avail_inodes = ((usage.capacity - usage.used) << 3);
 +      u64 fsid;
 +
 +      buf->f_type     = BCACHEFS_STATFS_MAGIC;
 +      buf->f_bsize    = sb->s_blocksize;
 +      buf->f_blocks   = usage.capacity >> shift;
 +      buf->f_bfree    = usage.free >> shift;
 +      buf->f_bavail   = avail_factor(usage.free) >> shift;
 +
 +      buf->f_files    = usage.nr_inodes + avail_inodes;
 +      buf->f_ffree    = avail_inodes;
 +
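 +      /* fold the 128-bit user UUID down to the 64-bit statfs fsid: */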
 +      fsid = le64_to_cpup((void *) c->sb.user_uuid.b) ^
 +             le64_to_cpup((void *) c->sb.user_uuid.b + sizeof(u64));
 +      buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
 +      buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
 +      buf->f_namelen  = BCH_NAME_MAX;
 +
 +      return 0;
 +}
 +
 +static int bch2_sync_fs(struct super_block *sb, int wait)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +      int ret;
 +
 +      if (c->opts.journal_flush_disabled)
 +              return 0;
 +
 +      if (!wait) {
 +              bch2_journal_flush_async(&c->journal, NULL);
 +              return 0;
 +      }
 +
 +      ret = bch2_journal_flush(&c->journal);
 +      return bch2_err_class(ret);
 +}
 +
 +static struct bch_fs *bch2_path_to_fs(const char *path)
 +{
 +      struct bch_fs *c;
 +      dev_t dev;
 +      int ret;
 +
 +      ret = lookup_bdev(path, &dev);
 +      if (ret)
 +              return ERR_PTR(ret);
 +
 +      c = bch2_dev_to_fs(dev);
 +      if (c)
 +              closure_put(&c->cl);
 +      return c ?: ERR_PTR(-ENOENT);
 +}
 +
 +static char **split_devs(const char *_dev_name, unsigned *nr)
 +{
 +      char *dev_name = NULL, **devs = NULL, *s;
 +      size_t i = 0, nr_devs = 0;
 +
 +      dev_name = kstrdup(_dev_name, GFP_KERNEL);
 +      if (!dev_name)
 +              return NULL;
 +
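 +      /* device names are passed as a single colon-separated string: */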
 +      for (s = dev_name; s; s = strchr(s + 1, ':'))
 +              nr_devs++;
 +
 +      devs = kcalloc(nr_devs + 1, sizeof(const char *), GFP_KERNEL);
 +      if (!devs) {
 +              kfree(dev_name);
 +              return NULL;
 +      }
 +
 +      while ((s = strsep(&dev_name, ":")))
 +              devs[i++] = s;
 +
 +      *nr = nr_devs;
 +      return devs;
 +}
 +
 +static int bch2_remount(struct super_block *sb, int *flags, char *data)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +      struct bch_opts opts = bch2_opts_empty();
 +      int ret;
 +
 +      opt_set(opts, read_only, (*flags & SB_RDONLY) != 0);
 +
 +      ret = bch2_parse_mount_opts(c, &opts, data);
 +      if (ret)
 +              goto err;
 +
 +      if (opts.read_only != c->opts.read_only) {
 +              down_write(&c->state_lock);
 +
 +              if (opts.read_only) {
 +                      bch2_fs_read_only(c);
 +
 +                      sb->s_flags |= SB_RDONLY;
 +              } else {
 +                      ret = bch2_fs_read_write(c);
 +                      if (ret) {
 +                              bch_err(c, "error going rw: %i", ret);
 +                              up_write(&c->state_lock);
 +                              ret = -EINVAL;
 +                              goto err;
 +                      }
 +
 +                      sb->s_flags &= ~SB_RDONLY;
 +              }
 +
 +              c->opts.read_only = opts.read_only;
 +
 +              up_write(&c->state_lock);
 +      }
 +
 +      if (opt_defined(opts, errors))
 +              c->opts.errors = opts.errors;
 +err:
 +      return bch2_err_class(ret);
 +}
 +
 +static int bch2_show_devname(struct seq_file *seq, struct dentry *root)
 +{
 +      struct bch_fs *c = root->d_sb->s_fs_info;
 +      struct bch_dev *ca;
 +      unsigned i;
 +      bool first = true;
 +
 +      for_each_online_member(ca, c, i) {
 +              if (!first)
 +                      seq_putc(seq, ':');
 +              first = false;
 +              seq_puts(seq, "/dev/");
 +              seq_puts(seq, ca->name);
 +      }
 +
 +      return 0;
 +}
 +
 +static int bch2_show_options(struct seq_file *seq, struct dentry *root)
 +{
 +      struct bch_fs *c = root->d_sb->s_fs_info;
 +      enum bch_opt_id i;
 +      struct printbuf buf = PRINTBUF;
 +      int ret = 0;
 +
 +      for (i = 0; i < bch2_opts_nr; i++) {
 +              const struct bch_option *opt = &bch2_opt_table[i];
 +              u64 v = bch2_opt_get_by_id(&c->opts, i);
 +
 +              if (!(opt->flags & OPT_MOUNT))
 +                      continue;
 +
 +              if (v == bch2_opt_get_by_id(&bch2_opts_default, i))
 +                      continue;
 +
 +              printbuf_reset(&buf);
 +              bch2_opt_to_text(&buf, c, c->disk_sb.sb, opt, v,
 +                               OPT_SHOW_MOUNT_STYLE);
 +              seq_putc(seq, ',');
 +              seq_puts(seq, buf.buf);
 +      }
 +
 +      if (buf.allocation_failure)
 +              ret = -ENOMEM;
 +      printbuf_exit(&buf);
 +      return ret;
 +}
 +
 +static void bch2_put_super(struct super_block *sb)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +
 +      __bch2_fs_stop(c);
 +}
 +
 +/*
 + * bcachefs doesn't currently integrate intwrite freeze protection but the
 + * internal write references serve the same purpose. Therefore reuse the
 + * read-only transition code to perform the quiesce. The caveat is that we don't
 + * currently have the ability to block tasks that want a write reference while
 + * the superblock is frozen. This is fine for now, but we should either add
 + * blocking support or find a way to integrate sb_start_intwrite() and friends.
 + */
 +static int bch2_freeze(struct super_block *sb)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +
 +      down_write(&c->state_lock);
 +      bch2_fs_read_only(c);
 +      up_write(&c->state_lock);
 +      return 0;
 +}
 +
 +static int bch2_unfreeze(struct super_block *sb)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +      int ret;
 +
 +      down_write(&c->state_lock);
 +      ret = bch2_fs_read_write(c);
 +      up_write(&c->state_lock);
 +      return ret;
 +}
 +
 +static const struct super_operations bch_super_operations = {
 +      .alloc_inode    = bch2_alloc_inode,
 +      .destroy_inode  = bch2_destroy_inode,
 +      .write_inode    = bch2_vfs_write_inode,
 +      .evict_inode    = bch2_evict_inode,
 +      .sync_fs        = bch2_sync_fs,
 +      .statfs         = bch2_statfs,
 +      .show_devname   = bch2_show_devname,
 +      .show_options   = bch2_show_options,
 +      .remount_fs     = bch2_remount,
 +      .put_super      = bch2_put_super,
 +      .freeze_fs      = bch2_freeze,
 +      .unfreeze_fs    = bch2_unfreeze,
 +};
 +
 +static int bch2_set_super(struct super_block *s, void *data)
 +{
 +      s->s_fs_info = data;
 +      return 0;
 +}
 +
 +static int bch2_noset_super(struct super_block *s, void *data)
 +{
 +      return -EBUSY;
 +}
 +
 +static int bch2_test_super(struct super_block *s, void *data)
 +{
 +      struct bch_fs *c = s->s_fs_info;
 +      struct bch_fs **devs = data;
 +      unsigned i;
 +
 +      if (!c)
 +              return false;
 +
 +      for (i = 0; devs[i]; i++)
 +              if (c != devs[i])
 +                      return false;
 +      return true;
 +}
 +
 +static struct dentry *bch2_mount(struct file_system_type *fs_type,
 +                               int flags, const char *dev_name, void *data)
 +{
 +      struct bch_fs *c;
 +      struct bch_dev *ca;
 +      struct super_block *sb;
 +      struct inode *vinode;
 +      struct bch_opts opts = bch2_opts_empty();
 +      char **devs;
 +      struct bch_fs **devs_to_fs = NULL;
 +      unsigned i, nr_devs;
 +      int ret;
 +
 +      opt_set(opts, read_only, (flags & SB_RDONLY) != 0);
 +
 +      ret = bch2_parse_mount_opts(NULL, &opts, data);
 +      if (ret)
 +              return ERR_PTR(ret);
 +
 +      if (!dev_name || strlen(dev_name) == 0)
 +              return ERR_PTR(-EINVAL);
 +
 +      devs = split_devs(dev_name, &nr_devs);
 +      if (!devs)
 +              return ERR_PTR(-ENOMEM);
 +
 +      devs_to_fs = kcalloc(nr_devs + 1, sizeof(void *), GFP_KERNEL);
 +      if (!devs_to_fs) {
 +              sb = ERR_PTR(-ENOMEM);
 +              goto got_sb;
 +      }
 +
 +      for (i = 0; i < nr_devs; i++)
 +              devs_to_fs[i] = bch2_path_to_fs(devs[i]);
 +
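 +      /*
 +       * See if these devices already belong to a mounted filesystem;
 +       * bch2_noset_super keeps sget() from creating a new superblock here,
 +       * so success means we found an existing one:
 +       */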
 +      sb = sget(fs_type, bch2_test_super, bch2_noset_super,
 +                flags|SB_NOSEC, devs_to_fs);
 +      if (!IS_ERR(sb))
 +              goto got_sb;
 +
 +      c = bch2_fs_open(devs, nr_devs, opts);
 +      if (IS_ERR(c)) {
 +              sb = ERR_CAST(c);
 +              goto got_sb;
 +      }
 +
 +      /* Some options can't be parsed until after the fs is started: */
 +      ret = bch2_parse_mount_opts(c, &opts, data);
 +      if (ret) {
 +              bch2_fs_stop(c);
 +              sb = ERR_PTR(ret);
 +              goto got_sb;
 +      }
 +
 +      bch2_opts_apply(&c->opts, opts);
 +
 +      sb = sget(fs_type, NULL, bch2_set_super, flags|SB_NOSEC, c);
 +      if (IS_ERR(sb))
 +              bch2_fs_stop(c);
 +got_sb:
 +      kfree(devs_to_fs);
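 +      /* devs[0] is the kstrdup()ed buffer that split_devs() carved up: */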
 +      kfree(devs[0]);
 +      kfree(devs);
 +
 +      if (IS_ERR(sb)) {
 +              ret = PTR_ERR(sb);
 +              ret = bch2_err_class(ret);
 +              return ERR_PTR(ret);
 +      }
 +
 +      c = sb->s_fs_info;
 +
 +      if (sb->s_root) {
 +              if ((flags ^ sb->s_flags) & SB_RDONLY) {
 +                      ret = -EBUSY;
 +                      goto err_put_super;
 +              }
 +              goto out;
 +      }
 +
 +      sb->s_blocksize         = block_bytes(c);
 +      sb->s_blocksize_bits    = ilog2(block_bytes(c));
 +      sb->s_maxbytes          = MAX_LFS_FILESIZE;
 +      sb->s_op                = &bch_super_operations;
 +      sb->s_export_op         = &bch_export_ops;
 +#ifdef CONFIG_BCACHEFS_QUOTA
 +      sb->s_qcop              = &bch2_quotactl_operations;
 +      sb->s_quota_types       = QTYPE_MASK_USR|QTYPE_MASK_GRP|QTYPE_MASK_PRJ;
 +#endif
 +      sb->s_xattr             = bch2_xattr_handlers;
 +      sb->s_magic             = BCACHEFS_STATFS_MAGIC;
 +      sb->s_time_gran         = c->sb.nsec_per_time_unit;
 +      sb->s_time_min          = div_s64(S64_MIN, c->sb.time_units_per_sec) + 1;
 +      sb->s_time_max          = div_s64(S64_MAX, c->sb.time_units_per_sec);
 +      c->vfs_sb               = sb;
 +      strscpy(sb->s_id, c->name, sizeof(sb->s_id));
 +
 +      ret = super_setup_bdi(sb);
 +      if (ret)
 +              goto err_put_super;
 +
 +      sb->s_bdi->ra_pages             = VM_READAHEAD_PAGES;
 +
 +      for_each_online_member(ca, c, i) {
 +              struct block_device *bdev = ca->disk_sb.bdev;
 +
 +              /* XXX: create an anonymous device for multi device filesystems */
 +              sb->s_bdev      = bdev;
 +              sb->s_dev       = bdev->bd_dev;
 +              percpu_ref_put(&ca->io_ref);
 +              break;
 +      }
 +
 +      c->dev = sb->s_dev;
 +
 +#ifdef CONFIG_BCACHEFS_POSIX_ACL
 +      if (c->opts.acl)
 +              sb->s_flags     |= SB_POSIXACL;
 +#endif
 +
++      sb->s_shrink->seeks = 0;
 +
 +      vinode = bch2_vfs_inode_get(c, BCACHEFS_ROOT_SUBVOL_INUM);
 +      ret = PTR_ERR_OR_ZERO(vinode);
 +      if (ret) {
 +              bch_err_msg(c, ret, "mounting: error getting root inode");
 +              goto err_put_super;
 +      }
 +
 +      sb->s_root = d_make_root(vinode);
 +      if (!sb->s_root) {
 +              bch_err(c, "error mounting: error allocating root dentry");
 +              ret = -ENOMEM;
 +              goto err_put_super;
 +      }
 +
 +      sb->s_flags |= SB_ACTIVE;
 +out:
 +      return dget(sb->s_root);
 +
 +err_put_super:
 +      sb->s_fs_info = NULL;
 +      c->vfs_sb = NULL;
 +      deactivate_locked_super(sb);
 +      bch2_fs_stop(c);
 +      return ERR_PTR(bch2_err_class(ret));
 +}
 +
 +static void bch2_kill_sb(struct super_block *sb)
 +{
 +      struct bch_fs *c = sb->s_fs_info;
 +
 +      if (c)
 +              c->vfs_sb = NULL;
 +      generic_shutdown_super(sb);
 +      if (c)
 +              bch2_fs_free(c);
 +}
 +
 +static struct file_system_type bcache_fs_type = {
 +      .owner          = THIS_MODULE,
 +      .name           = "bcachefs",
 +      .mount          = bch2_mount,
 +      .kill_sb        = bch2_kill_sb,
 +      .fs_flags       = FS_REQUIRES_DEV,
 +};
 +
 +MODULE_ALIAS_FS("bcachefs");
 +
 +void bch2_vfs_exit(void)
 +{
 +      unregister_filesystem(&bcache_fs_type);
 +      kmem_cache_destroy(bch2_inode_cache);
 +}
 +
 +int __init bch2_vfs_init(void)
 +{
 +      int ret = -ENOMEM;
 +
 +      bch2_inode_cache = KMEM_CACHE(bch_inode_info, SLAB_RECLAIM_ACCOUNT);
 +      if (!bch2_inode_cache)
 +              goto err;
 +
 +      ret = register_filesystem(&bcache_fs_type);
 +      if (ret)
 +              goto err;
 +
 +      return 0;
 +err:
 +      bch2_vfs_exit();
 +      return ret;
 +}
 +
 +#endif /* NO_BCACHEFS_FS */
diff --combined fs/bcachefs/sysfs.c
index eb764b9a4629696e9444103c272ad01ed07643c0,0000000000000000000000000000000000000000..397116966a7cd40ef629b98cf16670476ff583a6
mode 100644,000000..100644
--- /dev/null
@@@ -1,1031 -1,0 +1,1031 @@@
-               c->btree_cache.shrink.scan_objects(&c->btree_cache.shrink, &sc);
 +// SPDX-License-Identifier: GPL-2.0
 +/*
 + * bcache sysfs interfaces
 + *
 + * Copyright 2010, 2011 Kent Overstreet <[email protected]>
 + * Copyright 2012 Google, Inc.
 + */
 +
 +#ifndef NO_BCACHEFS_SYSFS
 +
 +#include "bcachefs.h"
 +#include "alloc_background.h"
 +#include "alloc_foreground.h"
 +#include "sysfs.h"
 +#include "btree_cache.h"
 +#include "btree_io.h"
 +#include "btree_iter.h"
 +#include "btree_key_cache.h"
 +#include "btree_update.h"
 +#include "btree_update_interior.h"
 +#include "btree_gc.h"
 +#include "buckets.h"
 +#include "clock.h"
 +#include "disk_groups.h"
 +#include "ec.h"
 +#include "inode.h"
 +#include "journal.h"
 +#include "keylist.h"
 +#include "move.h"
 +#include "movinggc.h"
 +#include "nocow_locking.h"
 +#include "opts.h"
 +#include "rebalance.h"
 +#include "replicas.h"
 +#include "super-io.h"
 +#include "tests.h"
 +
 +#include <linux/blkdev.h>
 +#include <linux/sort.h>
 +#include <linux/sched/clock.h>
 +
 +#include "util.h"
 +
 +#define SYSFS_OPS(type)                                                       \
 +const struct sysfs_ops type ## _sysfs_ops = {                         \
 +      .show   = type ## _show,                                        \
 +      .store  = type ## _store                                        \
 +}
 +
 +#define SHOW(fn)                                                      \
 +static ssize_t fn ## _to_text(struct printbuf *,                      \
 +                            struct kobject *, struct attribute *);    \
 +                                                                      \
 +static ssize_t fn ## _show(struct kobject *kobj, struct attribute *attr,\
 +                         char *buf)                                   \
 +{                                                                     \
 +      struct printbuf out = PRINTBUF;                                 \
 +      ssize_t ret = fn ## _to_text(&out, kobj, attr);                 \
 +                                                                      \
 +      if (out.pos && out.buf[out.pos - 1] != '\n')                    \
 +              prt_newline(&out);                                      \
 +                                                                      \
 +      if (!ret && out.allocation_failure)                             \
 +              ret = -ENOMEM;                                          \
 +                                                                      \
 +      if (!ret) {                                                     \
 +              ret = min_t(size_t, out.pos, PAGE_SIZE - 1);            \
 +              memcpy(buf, out.buf, ret);                              \
 +      }                                                               \
 +      printbuf_exit(&out);                                            \
 +      return bch2_err_class(ret);                                     \
 +}                                                                     \
 +                                                                      \
 +static ssize_t fn ## _to_text(struct printbuf *out, struct kobject *kobj,\
 +                            struct attribute *attr)
 +
 +#define STORE(fn)                                                     \
 +static ssize_t fn ## _store_inner(struct kobject *, struct attribute *,\
 +                          const char *, size_t);                      \
 +                                                                      \
 +static ssize_t fn ## _store(struct kobject *kobj, struct attribute *attr,\
 +                          const char *buf, size_t size)               \
 +{                                                                     \
 +      return bch2_err_class(fn##_store_inner(kobj, attr, buf, size)); \
 +}                                                                     \
 +                                                                      \
 +static ssize_t fn ## _store_inner(struct kobject *kobj, struct attribute *attr,\
 +                                const char *buf, size_t size)
 +
 +#define __sysfs_attribute(_name, _mode)                                       \
 +      static struct attribute sysfs_##_name =                         \
 +              { .name = #_name, .mode = _mode }
 +
 +#define write_attribute(n)    __sysfs_attribute(n, 0200)
 +#define read_attribute(n)     __sysfs_attribute(n, 0444)
 +#define rw_attribute(n)               __sysfs_attribute(n, 0644)
 +
 +#define sysfs_printf(file, fmt, ...)                                  \
 +do {                                                                  \
 +      if (attr == &sysfs_ ## file)                                    \
 +              prt_printf(out, fmt "\n", __VA_ARGS__);                 \
 +} while (0)
 +
 +#define sysfs_print(file, var)                                                \
 +do {                                                                  \
 +      if (attr == &sysfs_ ## file)                                    \
 +              snprint(out, var);                                      \
 +} while (0)
 +
 +#define sysfs_hprint(file, val)                                               \
 +do {                                                                  \
 +      if (attr == &sysfs_ ## file)                                    \
 +              prt_human_readable_s64(out, val);                       \
 +} while (0)
 +
 +#define sysfs_strtoul(file, var)                                      \
 +do {                                                                  \
 +      if (attr == &sysfs_ ## file)                                    \
 +              return strtoul_safe(buf, var) ?: (ssize_t) size;        \
 +} while (0)
 +
 +#define sysfs_strtoul_clamp(file, var, min, max)                      \
 +do {                                                                  \
 +      if (attr == &sysfs_ ## file)                                    \
 +              return strtoul_safe_clamp(buf, var, min, max)           \
 +                      ?: (ssize_t) size;                              \
 +} while (0)
 +
 +#define strtoul_or_return(cp)                                         \
 +({                                                                    \
 +      unsigned long _v;                                               \
 +      int _r = kstrtoul(cp, 10, &_v);                                 \
 +      if (_r)                                                         \
 +              return _r;                                              \
 +      _v;                                                             \
 +})
 +
 +write_attribute(trigger_gc);
 +write_attribute(trigger_discards);
 +write_attribute(trigger_invalidates);
 +write_attribute(prune_cache);
 +write_attribute(btree_wakeup);
 +rw_attribute(btree_gc_periodic);
 +rw_attribute(gc_gens_pos);
 +
 +read_attribute(uuid);
 +read_attribute(minor);
 +read_attribute(bucket_size);
 +read_attribute(first_bucket);
 +read_attribute(nbuckets);
 +rw_attribute(durability);
 +read_attribute(iodone);
 +
 +read_attribute(io_latency_read);
 +read_attribute(io_latency_write);
 +read_attribute(io_latency_stats_read);
 +read_attribute(io_latency_stats_write);
 +read_attribute(congested);
 +
 +read_attribute(btree_write_stats);
 +
 +read_attribute(btree_cache_size);
 +read_attribute(compression_stats);
 +read_attribute(journal_debug);
 +read_attribute(btree_updates);
 +read_attribute(btree_cache);
 +read_attribute(btree_key_cache);
 +read_attribute(stripes_heap);
 +read_attribute(open_buckets);
 +read_attribute(open_buckets_partial);
 +read_attribute(write_points);
 +read_attribute(nocow_lock_table);
 +
 +#ifdef BCH_WRITE_REF_DEBUG
 +read_attribute(write_refs);
 +
 +static const char * const bch2_write_refs[] = {
 +#define x(n)  #n,
 +      BCH_WRITE_REFS()
 +#undef x
 +      NULL
 +};
 +
 +static void bch2_write_refs_to_text(struct printbuf *out, struct bch_fs *c)
 +{
 +      bch2_printbuf_tabstop_push(out, 24);
 +
 +      for (unsigned i = 0; i < ARRAY_SIZE(c->writes); i++) {
 +              prt_str(out, bch2_write_refs[i]);
 +              prt_tab(out);
 +              prt_printf(out, "%li", atomic_long_read(&c->writes[i]));
 +              prt_newline(out);
 +      }
 +}
 +#endif
 +
 +read_attribute(internal_uuid);
 +read_attribute(disk_groups);
 +
 +read_attribute(has_data);
 +read_attribute(alloc_debug);
 +
 +#define x(t, n, ...) read_attribute(t);
 +BCH_PERSISTENT_COUNTERS()
 +#undef x
 +
 +rw_attribute(discard);
 +rw_attribute(label);
 +
 +rw_attribute(copy_gc_enabled);
 +read_attribute(copy_gc_wait);
 +
 +rw_attribute(rebalance_enabled);
 +sysfs_pd_controller_attribute(rebalance);
 +read_attribute(rebalance_work);
 +rw_attribute(promote_whole_extents);
 +
 +read_attribute(new_stripes);
 +
 +read_attribute(io_timers_read);
 +read_attribute(io_timers_write);
 +
 +read_attribute(moving_ctxts);
 +
 +#ifdef CONFIG_BCACHEFS_TESTS
 +write_attribute(perf_test);
 +#endif /* CONFIG_BCACHEFS_TESTS */
 +
 +#define x(_name)                                              \
 +      static struct attribute sysfs_time_stat_##_name =               \
 +              { .name = #_name, .mode = 0444 };
 +      BCH_TIME_STATS()
 +#undef x
 +
 +static struct attribute sysfs_state_rw = {
 +      .name = "state",
 +      .mode =  0444,
 +};
 +
 +static size_t bch2_btree_cache_size(struct bch_fs *c)
 +{
 +      size_t ret = 0;
 +      struct btree *b;
 +
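 +      /* every node on the live list is accounted at the full btree node size: */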
 +      mutex_lock(&c->btree_cache.lock);
 +      list_for_each_entry(b, &c->btree_cache.live, list)
 +              ret += btree_bytes(c);
 +
 +      mutex_unlock(&c->btree_cache.lock);
 +      return ret;
 +}
 +
 +static int bch2_compression_stats_to_text(struct printbuf *out, struct bch_fs *c)
 +{
 +      struct btree_trans *trans;
 +      struct btree_iter iter;
 +      struct bkey_s_c k;
 +      enum btree_id id;
 +      u64 nr_uncompressed_extents = 0,
 +          nr_compressed_extents = 0,
 +          nr_incompressible_extents = 0,
 +          uncompressed_sectors = 0,
 +          incompressible_sectors = 0,
 +          compressed_sectors_compressed = 0,
 +          compressed_sectors_uncompressed = 0;
 +      int ret = 0;
 +
 +      if (!test_bit(BCH_FS_STARTED, &c->flags))
 +              return -EPERM;
 +
 +      trans = bch2_trans_get(c);
 +
 +      for (id = 0; id < BTREE_ID_NR; id++) {
 +              if (!btree_type_has_ptrs(id))
 +                      continue;
 +
 +              for_each_btree_key(trans, iter, id, POS_MIN,
 +                                 BTREE_ITER_ALL_SNAPSHOTS, k, ret) {
 +                      struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 +                      const union bch_extent_entry *entry;
 +                      struct extent_ptr_decoded p;
 +                      bool compressed = false, uncompressed = false, incompressible = false;
 +
 +                      bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
 +                              switch (p.crc.compression_type) {
 +                              case BCH_COMPRESSION_TYPE_none:
 +                                      uncompressed = true;
 +                                      uncompressed_sectors += k.k->size;
 +                                      break;
 +                              case BCH_COMPRESSION_TYPE_incompressible:
 +                                      incompressible = true;
 +                                      incompressible_sectors += k.k->size;
 +                                      break;
 +                              default:
 +                                      compressed_sectors_compressed +=
 +                                              p.crc.compressed_size;
 +                                      compressed_sectors_uncompressed +=
 +                                              p.crc.uncompressed_size;
 +                                      compressed = true;
 +                                      break;
 +                              }
 +                      }
 +
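 +                      /*
 +                       * Count each extent once, preferring incompressible
 +                       * over uncompressed over compressed:
 +                       */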
 +                      if (incompressible)
 +                              nr_incompressible_extents++;
 +                      else if (uncompressed)
 +                              nr_uncompressed_extents++;
 +                      else if (compressed)
 +                              nr_compressed_extents++;
 +              }
 +              bch2_trans_iter_exit(trans, &iter);
 +      }
 +
 +      bch2_trans_put(trans);
 +
 +      if (ret)
 +              return ret;
 +
 +      prt_printf(out, "uncompressed:\n");
 +      prt_printf(out, "       nr extents:             %llu\n", nr_uncompressed_extents);
 +      prt_printf(out, "       size:                   ");
 +      prt_human_readable_u64(out, uncompressed_sectors << 9);
 +      prt_printf(out, "\n");
 +
 +      prt_printf(out, "compressed:\n");
 +      prt_printf(out, "       nr extents:             %llu\n", nr_compressed_extents);
 +      prt_printf(out, "       compressed size:        ");
 +      prt_human_readable_u64(out, compressed_sectors_compressed << 9);
 +      prt_printf(out, "\n");
 +      prt_printf(out, "       uncompressed size:      ");
 +      prt_human_readable_u64(out, compressed_sectors_uncompressed << 9);
 +      prt_printf(out, "\n");
 +
 +      prt_printf(out, "incompressible:\n");
 +      prt_printf(out, "       nr extents:             %llu\n", nr_incompressible_extents);
 +      prt_printf(out, "       size:                   ");
 +      prt_human_readable_u64(out, incompressible_sectors << 9);
 +      prt_printf(out, "\n");
 +      return 0;
 +}
 +
 +static void bch2_gc_gens_pos_to_text(struct printbuf *out, struct bch_fs *c)
 +{
 +      prt_printf(out, "%s: ", bch2_btree_ids[c->gc_gens_btree]);
 +      bch2_bpos_to_text(out, c->gc_gens_pos);
 +      prt_printf(out, "\n");
 +}
 +
 +static void bch2_btree_wakeup_all(struct bch_fs *c)
 +{
 +      struct btree_trans *trans;
 +
 +      seqmutex_lock(&c->btree_trans_lock);
 +      list_for_each_entry(trans, &c->btree_trans_list, list) {
 +              struct btree_bkey_cached_common *b = READ_ONCE(trans->locking);
 +
 +              if (b)
 +                      six_lock_wakeup_all(&b->lock);
 +
 +      }
 +      seqmutex_unlock(&c->btree_trans_lock);
 +}
 +
 +SHOW(bch2_fs)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, kobj);
 +
 +      sysfs_print(minor,                      c->minor);
 +      sysfs_printf(internal_uuid, "%pU",      c->sb.uuid.b);
 +
 +      sysfs_hprint(btree_cache_size,          bch2_btree_cache_size(c));
 +
 +      if (attr == &sysfs_btree_write_stats)
 +              bch2_btree_write_stats_to_text(out, c);
 +
 +      sysfs_printf(btree_gc_periodic, "%u",   (int) c->btree_gc_periodic);
 +
 +      if (attr == &sysfs_gc_gens_pos)
 +              bch2_gc_gens_pos_to_text(out, c);
 +
 +      sysfs_printf(copy_gc_enabled, "%i", c->copy_gc_enabled);
 +
 +      sysfs_printf(rebalance_enabled,         "%i", c->rebalance.enabled);
 +      sysfs_pd_controller_show(rebalance,     &c->rebalance.pd); /* XXX */
 +
 +      if (attr == &sysfs_copy_gc_wait)
 +              bch2_copygc_wait_to_text(out, c);
 +
 +      if (attr == &sysfs_rebalance_work)
 +              bch2_rebalance_work_to_text(out, c);
 +
 +      sysfs_print(promote_whole_extents,      c->promote_whole_extents);
 +
 +      /* Debugging: */
 +
 +      if (attr == &sysfs_journal_debug)
 +              bch2_journal_debug_to_text(out, &c->journal);
 +
 +      if (attr == &sysfs_btree_updates)
 +              bch2_btree_updates_to_text(out, c);
 +
 +      if (attr == &sysfs_btree_cache)
 +              bch2_btree_cache_to_text(out, c);
 +
 +      if (attr == &sysfs_btree_key_cache)
 +              bch2_btree_key_cache_to_text(out, &c->btree_key_cache);
 +
 +      if (attr == &sysfs_stripes_heap)
 +              bch2_stripes_heap_to_text(out, c);
 +
 +      if (attr == &sysfs_open_buckets)
 +              bch2_open_buckets_to_text(out, c);
 +
 +      if (attr == &sysfs_open_buckets_partial)
 +              bch2_open_buckets_partial_to_text(out, c);
 +
 +      if (attr == &sysfs_write_points)
 +              bch2_write_points_to_text(out, c);
 +
 +      if (attr == &sysfs_compression_stats)
 +              bch2_compression_stats_to_text(out, c);
 +
 +      if (attr == &sysfs_new_stripes)
 +              bch2_new_stripes_to_text(out, c);
 +
 +      if (attr == &sysfs_io_timers_read)
 +              bch2_io_timers_to_text(out, &c->io_clock[READ]);
 +
 +      if (attr == &sysfs_io_timers_write)
 +              bch2_io_timers_to_text(out, &c->io_clock[WRITE]);
 +
 +      if (attr == &sysfs_moving_ctxts)
 +              bch2_fs_moving_ctxts_to_text(out, c);
 +
 +#ifdef BCH_WRITE_REF_DEBUG
 +      if (attr == &sysfs_write_refs)
 +              bch2_write_refs_to_text(out, c);
 +#endif
 +
 +      if (attr == &sysfs_nocow_lock_table)
 +              bch2_nocow_locks_to_text(out, &c->nocow_locks);
 +
 +      if (attr == &sysfs_disk_groups)
 +              bch2_disk_groups_to_text(out, c);
 +
 +      return 0;
 +}
 +
 +STORE(bch2_fs)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, kobj);
 +
 +      if (attr == &sysfs_btree_gc_periodic) {
 +              ssize_t ret = strtoul_safe(buf, c->btree_gc_periodic)
 +                      ?: (ssize_t) size;
 +
 +              wake_up_process(c->gc_thread);
 +              return ret;
 +      }
 +
 +      if (attr == &sysfs_copy_gc_enabled) {
 +              ssize_t ret = strtoul_safe(buf, c->copy_gc_enabled)
 +                      ?: (ssize_t) size;
 +
 +              if (c->copygc_thread)
 +                      wake_up_process(c->copygc_thread);
 +              return ret;
 +      }
 +
 +      if (attr == &sysfs_rebalance_enabled) {
 +              ssize_t ret = strtoul_safe(buf, c->rebalance.enabled)
 +                      ?: (ssize_t) size;
 +
 +              rebalance_wakeup(c);
 +              return ret;
 +      }
 +
 +      sysfs_pd_controller_store(rebalance,    &c->rebalance.pd);
 +
 +      sysfs_strtoul(promote_whole_extents,    c->promote_whole_extents);
 +
 +      /* Debugging: */
 +
 +      if (!test_bit(BCH_FS_STARTED, &c->flags))
 +              return -EPERM;
 +
 +      if (!test_bit(BCH_FS_RW, &c->flags))
 +              return -EROFS;
 +
 +      if (attr == &sysfs_prune_cache) {
 +              struct shrink_control sc;
 +
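 +              /* invoke the btree node cache shrinker directly with the requested scan count: */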
 +              sc.gfp_mask = GFP_KERNEL;
 +              sc.nr_to_scan = strtoul_or_return(buf);
++              c->btree_cache.shrink->scan_objects(c->btree_cache.shrink, &sc);
 +      }
 +
 +      if (attr == &sysfs_btree_wakeup)
 +              bch2_btree_wakeup_all(c);
 +
 +      if (attr == &sysfs_trigger_gc) {
 +              /*
 +               * Full gc is currently incompatible with btree key cache:
 +               */
 +#if 0
 +              down_read(&c->state_lock);
 +              bch2_gc(c, false, false);
 +              up_read(&c->state_lock);
 +#else
 +              bch2_gc_gens(c);
 +#endif
 +      }
 +
 +      if (attr == &sysfs_trigger_discards)
 +              bch2_do_discards(c);
 +
 +      if (attr == &sysfs_trigger_invalidates)
 +              bch2_do_invalidates(c);
 +
 +#ifdef CONFIG_BCACHEFS_TESTS
 +      if (attr == &sysfs_perf_test) {
 +              char *tmp = kstrdup(buf, GFP_KERNEL), *p = tmp;
 +              char *test              = strsep(&p, " \t\n");
 +              char *nr_str            = strsep(&p, " \t\n");
 +              char *threads_str       = strsep(&p, " \t\n");
 +              unsigned threads;
 +              u64 nr;
 +              int ret = -EINVAL;
 +
 +              if (threads_str &&
 +                  !(ret = kstrtouint(threads_str, 10, &threads)) &&
 +                  !(ret = bch2_strtoull_h(nr_str, &nr)))
 +                      ret = bch2_btree_perf_test(c, test, nr, threads);
 +              kfree(tmp);
 +
 +              if (ret)
 +                      size = ret;
 +      }
 +#endif
 +      return size;
 +}
 +SYSFS_OPS(bch2_fs);
 +
 +struct attribute *bch2_fs_files[] = {
 +      &sysfs_minor,
 +      &sysfs_btree_cache_size,
 +      &sysfs_btree_write_stats,
 +
 +      &sysfs_promote_whole_extents,
 +
 +      &sysfs_compression_stats,
 +
 +#ifdef CONFIG_BCACHEFS_TESTS
 +      &sysfs_perf_test,
 +#endif
 +      NULL
 +};
 +
 +/* counters dir */
 +
 +SHOW(bch2_fs_counters)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, counters_kobj);
 +      u64 counter = 0;
 +      u64 counter_since_mount = 0;
 +
 +      printbuf_tabstop_push(out, 32);
 +
 +      #define x(t, ...) \
 +              if (attr == &sysfs_##t) {                                       \
 +                      counter             = percpu_u64_get(&c->counters[BCH_COUNTER_##t]);\
 +                      counter_since_mount = counter - c->counters_on_mount[BCH_COUNTER_##t];\
 +                      prt_printf(out, "since mount:");                                \
 +                      prt_tab(out);                                           \
 +                      prt_human_readable_u64(out, counter_since_mount);       \
 +                      prt_newline(out);                                       \
 +                                                                              \
 +                      prt_printf(out, "since filesystem creation:");          \
 +                      prt_tab(out);                                           \
 +                      prt_human_readable_u64(out, counter);                   \
 +                      prt_newline(out);                                       \
 +              }
 +      BCH_PERSISTENT_COUNTERS()
 +      #undef x
 +      return 0;
 +}
 +
 +STORE(bch2_fs_counters) {
 +      return 0;
 +}
 +
 +SYSFS_OPS(bch2_fs_counters);
 +
 +struct attribute *bch2_fs_counters_files[] = {
 +#define x(t, ...) \
 +      &sysfs_##t,
 +      BCH_PERSISTENT_COUNTERS()
 +#undef x
 +      NULL
 +};
 +/* internal dir - just a wrapper */
 +
 +SHOW(bch2_fs_internal)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, internal);
 +
 +      return bch2_fs_to_text(out, &c->kobj, attr);
 +}
 +
 +STORE(bch2_fs_internal)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, internal);
 +
 +      return bch2_fs_store(&c->kobj, attr, buf, size);
 +}
 +SYSFS_OPS(bch2_fs_internal);
 +
 +struct attribute *bch2_fs_internal_files[] = {
 +      &sysfs_journal_debug,
 +      &sysfs_btree_updates,
 +      &sysfs_btree_cache,
 +      &sysfs_btree_key_cache,
 +      &sysfs_new_stripes,
 +      &sysfs_stripes_heap,
 +      &sysfs_open_buckets,
 +      &sysfs_open_buckets_partial,
 +      &sysfs_write_points,
 +#ifdef BCH_WRITE_REF_DEBUG
 +      &sysfs_write_refs,
 +#endif
 +      &sysfs_nocow_lock_table,
 +      &sysfs_io_timers_read,
 +      &sysfs_io_timers_write,
 +
 +      &sysfs_trigger_gc,
 +      &sysfs_trigger_discards,
 +      &sysfs_trigger_invalidates,
 +      &sysfs_prune_cache,
 +      &sysfs_btree_wakeup,
 +
 +      &sysfs_gc_gens_pos,
 +
 +      &sysfs_copy_gc_enabled,
 +      &sysfs_copy_gc_wait,
 +
 +      &sysfs_rebalance_enabled,
 +      &sysfs_rebalance_work,
 +      sysfs_pd_controller_files(rebalance),
 +
 +      &sysfs_moving_ctxts,
 +
 +      &sysfs_internal_uuid,
 +
 +      &sysfs_disk_groups,
 +      NULL
 +};
 +
 +/* options */
 +
 +SHOW(bch2_fs_opts_dir)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
 +      const struct bch_option *opt = container_of(attr, struct bch_option, attr);
 +      int id = opt - bch2_opt_table;
 +      u64 v = bch2_opt_get_by_id(&c->opts, id);
 +
 +      bch2_opt_to_text(out, c, c->disk_sb.sb, opt, v, OPT_SHOW_FULL_LIST);
 +      prt_char(out, '\n');
 +
 +      return 0;
 +}
 +
 +STORE(bch2_fs_opts_dir)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
 +      const struct bch_option *opt = container_of(attr, struct bch_option, attr);
 +      int ret, id = opt - bch2_opt_table;
 +      char *tmp;
 +      u64 v;
 +
 +      /*
 +       * We don't need to take c->writes for correctness, but it eliminates an
 +       * unsightly error message in the dmesg log when we're RO:
 +       */
 +      if (unlikely(!bch2_write_ref_tryget(c, BCH_WRITE_REF_sysfs)))
 +              return -EROFS;
 +
 +      tmp = kstrdup(buf, GFP_KERNEL);
 +      if (!tmp) {
 +              ret = -ENOMEM;
 +              goto err;
 +      }
 +
 +      ret = bch2_opt_parse(c, opt, strim(tmp), &v, NULL);
 +      kfree(tmp);
 +
 +      if (ret < 0)
 +              goto err;
 +
 +      ret = bch2_opt_check_may_set(c, id, v);
 +      if (ret < 0)
 +              goto err;
 +
 +      bch2_opt_set_sb(c, opt, v);
 +      bch2_opt_set_by_id(&c->opts, id, v);
 +
 +      if ((id == Opt_background_target ||
 +           id == Opt_background_compression) && v) {
 +              bch2_rebalance_add_work(c, S64_MAX);
 +              rebalance_wakeup(c);
 +      }
 +
 +      ret = size;
 +err:
 +      bch2_write_ref_put(c, BCH_WRITE_REF_sysfs);
 +      return ret;
 +}
 +SYSFS_OPS(bch2_fs_opts_dir);
 +
 +struct attribute *bch2_fs_opts_dir_files[] = { NULL };
 +
 +int bch2_opts_create_sysfs_files(struct kobject *kobj)
 +{
 +      const struct bch_option *i;
 +      int ret;
 +
 +      for (i = bch2_opt_table;
 +           i < bch2_opt_table + bch2_opts_nr;
 +           i++) {
 +              if (!(i->flags & OPT_FS))
 +                      continue;
 +
 +              ret = sysfs_create_file(kobj, &i->attr);
 +              if (ret)
 +                      return ret;
 +      }
 +
 +      return 0;
 +}
 +
 +/* time stats */
 +
 +SHOW(bch2_fs_time_stats)
 +{
 +      struct bch_fs *c = container_of(kobj, struct bch_fs, time_stats);
 +
 +#define x(name)                                                               \
 +      if (attr == &sysfs_time_stat_##name)                            \
 +              bch2_time_stats_to_text(out, &c->times[BCH_TIME_##name]);
 +      BCH_TIME_STATS()
 +#undef x
 +
 +      return 0;
 +}
 +
 +STORE(bch2_fs_time_stats)
 +{
 +      return size;
 +}
 +SYSFS_OPS(bch2_fs_time_stats);
 +
 +struct attribute *bch2_fs_time_stats_files[] = {
 +#define x(name)                                               \
 +      &sysfs_time_stat_##name,
 +      BCH_TIME_STATS()
 +#undef x
 +      NULL
 +};
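BCH_TIME_STATS() above is a classic x-macro: one list expands into both the SHOW() dispatch and the NULL-terminated attribute array. A standalone illustration of the technique with a hypothetical FOO_STATS() list (nothing below is bcachefs API):

#define FOO_STATS()	\
	x(reads)	\
	x(writes)

/* the same list generates an enum ... */
enum foo_stat {
#define x(name) FOO_STAT_##name,
	FOO_STATS()
#undef x
	FOO_STAT_NR
};

/* ... and a matching table of names */
static const char * const foo_stat_names[] = {
#define x(name) #name,
	FOO_STATS()
#undef x
	NULL
};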
 +
 +static void dev_alloc_debug_to_text(struct printbuf *out, struct bch_dev *ca)
 +{
 +      struct bch_fs *c = ca->fs;
 +      struct bch_dev_usage stats = bch2_dev_usage_read(ca);
 +      unsigned i, nr[BCH_DATA_NR];
 +
 +      memset(nr, 0, sizeof(nr));
 +
 +      for (i = 0; i < ARRAY_SIZE(c->open_buckets); i++)
 +              nr[c->open_buckets[i].data_type]++;
 +
 +      printbuf_tabstop_push(out, 8);
 +      printbuf_tabstop_push(out, 16);
 +      printbuf_tabstop_push(out, 16);
 +      printbuf_tabstop_push(out, 16);
 +      printbuf_tabstop_push(out, 16);
 +
 +      prt_tab(out);
 +      prt_str(out, "buckets");
 +      prt_tab_rjust(out);
 +      prt_str(out, "sectors");
 +      prt_tab_rjust(out);
 +      prt_str(out, "fragmented");
 +      prt_tab_rjust(out);
 +      prt_newline(out);
 +
 +      for (i = 0; i < BCH_DATA_NR; i++) {
 +              prt_str(out, bch2_data_types[i]);
 +              prt_tab(out);
 +              prt_u64(out, stats.d[i].buckets);
 +              prt_tab_rjust(out);
 +              prt_u64(out, stats.d[i].sectors);
 +              prt_tab_rjust(out);
 +              prt_u64(out, stats.d[i].fragmented);
 +              prt_tab_rjust(out);
 +              prt_newline(out);
 +      }
 +
 +      prt_str(out, "ec");
 +      prt_tab(out);
 +      prt_u64(out, stats.buckets_ec);
 +      prt_tab_rjust(out);
 +      prt_newline(out);
 +
 +      prt_newline(out);
 +
 +      prt_printf(out, "reserves:");
 +      prt_newline(out);
 +      for (i = 0; i < BCH_WATERMARK_NR; i++) {
 +              prt_str(out, bch2_watermarks[i]);
 +              prt_tab(out);
 +              prt_u64(out, bch2_dev_buckets_reserved(ca, i));
 +              prt_tab_rjust(out);
 +              prt_newline(out);
 +      }
 +
 +      prt_newline(out);
 +
 +      printbuf_tabstops_reset(out);
 +      printbuf_tabstop_push(out, 24);
 +
 +      prt_str(out, "freelist_wait");
 +      prt_tab(out);
 +      prt_str(out, c->freelist_wait.list.first ? "waiting" : "empty");
 +      prt_newline(out);
 +
 +      prt_str(out, "open buckets allocated");
 +      prt_tab(out);
 +      prt_u64(out, OPEN_BUCKETS_COUNT - c->open_buckets_nr_free);
 +      prt_newline(out);
 +
 +      prt_str(out, "open buckets this dev");
 +      prt_tab(out);
 +      prt_u64(out, ca->nr_open_buckets);
 +      prt_newline(out);
 +
 +      prt_str(out, "open buckets total");
 +      prt_tab(out);
 +      prt_u64(out, OPEN_BUCKETS_COUNT);
 +      prt_newline(out);
 +
 +      prt_str(out, "open_buckets_wait");
 +      prt_tab(out);
 +      prt_str(out, c->open_buckets_wait.list.first ? "waiting" : "empty");
 +      prt_newline(out);
 +
 +      prt_str(out, "open_buckets_btree");
 +      prt_tab(out);
 +      prt_u64(out, nr[BCH_DATA_btree]);
 +      prt_newline(out);
 +
 +      prt_str(out, "open_buckets_user");
 +      prt_tab(out);
 +      prt_u64(out, nr[BCH_DATA_user]);
 +      prt_newline(out);
 +
 +      prt_str(out, "buckets_to_invalidate");
 +      prt_tab(out);
 +      prt_u64(out, should_invalidate_buckets(ca, stats));
 +      prt_newline(out);
 +
 +      prt_str(out, "btree reserve cache");
 +      prt_tab(out);
 +      prt_u64(out, c->btree_reserve_cache_nr);
 +      prt_newline(out);
 +}
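As read from the calls above, printbuf tab stops behave like column widths: prt_tab() advances to the next stop and prt_tab_rjust() right-justifies whatever was emitted since the previous one. A tiny sketch under that reading (function name and values are made up):

static void example_two_column_row(struct printbuf *out)
{
	printbuf_tabstop_push(out, 12);		/* label column */
	printbuf_tabstop_push(out, 10);		/* numeric column */

	prt_str(out, "buckets");
	prt_tab(out);				/* jump to the next stop */
	prt_u64(out, 42);
	prt_tab_rjust(out);			/* right-justify the number */
	prt_newline(out);
}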
 +
 +static const char * const bch2_rw[] = {
 +      "read",
 +      "write",
 +      NULL
 +};
 +
 +static void dev_iodone_to_text(struct printbuf *out, struct bch_dev *ca)
 +{
 +      int rw, i;
 +
 +      for (rw = 0; rw < 2; rw++) {
 +              prt_printf(out, "%s:\n", bch2_rw[rw]);
 +
 +              for (i = 1; i < BCH_DATA_NR; i++)
 +                      prt_printf(out, "%-12s:%12llu\n",
 +                             bch2_data_types[i],
 +                             percpu_u64_get(&ca->io_done->sectors[rw][i]) << 9);
 +      }
 +}
 +
 +SHOW(bch2_dev)
 +{
 +      struct bch_dev *ca = container_of(kobj, struct bch_dev, kobj);
 +      struct bch_fs *c = ca->fs;
 +
 +      sysfs_printf(uuid,              "%pU\n", ca->uuid.b);
 +
 +      sysfs_print(bucket_size,        bucket_bytes(ca));
 +      sysfs_print(first_bucket,       ca->mi.first_bucket);
 +      sysfs_print(nbuckets,           ca->mi.nbuckets);
 +      sysfs_print(durability,         ca->mi.durability);
 +      sysfs_print(discard,            ca->mi.discard);
 +
 +      if (attr == &sysfs_label) {
 +              if (ca->mi.group) {
 +                      mutex_lock(&c->sb_lock);
 +                      bch2_disk_path_to_text(out, c->disk_sb.sb,
 +                                             ca->mi.group - 1);
 +                      mutex_unlock(&c->sb_lock);
 +              }
 +
 +              prt_char(out, '\n');
 +      }
 +
 +      if (attr == &sysfs_has_data) {
 +              prt_bitflags(out, bch2_data_types, bch2_dev_has_data(c, ca));
 +              prt_char(out, '\n');
 +      }
 +
 +      if (attr == &sysfs_state_rw) {
 +              prt_string_option(out, bch2_member_states, ca->mi.state);
 +              prt_char(out, '\n');
 +      }
 +
 +      if (attr == &sysfs_iodone)
 +              dev_iodone_to_text(out, ca);
 +
 +      sysfs_print(io_latency_read,            atomic64_read(&ca->cur_latency[READ]));
 +      sysfs_print(io_latency_write,           atomic64_read(&ca->cur_latency[WRITE]));
 +
 +      if (attr == &sysfs_io_latency_stats_read)
 +              bch2_time_stats_to_text(out, &ca->io_latency[READ]);
 +
 +      if (attr == &sysfs_io_latency_stats_write)
 +              bch2_time_stats_to_text(out, &ca->io_latency[WRITE]);
 +
 +      sysfs_printf(congested,                 "%u%%",
 +                   clamp(atomic_read(&ca->congested), 0, CONGESTED_MAX)
 +                   * 100 / CONGESTED_MAX);
 +
 +      if (attr == &sysfs_alloc_debug)
 +              dev_alloc_debug_to_text(out, ca);
 +
 +      return 0;
 +}
 +
 +STORE(bch2_dev)
 +{
 +      struct bch_dev *ca = container_of(kobj, struct bch_dev, kobj);
 +      struct bch_fs *c = ca->fs;
 +      struct bch_member *mi;
 +
 +      if (attr == &sysfs_discard) {
 +              bool v = strtoul_or_return(buf);
 +
 +              mutex_lock(&c->sb_lock);
 +              mi = bch2_members_v2_get_mut(c->disk_sb.sb, ca->dev_idx);
 +
 +              if (v != BCH_MEMBER_DISCARD(mi)) {
 +                      SET_BCH_MEMBER_DISCARD(mi, v);
 +                      bch2_write_super(c);
 +              }
 +              mutex_unlock(&c->sb_lock);
 +      }
 +
 +      if (attr == &sysfs_durability) {
 +              u64 v = strtoul_or_return(buf);
 +
 +              mutex_lock(&c->sb_lock);
 +              mi = bch2_members_v2_get_mut(c->disk_sb.sb, ca->dev_idx);
 +
 +              if (v + 1 != BCH_MEMBER_DURABILITY(mi)) {
 +                      SET_BCH_MEMBER_DURABILITY(mi, v + 1);
 +                      bch2_write_super(c);
 +              }
 +              mutex_unlock(&c->sb_lock);
 +      }
 +
 +      if (attr == &sysfs_label) {
 +              char *tmp;
 +              int ret;
 +
 +              tmp = kstrdup(buf, GFP_KERNEL);
 +              if (!tmp)
 +                      return -ENOMEM;
 +
 +              ret = bch2_dev_group_set(c, ca, strim(tmp));
 +              kfree(tmp);
 +              if (ret)
 +                      return ret;
 +      }
 +
 +      return size;
 +}
 +SYSFS_OPS(bch2_dev);
 +
 +struct attribute *bch2_dev_files[] = {
 +      &sysfs_uuid,
 +      &sysfs_bucket_size,
 +      &sysfs_first_bucket,
 +      &sysfs_nbuckets,
 +      &sysfs_durability,
 +
 +      /* settings: */
 +      &sysfs_discard,
 +      &sysfs_state_rw,
 +      &sysfs_label,
 +
 +      &sysfs_has_data,
 +      &sysfs_iodone,
 +
 +      &sysfs_io_latency_read,
 +      &sysfs_io_latency_write,
 +      &sysfs_io_latency_stats_read,
 +      &sysfs_io_latency_stats_write,
 +      &sysfs_congested,
 +
 +      /* debug: */
 +      &sysfs_alloc_debug,
 +      NULL
 +};
 +
 +#endif  /* _BCACHEFS_SYSFS_H_ */
diff --combined fs/btrfs/super.c
index 6ecf78d09694342f41857f83bbfd7177efeac616,b1798bed68f2d682ffc2bf8061881c257c356ff0..f638dc339693bc1a65c1d5637d58b998ad5b95d4
@@@ -26,7 -26,6 +26,7 @@@
  #include <linux/ratelimit.h>
  #include <linux/crc32c.h>
  #include <linux/btrfs.h>
 +#include <linux/security.h>
  #include "messages.h"
  #include "delayed-inode.h"
  #include "ctree.h"
@@@ -130,6 -129,9 +130,6 @@@ enum 
        Opt_inode_cache, Opt_noinode_cache,
  
        /* Debugging options */
 -      Opt_check_integrity,
 -      Opt_check_integrity_including_extent_data,
 -      Opt_check_integrity_print_mask,
        Opt_enospc_debug, Opt_noenospc_debug,
  #ifdef CONFIG_BTRFS_DEBUG
        Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all,
@@@ -198,6 -200,9 +198,6 @@@ static const match_table_t tokens = 
        {Opt_recovery, "recovery"},
  
        /* Debugging options */
 -      {Opt_check_integrity, "check_int"},
 -      {Opt_check_integrity_including_extent_data, "check_int_data"},
 -      {Opt_check_integrity_print_mask, "check_int_print_mask=%u"},
        {Opt_enospc_debug, "enospc_debug"},
        {Opt_noenospc_debug, "noenospc_debug"},
  #ifdef CONFIG_BTRFS_DEBUG
@@@ -702,6 -707,44 +702,6 @@@ int btrfs_parse_options(struct btrfs_fs
                case Opt_skip_balance:
                        btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
                        break;
 -#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 -              case Opt_check_integrity_including_extent_data:
 -                      btrfs_warn(info,
 -      "integrity checker is deprecated and will be removed in 6.7");
 -                      btrfs_info(info,
 -                                 "enabling check integrity including extent data");
 -                      btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY_DATA);
 -                      btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY);
 -                      break;
 -              case Opt_check_integrity:
 -                      btrfs_warn(info,
 -      "integrity checker is deprecated and will be removed in 6.7");
 -                      btrfs_info(info, "enabling check integrity");
 -                      btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY);
 -                      break;
 -              case Opt_check_integrity_print_mask:
 -                      ret = match_int(&args[0], &intarg);
 -                      if (ret) {
 -                              btrfs_err(info,
 -                              "unrecognized check_integrity_print_mask value %s",
 -                                      args[0].from);
 -                              goto out;
 -                      }
 -                      info->check_integrity_print_mask = intarg;
 -                      btrfs_warn(info,
 -      "integrity checker is deprecated and will be removed in 6.7");
 -                      btrfs_info(info, "check_integrity_print_mask 0x%x",
 -                                 info->check_integrity_print_mask);
 -                      break;
 -#else
 -              case Opt_check_integrity_including_extent_data:
 -              case Opt_check_integrity:
 -              case Opt_check_integrity_print_mask:
 -                      btrfs_err(info,
 -                                "support for check_integrity* not compiled in!");
 -                      ret = -EINVAL;
 -                      goto out;
 -#endif
                case Opt_fatal_errors:
                        if (strcmp(args[0].from, "panic") == 0) {
                                btrfs_set_opt(info->mount_opt,
@@@ -846,7 -889,7 +846,7 @@@ static int btrfs_parse_device_options(c
                                error = -ENOMEM;
                                goto out;
                        }
 -                      device = btrfs_scan_one_device(device_name, flags);
 +                      device = btrfs_scan_one_device(device_name, flags, false);
                        kfree(device_name);
                        if (IS_ERR(device)) {
                                error = PTR_ERR(device);
@@@ -1262,6 -1305,15 +1262,6 @@@ static int btrfs_show_options(struct se
                seq_puts(seq, ",autodefrag");
        if (btrfs_test_opt(info, SKIP_BALANCE))
                seq_puts(seq, ",skip_balance");
 -#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 -      if (btrfs_test_opt(info, CHECK_INTEGRITY_DATA))
 -              seq_puts(seq, ",check_int_data");
 -      else if (btrfs_test_opt(info, CHECK_INTEGRITY))
 -              seq_puts(seq, ",check_int");
 -      if (info->check_integrity_print_mask)
 -              seq_printf(seq, ",check_int_print_mask=%d",
 -                              info->check_integrity_print_mask);
 -#endif
        if (info->metadata_ratio)
                seq_printf(seq, ",metadata_ratio=%u", info->metadata_ratio);
        if (btrfs_test_opt(info, PANIC_ON_FATAL_ERROR))
@@@ -1432,12 -1484,7 +1432,12 @@@ static struct dentry *btrfs_mount_root(
                goto error_fs_info;
        }
  
 -      device = btrfs_scan_one_device(device_name, mode);
 +      /*
 +       * With 'true' passed to btrfs_scan_one_device() (mount time) we expect
 +       * either a valid device or an error.
 +       */
 +      device = btrfs_scan_one_device(device_name, mode, true);
 +      ASSERT(device != NULL);
        if (IS_ERR(device)) {
                mutex_unlock(&uuid_mutex);
                error = PTR_ERR(device);
                        error = -EBUSY;
        } else {
                snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
-               shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s", fs_type->name,
+               shrinker_debugfs_rename(s->s_shrink, "sb-%s:%s", fs_type->name,
                                        s->s_id);
                btrfs_sb(s)->bdev_holder = fs_type;
                error = btrfs_fill_super(s, fs_devices, data);
@@@ -2149,11 -2196,7 +2149,11 @@@ static long btrfs_control_ioctl(struct 
        switch (cmd) {
        case BTRFS_IOC_SCAN_DEV:
                mutex_lock(&uuid_mutex);
 -              device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ);
 +              /*
 +               * Scanning outside of mount can return NULL which would turn
 +               * into 0 error code.
 +               */
 +              device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ, false);
                ret = PTR_ERR_OR_ZERO(device);
                mutex_unlock(&uuid_mutex);
                break;
                break;
        case BTRFS_IOC_DEVICES_READY:
                mutex_lock(&uuid_mutex);
 -              device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ);
 -              if (IS_ERR(device)) {
 +              /*
 +               * Scanning outside of mount can return NULL which would turn
 +               * into 0 error code.
 +               */
 +              device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ, false);
 +              if (IS_ERR_OR_NULL(device)) {
                        mutex_unlock(&uuid_mutex);
                        ret = PTR_ERR(device);
                        break;
@@@ -2217,7 -2256,6 +2217,7 @@@ static int check_dev_super(struct btrfs
  {
        struct btrfs_fs_info *fs_info = dev->fs_info;
        struct btrfs_super_block *sb;
 +      u64 last_trans;
        u16 csum_type;
        int ret = 0;
  
        if (ret < 0)
                goto out;
  
 -      if (btrfs_super_generation(sb) != fs_info->last_trans_committed) {
 +      last_trans = btrfs_get_last_trans_committed(fs_info);
 +      if (btrfs_super_generation(sb) != last_trans) {
                btrfs_err(fs_info, "transid mismatch, has %llu expect %llu",
 -                      btrfs_super_generation(sb),
 -                      fs_info->last_trans_committed);
 +                        btrfs_super_generation(sb), last_trans);
                ret = -EUCLEAN;
                goto out;
        }
@@@ -2366,6 -2404,9 +2366,6 @@@ static int __init btrfs_print_mod_info(
  #ifdef CONFIG_BTRFS_ASSERT
                        ", assert=on"
  #endif
 -#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 -                      ", integrity-checker=on"
 -#endif
  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
                        ", ref-verify=on"
  #endif
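The bool argument threaded through btrfs_scan_one_device() above changes the caller contract: at mount time (true) the result is either a device or an ERR_PTR(), while outside of mount (false) a NULL return is legitimate and must not be folded into an error, hence the IS_ERR_OR_NULL()/PTR_ERR_OR_ZERO() checks. A hedged sketch of a non-mount caller (the helper name and path argument are illustrative; it would sit inside fs/btrfs where uuid_mutex is visible):

static int example_scan_device(const char *path)
{
	struct btrfs_device *device;

	mutex_lock(&uuid_mutex);
	/* false: not called from mount, so NULL just means "nothing new" */
	device = btrfs_scan_one_device(path, BLK_OPEN_READ, false);
	mutex_unlock(&uuid_mutex);

	if (IS_ERR(device))
		return PTR_ERR(device);		/* a real error */
	return 0;				/* NULL or a device: success */
}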
diff --combined fs/erofs/utils.c
index 4256a85719a1d25fbe3f0aa33820fafe3ad01d45,e9c25cd7b601e1e8af9469b08d6e7abf7bb4163b..5dea308764b45038f8236bf31b004067f0f297a6
@@@ -77,7 -77,12 +77,7 @@@ struct erofs_workgroup *erofs_insert_wo
        struct erofs_sb_info *const sbi = EROFS_SB(sb);
        struct erofs_workgroup *pre;
  
 -      /*
 -       * Bump up before making this visible to others for the XArray in order
 -       * to avoid potential UAF without serialized by xa_lock.
 -       */
 -      lockref_get(&grp->lockref);
 -
 +      DBG_BUGON(grp->lockref.count < 1);
  repeat:
        xa_lock(&sbi->managed_pslots);
        pre = __xa_cmpxchg(&sbi->managed_pslots, grp->index,
@@@ -91,6 -96,7 +91,6 @@@
                        cond_resched();
                        goto repeat;
                }
 -              lockref_put_return(&grp->lockref);
                grp = pre;
        }
        xa_unlock(&sbi->managed_pslots);
@@@ -264,19 -270,24 +264,24 @@@ static unsigned long erofs_shrink_scan(
        return freed;
  }
  
- static struct shrinker erofs_shrinker_info = {
-       .scan_objects = erofs_shrink_scan,
-       .count_objects = erofs_shrink_count,
-       .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *erofs_shrinker_info;
  
  int __init erofs_init_shrinker(void)
  {
-       return register_shrinker(&erofs_shrinker_info, "erofs-shrinker");
+       erofs_shrinker_info = shrinker_alloc(0, "erofs-shrinker");
+       if (!erofs_shrinker_info)
+               return -ENOMEM;
+       erofs_shrinker_info->count_objects = erofs_shrink_count;
+       erofs_shrinker_info->scan_objects = erofs_shrink_scan;
+       shrinker_register(erofs_shrinker_info);
+       return 0;
  }
  
  void erofs_exit_shrinker(void)
  {
-       unregister_shrinker(&erofs_shrinker_info);
+       shrinker_free(erofs_shrinker_info);
  }
  #endif        /* !CONFIG_EROFS_FS_ZIP */
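The erofs hunk above shows the general shape of the new shrinker API: allocate, fill in the callbacks, register, and later free. A minimal sketch for a hypothetical "foo" subsystem (only shrinker_alloc()/shrinker_register()/shrinker_free() and the callback signatures are taken from the code above; the callbacks here deliberately do nothing):

#include <linux/errno.h>
#include <linux/shrinker.h>

static struct shrinker *foo_shrinker;

static unsigned long foo_count(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	return 0;			/* nothing reclaimable in this sketch */
}

static unsigned long foo_scan(struct shrinker *shrink,
			      struct shrink_control *sc)
{
	return SHRINK_STOP;
}

static int foo_init_shrinker(void)
{
	foo_shrinker = shrinker_alloc(0, "foo-shrinker");
	if (!foo_shrinker)
		return -ENOMEM;

	foo_shrinker->count_objects = foo_count;
	foo_shrinker->scan_objects = foo_scan;

	/* only becomes visible to reclaim once fully set up */
	shrinker_register(foo_shrinker);
	return 0;
}

static void foo_exit_shrinker(void)
{
	shrinker_free(foo_shrinker);	/* replaces unregister_shrinker() */
}

The ext4 hunk further down uses the same calls plus ->private_data, which replaces the old container_of() trick for getting from the shrinker back to its owner.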
diff --combined fs/ext4/ext4.h
index f16aa375c02ba8d4d99e450af5378d3c5605f966,8eeff770992cebb304c86ba2875986fd8bbdbefd..a5d784872303ddb6731f2bf0f8579170809b36fd
@@@ -891,13 -891,10 +891,13 @@@ do {                                                                            
                (raw_inode)->xtime = cpu_to_le32(clamp_t(int32_t, (ts).tv_sec, S32_MIN, S32_MAX));      \
  } while (0)
  
 -#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)                         \
 -      EXT4_INODE_SET_XTIME_VAL(xtime, inode, raw_inode, (inode)->xtime)
 +#define EXT4_INODE_SET_ATIME(inode, raw_inode)                                                \
 +      EXT4_INODE_SET_XTIME_VAL(i_atime, inode, raw_inode, inode_get_atime(inode))
  
 -#define EXT4_INODE_SET_CTIME(inode, raw_inode)                                        \
 +#define EXT4_INODE_SET_MTIME(inode, raw_inode)                                                \
 +      EXT4_INODE_SET_XTIME_VAL(i_mtime, inode, raw_inode, inode_get_mtime(inode))
 +
 +#define EXT4_INODE_SET_CTIME(inode, raw_inode)                                                \
        EXT4_INODE_SET_XTIME_VAL(i_ctime, inode, raw_inode, inode_get_ctime(inode))
  
  #define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)                               \
                        .tv_sec = (signed)le32_to_cpu((raw_inode)->xtime)       \
                })
  
 -#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)                         \
 +#define EXT4_INODE_GET_ATIME(inode, raw_inode)                                        \
 +do {                                                                          \
 +      inode_set_atime_to_ts(inode,                                            \
 +              EXT4_INODE_GET_XTIME_VAL(i_atime, inode, raw_inode));           \
 +} while (0)
 +
 +#define EXT4_INODE_GET_MTIME(inode, raw_inode)                                        \
  do {                                                                          \
 -      (inode)->xtime = EXT4_INODE_GET_XTIME_VAL(xtime, inode, raw_inode);     \
 +      inode_set_mtime_to_ts(inode,                                            \
 +              EXT4_INODE_GET_XTIME_VAL(i_mtime, inode, raw_inode));           \
  } while (0)
  
  #define EXT4_INODE_GET_CTIME(inode, raw_inode)                                        \
@@@ -1504,7 -1494,6 +1504,7 @@@ struct ext4_sb_info 
        loff_t s_bitmap_maxbytes;       /* max bytes for bitmap files */
        struct buffer_head * s_sbh;     /* Buffer containing the super block */
        struct ext4_super_block *s_es;  /* Pointer to the super block in the buffer */
 +      /* Array of bh's for the block group descriptors */
        struct buffer_head * __rcu *s_group_desc;
        unsigned int s_mount_opt;
        unsigned int s_mount_opt2;
        unsigned long s_commit_interval;
        u32 s_max_batch_time;
        u32 s_min_batch_time;
 -      struct block_device *s_journal_bdev;
 +      struct bdev_handle *s_journal_bdev_handle;
  #ifdef CONFIG_QUOTA
        /* Names of quota files with journalled quota */
        char __rcu *s_qf_names[EXT4_MAXQUOTAS];
        unsigned int *s_mb_maxs;
        unsigned int s_group_info_size;
        unsigned int s_mb_free_pending;
 -      struct list_head s_freed_data_list;     /* List of blocks to be freed
 +      struct list_head s_freed_data_list[2];  /* List of blocks to be freed
                                                   after commit completed */
        struct list_head s_discard_list;
        struct work_struct s_discard_work;
        __u32 s_csum_seed;
  
        /* Reclaim extents from extent status tree */
-       struct shrinker s_es_shrinker;
+       struct shrinker *s_es_shrinker;
        struct list_head s_es_list;     /* List of inodes with reclaimable extents */
        long s_es_nr_inode;
        struct ext4_es_stats s_es_stats;
  
        /*
         * Barrier between writepages ops and changing any inode's JOURNAL_DATA
 -       * or EXTENTS flag.
 +       * or EXTENTS flag or between writepages ops and changing DELALLOC or
 +       * DIOREAD_NOLOCK mount options on remount.
         */
        struct percpu_rw_semaphore s_writepages_rwsem;
        struct dax_device *s_daxdev;
@@@ -2936,7 -2924,7 +2936,7 @@@ extern int ext4_group_add_blocks(handle
  extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
  extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
  extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
 -                     int len, int state);
 +                          int len, bool state);
  static inline bool ext4_mb_cr_expensive(enum criteria cr)
  {
        return cr >= CR_GOAL_LEN_SLOW;
diff --combined fs/ext4/extents_status.c
index f4b50652f0ccea9fec831a46677958ee855188fc,deec7d1f4e50deff50fde3a2456f05f9b5c03130..4a00e2f019d932c8652eada2968ec68d87ac46ad
@@@ -152,9 -152,8 +152,9 @@@ static int __es_remove_extent(struct in
  static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
  static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
                       struct ext4_inode_info *locked_ei);
 -static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 -                           ext4_lblk_t len);
 +static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 +                          ext4_lblk_t len,
 +                          struct pending_reservation **prealloc);
  
  int __init ext4_init_es(void)
  {
@@@ -449,19 -448,6 +449,19 @@@ static void ext4_es_list_del(struct ino
        spin_unlock(&sbi->s_es_lock);
  }
  
 +static inline struct pending_reservation *__alloc_pending(bool nofail)
 +{
 +      if (!nofail)
 +              return kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
 +
 +      return kmem_cache_zalloc(ext4_pending_cachep, GFP_KERNEL | __GFP_NOFAIL);
 +}
 +
 +static inline void __free_pending(struct pending_reservation *pr)
 +{
 +      kmem_cache_free(ext4_pending_cachep, pr);
 +}
 +
  /*
   * Returns true if we cannot fail to allocate memory for this extent_status
   * entry and cannot reclaim it until its status changes.
@@@ -850,12 -836,11 +850,12 @@@ void ext4_es_insert_extent(struct inod
  {
        struct extent_status newes;
        ext4_lblk_t end = lblk + len - 1;
 -      int err1 = 0;
 -      int err2 = 0;
 +      int err1 = 0, err2 = 0, err3 = 0;
        struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
        struct extent_status *es1 = NULL;
        struct extent_status *es2 = NULL;
 +      struct pending_reservation *pr = NULL;
 +      bool revise_pending = false;
  
        if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
                return;
  
        ext4_es_insert_extent_check(inode, &newes);
  
 +      revise_pending = sbi->s_cluster_ratio > 1 &&
 +                       test_opt(inode->i_sb, DELALLOC) &&
 +                       (status & (EXTENT_STATUS_WRITTEN |
 +                                  EXTENT_STATUS_UNWRITTEN));
  retry:
        if (err1 && !es1)
                es1 = __es_alloc_extent(true);
        if ((err1 || err2) && !es2)
                es2 = __es_alloc_extent(true);
 +      if ((err1 || err2 || err3) && revise_pending && !pr)
 +              pr = __alloc_pending(true);
        write_lock(&EXT4_I(inode)->i_es_lock);
  
        err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
                es2 = NULL;
        }
  
 -      if (sbi->s_cluster_ratio > 1 && test_opt(inode->i_sb, DELALLOC) &&
 -          (status & EXTENT_STATUS_WRITTEN ||
 -           status & EXTENT_STATUS_UNWRITTEN))
 -              __revise_pending(inode, lblk, len);
 +      if (revise_pending) {
 +              err3 = __revise_pending(inode, lblk, len, &pr);
 +              if (err3 != 0)
 +                      goto error;
 +              if (pr) {
 +                      __free_pending(pr);
 +                      pr = NULL;
 +              }
 +      }
  error:
        write_unlock(&EXT4_I(inode)->i_es_lock);
 -      if (err1 || err2)
 +      if (err1 || err2 || err3)
                goto retry;
  
        ext4_es_print_tree(inode);
@@@ -1337,7 -1311,7 +1337,7 @@@ static unsigned int get_rsvd(struct ino
                                rc->ndelonly--;
                                node = rb_next(&pr->rb_node);
                                rb_erase(&pr->rb_node, &tree->root);
 -                              kmem_cache_free(ext4_pending_cachep, pr);
 +                              __free_pending(pr);
                                if (!node)
                                        break;
                                pr = rb_entry(node, struct pending_reservation,
@@@ -1431,8 -1405,8 +1431,8 @@@ static int __es_remove_extent(struct in
                        }
                }
                if (count_reserved)
 -                      count_rsvd(inode, lblk, orig_es.es_len - len1 - len2,
 -                                 &orig_es, &rc);
 +                      count_rsvd(inode, orig_es.es_lblk + len1,
 +                                 orig_es.es_len - len1 - len2, &orig_es, &rc);
                goto out_get_reserved;
        }
  
@@@ -1632,7 -1606,7 +1632,7 @@@ static unsigned long ext4_es_count(stru
        unsigned long nr;
        struct ext4_sb_info *sbi;
  
-       sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
+       sbi = shrink->private_data;
        nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
        trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
        return nr;
  static unsigned long ext4_es_scan(struct shrinker *shrink,
                                  struct shrink_control *sc)
  {
-       struct ext4_sb_info *sbi = container_of(shrink,
-                                       struct ext4_sb_info, s_es_shrinker);
+       struct ext4_sb_info *sbi = shrink->private_data;
        int nr_to_scan = sc->nr_to_scan;
        int ret, nr_shrunk;
  
@@@ -1726,13 -1699,17 +1725,17 @@@ int ext4_es_register_shrinker(struct ex
        if (err)
                goto err3;
  
-       sbi->s_es_shrinker.scan_objects = ext4_es_scan;
-       sbi->s_es_shrinker.count_objects = ext4_es_count;
-       sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
-       err = register_shrinker(&sbi->s_es_shrinker, "ext4-es:%s",
-                               sbi->s_sb->s_id);
-       if (err)
+       sbi->s_es_shrinker = shrinker_alloc(0, "ext4-es:%s", sbi->s_sb->s_id);
+       if (!sbi->s_es_shrinker) {
+               err = -ENOMEM;
                goto err4;
+       }
+       sbi->s_es_shrinker->scan_objects = ext4_es_scan;
+       sbi->s_es_shrinker->count_objects = ext4_es_count;
+       sbi->s_es_shrinker->private_data = sbi;
+       shrinker_register(sbi->s_es_shrinker);
  
        return 0;
  err4:
@@@ -1752,7 -1729,7 +1755,7 @@@ void ext4_es_unregister_shrinker(struc
        percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
        percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
        percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
-       unregister_shrinker(&sbi->s_es_shrinker);
+       shrinker_free(sbi->s_es_shrinker);
  }
  
  /*
@@@ -1933,13 -1910,11 +1936,13 @@@ static struct pending_reservation *__ge
   *
   * @inode - file containing the cluster
   * @lblk - logical block in the cluster to be added
 + * @prealloc - preallocated pending entry
   *
   * Returns 0 on successful insertion and -ENOMEM on failure.  If the
   * pending reservation is already in the set, returns successfully.
   */
 -static int __insert_pending(struct inode *inode, ext4_lblk_t lblk)
 +static int __insert_pending(struct inode *inode, ext4_lblk_t lblk,
 +                          struct pending_reservation **prealloc)
  {
        struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
        struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree;
                }
        }
  
 -      pr = kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
 -      if (pr == NULL) {
 -              ret = -ENOMEM;
 -              goto out;
 +      if (likely(*prealloc == NULL)) {
 +              pr = __alloc_pending(false);
 +              if (!pr) {
 +                      ret = -ENOMEM;
 +                      goto out;
 +              }
 +      } else {
 +              pr = *prealloc;
 +              *prealloc = NULL;
        }
        pr->lclu = lclu;
  
@@@ -2003,7 -1973,7 +2006,7 @@@ static void __remove_pending(struct ino
        if (pr != NULL) {
                tree = &EXT4_I(inode)->i_pending_tree;
                rb_erase(&pr->rb_node, &tree->root);
 -              kmem_cache_free(ext4_pending_cachep, pr);
 +              __free_pending(pr);
        }
  }
  
@@@ -2062,10 -2032,10 +2065,10 @@@ void ext4_es_insert_delayed_block(struc
                                  bool allocated)
  {
        struct extent_status newes;
 -      int err1 = 0;
 -      int err2 = 0;
 +      int err1 = 0, err2 = 0, err3 = 0;
        struct extent_status *es1 = NULL;
        struct extent_status *es2 = NULL;
 +      struct pending_reservation *pr = NULL;
  
        if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
                return;
@@@ -2085,8 -2055,6 +2088,8 @@@ retry
                es1 = __es_alloc_extent(true);
        if ((err1 || err2) && !es2)
                es2 = __es_alloc_extent(true);
 +      if ((err1 || err2 || err3) && allocated && !pr)
 +              pr = __alloc_pending(true);
        write_lock(&EXT4_I(inode)->i_es_lock);
  
        err1 = __es_remove_extent(inode, lblk, lblk, NULL, es1);
                es2 = NULL;
        }
  
 -      if (allocated)
 -              __insert_pending(inode, lblk);
 +      if (allocated) {
 +              err3 = __insert_pending(inode, lblk, &pr);
 +              if (err3 != 0)
 +                      goto error;
 +              if (pr) {
 +                      __free_pending(pr);
 +                      pr = NULL;
 +              }
 +      }
  error:
        write_unlock(&EXT4_I(inode)->i_es_lock);
 -      if (err1 || err2)
 +      if (err1 || err2 || err3)
                goto retry;
  
        ext4_es_print_tree(inode);
@@@ -2226,24 -2187,21 +2229,24 @@@ unsigned int ext4_es_delayed_clu(struc
   * @inode - file containing the range
   * @lblk - logical block defining the start of range
   * @len  - length of range in blocks
 + * @prealloc - preallocated pending entry
   *
   * Used after a newly allocated extent is added to the extents status tree.
   * Requires that the extents in the range have either written or unwritten
   * status.  Must be called while holding i_es_lock.
   */
 -static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 -                           ext4_lblk_t len)
 +static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
 +                          ext4_lblk_t len,
 +                          struct pending_reservation **prealloc)
  {
        struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
        ext4_lblk_t end = lblk + len - 1;
        ext4_lblk_t first, last;
        bool f_del = false, l_del = false;
 +      int ret = 0;
  
        if (len == 0)
 -              return;
 +              return 0;
  
        /*
         * Two cases - block range within single cluster and block range
                        f_del = __es_scan_range(inode, &ext4_es_is_delonly,
                                                first, lblk - 1);
                if (f_del) {
 -                      __insert_pending(inode, first);
 +                      ret = __insert_pending(inode, first, prealloc);
 +                      if (ret < 0)
 +                              goto out;
                } else {
                        last = EXT4_LBLK_CMASK(sbi, end) +
                               sbi->s_cluster_ratio - 1;
                                l_del = __es_scan_range(inode,
                                                        &ext4_es_is_delonly,
                                                        end + 1, last);
 -                      if (l_del)
 -                              __insert_pending(inode, last);
 -                      else
 +                      if (l_del) {
 +                              ret = __insert_pending(inode, last, prealloc);
 +                              if (ret < 0)
 +                                      goto out;
 +                      } else
                                __remove_pending(inode, last);
                }
        } else {
                if (first != lblk)
                        f_del = __es_scan_range(inode, &ext4_es_is_delonly,
                                                first, lblk - 1);
 -              if (f_del)
 -                      __insert_pending(inode, first);
 -              else
 +              if (f_del) {
 +                      ret = __insert_pending(inode, first, prealloc);
 +                      if (ret < 0)
 +                              goto out;
 +              } else
                        __remove_pending(inode, first);
  
                last = EXT4_LBLK_CMASK(sbi, end) + sbi->s_cluster_ratio - 1;
                if (last != end)
                        l_del = __es_scan_range(inode, &ext4_es_is_delonly,
                                                end + 1, last);
 -              if (l_del)
 -                      __insert_pending(inode, last);
 -              else
 +              if (l_del) {
 +                      ret = __insert_pending(inode, last, prealloc);
 +                      if (ret < 0)
 +                              goto out;
 +              } else
                        __remove_pending(inode, last);
        }
 +out:
 +      return ret;
  }
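The err3/pr plumbing added above follows a single pattern: attempt the insert under i_es_lock with a cheap GFP_ATOMIC allocation, and only after a failure drop the lock, preallocate with __GFP_NOFAIL, and retry, so the operation cannot fail permanently. A condensed, non-authoritative restatement (the wrapper name is made up, the es1/es2 preallocations are elided, and it would conceptually live in extents_status.c next to the static helpers it calls):

static void example_revise_with_prealloc(struct inode *inode,
					 ext4_lblk_t lblk, ext4_lblk_t len)
{
	struct pending_reservation *pr = NULL;
	int err = 0;

retry:
	/* the sleeping __GFP_NOFAIL prealloc happens only on the retry
	 * path, while i_es_lock is not held */
	if (err && !pr)
		pr = __alloc_pending(true);

	write_lock(&EXT4_I(inode)->i_es_lock);
	err = __revise_pending(inode, lblk, len, &pr);	/* may consume pr */
	write_unlock(&EXT4_I(inode)->i_es_lock);

	if (err)
		goto retry;
	if (pr)
		__free_pending(pr);	/* the prealloc ended up unused */
}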
diff --combined fs/ext4/inode.c
index a6838f54ae91698b7fce3e6ecf948a299097f66b,347fc8986e93b6aeb5f61b9b20d1fba6baae0780..61277f7f87225a0a69701922ad84f75290a5a113
@@@ -789,22 -789,10 +789,22 @@@ int ext4_get_block(struct inode *inode
  int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
                             struct buffer_head *bh_result, int create)
  {
 +      int ret = 0;
 +
        ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
                   inode->i_ino, create);
 -      return _ext4_get_block(inode, iblock, bh_result,
 +      ret = _ext4_get_block(inode, iblock, bh_result,
                               EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
 +
 +      /*
 +       * If the buffer is marked unwritten, mark it as new to make sure it is
 +       * zeroed out correctly in case of partial writes. Otherwise, there is
 +       * a chance of stale data getting exposed.
 +       */
 +      if (ret == 0 && buffer_unwritten(bh_result))
 +              set_buffer_new(bh_result);
 +
 +      return ret;
  }
  
  /* Maximum number of blocks we map for direct IO at once. */
@@@ -1032,10 -1020,8 +1032,8 @@@ static int ext4_block_write_begin(struc
        BUG_ON(from > to);
  
        head = folio_buffers(folio);
-       if (!head) {
-               create_empty_buffers(&folio->page, blocksize, 0);
-               head = folio_buffers(folio);
-       }
+       if (!head)
+               head = create_empty_buffers(folio, blocksize, 0);
        bbits = ilog2(blocksize);
        block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
  
@@@ -1165,7 -1151,7 +1163,7 @@@ retry_grab
         * starting the handle.
         */
        if (!folio_buffers(folio))
-               create_empty_buffers(&folio->page, inode->i_sb->s_blocksize, 0);
+               create_empty_buffers(folio, inode->i_sb->s_blocksize, 0);
  
        folio_unlock(folio);
  
@@@ -3655,10 -3641,8 +3653,8 @@@ static int __ext4_block_zero_page_range
        iblock = index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
  
        bh = folio_buffers(folio);
-       if (!bh) {
-               create_empty_buffers(&folio->page, blocksize, 0);
-               bh = folio_buffers(folio);
-       }
+       if (!bh)
+               bh = create_empty_buffers(folio, blocksize, 0);
  
        /* Find the buffer that contains "offset" */
        pos = blocksize;
@@@ -4032,7 -4016,7 +4028,7 @@@ int ext4_punch_hole(struct file *file, 
        if (IS_SYNC(inode))
                ext4_handle_sync(handle);
  
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        ret2 = ext4_mark_inode_dirty(handle, inode);
        if (unlikely(ret2))
                ret = ret2;
@@@ -4192,7 -4176,7 +4188,7 @@@ out_stop
        if (inode->i_nlink)
                ext4_orphan_del(handle, inode);
  
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        err2 = ext4_mark_inode_dirty(handle, inode);
        if (unlikely(err2 && !err))
                err = err2;
@@@ -4296,8 -4280,8 +4292,8 @@@ static int ext4_fill_raw_inode(struct i
        raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
  
        EXT4_INODE_SET_CTIME(inode, raw_inode);
 -      EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
 -      EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
 +      EXT4_INODE_SET_MTIME(inode, raw_inode);
 +      EXT4_INODE_SET_ATIME(inode, raw_inode);
        EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);
  
        raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
@@@ -4905,8 -4889,8 +4901,8 @@@ struct inode *__ext4_iget(struct super_
        }
  
        EXT4_INODE_GET_CTIME(inode, raw_inode);
 -      EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode);
 -      EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
 +      EXT4_INODE_GET_ATIME(inode, raw_inode);
 +      EXT4_INODE_GET_MTIME(inode, raw_inode);
        EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
  
        if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
@@@ -5031,8 -5015,8 +5027,8 @@@ static void __ext4_update_other_inode_t
  
                spin_lock(&ei->i_raw_lock);
                EXT4_INODE_SET_CTIME(inode, raw_inode);
 -              EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
 -              EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
 +              EXT4_INODE_SET_MTIME(inode, raw_inode);
 +              EXT4_INODE_SET_ATIME(inode, raw_inode);
                ext4_inode_csum_set(inode, raw_inode, ei);
                spin_unlock(&ei->i_raw_lock);
                trace_ext4_other_inode_update_time(inode, orig_ino);
@@@ -5425,8 -5409,7 +5421,8 @@@ int ext4_setattr(struct mnt_idmap *idma
                         * update c/mtime in shrink case below
                         */
                        if (!shrink)
 -                              inode->i_mtime = inode_set_ctime_current(inode);
 +                              inode_set_mtime_to_ts(inode,
 +                                                    inode_set_ctime_current(inode));
  
                        if (shrink)
                                ext4_fc_track_range(handle, inode,
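The repeated two-line change above is the whole of the new create_empty_buffers() contract: it takes a folio and returns the head buffer, so the old "create, then look up again" sequence disappears. A minimal sketch (the helper name is illustrative):

#include <linux/buffer_head.h>

static struct buffer_head *example_folio_buffers(struct folio *folio,
						 unsigned long blocksize)
{
	struct buffer_head *head = folio_buffers(folio);

	if (!head)
		head = create_empty_buffers(folio, blocksize, 0);
	return head;
}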
diff --combined fs/ext4/super.c
index 77e2b694c7d5d14a1451795ad2ae5677884b7323,56a08fc5c5d524134a50e74a3a229c576531e314..54a9dde7483a5a505f41f75fde05a763a114c443
@@@ -244,18 -244,25 +244,25 @@@ static struct buffer_head *__ext4_sb_br
  struct buffer_head *ext4_sb_bread(struct super_block *sb, sector_t block,
                                   blk_opf_t op_flags)
  {
-       return __ext4_sb_bread_gfp(sb, block, op_flags, __GFP_MOVABLE);
+       gfp_t gfp = mapping_gfp_constraint(sb->s_bdev->bd_inode->i_mapping,
+                       ~__GFP_FS) | __GFP_MOVABLE;
+       return __ext4_sb_bread_gfp(sb, block, op_flags, gfp);
  }
  
  struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
                                            sector_t block)
  {
-       return __ext4_sb_bread_gfp(sb, block, 0, 0);
+       gfp_t gfp = mapping_gfp_constraint(sb->s_bdev->bd_inode->i_mapping,
+                       ~__GFP_FS);
+       return __ext4_sb_bread_gfp(sb, block, 0, gfp);
  }
  
  void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
  {
-       struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
+       struct buffer_head *bh = bdev_getblk(sb->s_bdev, block,
+                       sb->s_blocksize, GFP_NOWAIT | __GFP_NOWARN);
  
        if (likely(bh)) {
                if (trylock_buffer(bh))
@@@ -768,8 -775,7 +775,8 @@@ static void update_super_work(struct wo
         */
        if (!sb_rdonly(sbi->s_sb) && journal) {
                struct buffer_head *sbh = sbi->s_sbh;
 -              bool call_notify_err;
 +              bool call_notify_err = false;
 +
                handle = jbd2_journal_start(journal, 1);
                if (IS_ERR(handle))
                        goto write_directly;
@@@ -1352,14 -1358,14 +1359,14 @@@ static void ext4_put_super(struct super
  
        sync_blockdev(sb->s_bdev);
        invalidate_bdev(sb->s_bdev);
 -      if (sbi->s_journal_bdev) {
 +      if (sbi->s_journal_bdev_handle) {
                /*
                 * Invalidate the journal device's buffers.  We don't want them
                 * floating about in memory - the physical journal device may be
                 * hotswapped, and it breaks the `ro-after' testing code.
                 */
 -              sync_blockdev(sbi->s_journal_bdev);
 -              invalidate_bdev(sbi->s_journal_bdev);
 +              sync_blockdev(sbi->s_journal_bdev_handle->bdev);
 +              invalidate_bdev(sbi->s_journal_bdev_handle->bdev);
        }
  
        ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
@@@ -4234,7 -4240,7 +4241,7 @@@ int ext4_calculate_overhead(struct supe
         * Add the internal journal blocks whether the journal has been
         * loaded or not
         */
 -      if (sbi->s_journal && !sbi->s_journal_bdev)
 +      if (sbi->s_journal && !sbi->s_journal_bdev_handle)
                overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_total_len);
        else if (ext4_has_feature_journal(sb) && !sbi->s_journal && j_inum) {
                /* j_inum for internal journal is non-zero */
@@@ -5671,9 -5677,9 +5678,9 @@@ failed_mount
  #endif
        fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
        brelse(sbi->s_sbh);
 -      if (sbi->s_journal_bdev) {
 -              invalidate_bdev(sbi->s_journal_bdev);
 -              blkdev_put(sbi->s_journal_bdev, sb);
 +      if (sbi->s_journal_bdev_handle) {
 +              invalidate_bdev(sbi->s_journal_bdev_handle->bdev);
 +              bdev_release(sbi->s_journal_bdev_handle);
        }
  out_fail:
        invalidate_bdev(sb->s_bdev);
@@@ -5843,13 -5849,12 +5850,13 @@@ static journal_t *ext4_open_inode_journ
        return journal;
  }
  
 -static struct block_device *ext4_get_journal_blkdev(struct super_block *sb,
 +static struct bdev_handle *ext4_get_journal_blkdev(struct super_block *sb,
                                        dev_t j_dev, ext4_fsblk_t *j_start,
                                        ext4_fsblk_t *j_len)
  {
        struct buffer_head *bh;
        struct block_device *bdev;
 +      struct bdev_handle *bdev_handle;
        int hblock, blocksize;
        ext4_fsblk_t sb_block;
        unsigned long offset;
  
        /* see get_tree_bdev why this is needed and safe */
        up_write(&sb->s_umount);
 -      bdev = blkdev_get_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE, sb,
 -                               &fs_holder_ops);
 +      bdev_handle = bdev_open_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
 +                                     sb, &fs_holder_ops);
        down_write(&sb->s_umount);
 -      if (IS_ERR(bdev)) {
 +      if (IS_ERR(bdev_handle)) {
                ext4_msg(sb, KERN_ERR,
                         "failed to open journal device unknown-block(%u,%u) %ld",
 -                       MAJOR(j_dev), MINOR(j_dev), PTR_ERR(bdev));
 -              return ERR_CAST(bdev);
 +                       MAJOR(j_dev), MINOR(j_dev), PTR_ERR(bdev_handle));
 +              return bdev_handle;
        }
  
 +      bdev = bdev_handle->bdev;
        blocksize = sb->s_blocksize;
        hblock = bdev_logical_block_size(bdev);
        if (blocksize < hblock) {
        *j_start = sb_block + 1;
        *j_len = ext4_blocks_count(es);
        brelse(bh);
 -      return bdev;
 +      return bdev_handle;
  
  out_bh:
        brelse(bh);
  out_bdev:
 -      blkdev_put(bdev, sb);
 +      bdev_release(bdev_handle);
        return ERR_PTR(errno);
  }
  
@@@ -5930,14 -5934,14 +5937,14 @@@ static journal_t *ext4_open_dev_journal
        journal_t *journal;
        ext4_fsblk_t j_start;
        ext4_fsblk_t j_len;
 -      struct block_device *journal_bdev;
 +      struct bdev_handle *bdev_handle;
        int errno = 0;
  
 -      journal_bdev = ext4_get_journal_blkdev(sb, j_dev, &j_start, &j_len);
 -      if (IS_ERR(journal_bdev))
 -              return ERR_CAST(journal_bdev);
 +      bdev_handle = ext4_get_journal_blkdev(sb, j_dev, &j_start, &j_len);
 +      if (IS_ERR(bdev_handle))
 +              return ERR_CAST(bdev_handle);
  
 -      journal = jbd2_journal_init_dev(journal_bdev, sb->s_bdev, j_start,
 +      journal = jbd2_journal_init_dev(bdev_handle->bdev, sb->s_bdev, j_start,
                                        j_len, sb->s_blocksize);
        if (IS_ERR(journal)) {
                ext4_msg(sb, KERN_ERR, "failed to create device journal");
                goto out_journal;
        }
        journal->j_private = sb;
 -      EXT4_SB(sb)->s_journal_bdev = journal_bdev;
 +      EXT4_SB(sb)->s_journal_bdev_handle = bdev_handle;
        ext4_init_journal_params(sb, journal);
        return journal;
  
  out_journal:
        jbd2_journal_destroy(journal);
  out_bdev:
 -      blkdev_put(journal_bdev, sb);
 +      bdev_release(bdev_handle);
        return ERR_PTR(errno);
  }
  
@@@ -6445,7 -6449,6 +6452,7 @@@ static int __ext4_remount(struct fs_con
        struct ext4_mount_options old_opts;
        ext4_group_t g;
        int err = 0;
 +      int alloc_ctx;
  #ifdef CONFIG_QUOTA
        int enable_quota = 0;
        int i, j;
  
        }
  
 +      /*
 +       * Changing the DIOREAD_NOLOCK or DELALLOC mount options may cause
 +       * two calls to ext4_should_dioread_nolock() to return inconsistent
 +       * values, triggering WARN_ON in ext4_add_complete_io(). We grab
 +       * s_writepages_rwsem here to avoid a race between writepages ops
 +       * and remount.
 +       */
 +      alloc_ctx = ext4_writepages_down_write(sb);
        ext4_apply_options(fc, sb);
 +      ext4_writepages_up_write(sb, alloc_ctx);
  
        if ((old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) ^
            test_opt(sb, JOURNAL_CHECKSUM)) {
@@@ -6713,8 -6707,6 +6720,8 @@@ restore_opts
        if (sb_rdonly(sb) && !(old_sb_flags & SB_RDONLY) &&
            sb_any_quota_suspended(sb))
                dquot_resume(sb, -1);
 +
 +      alloc_ctx = ext4_writepages_down_write(sb);
        sb->s_flags = old_sb_flags;
        sbi->s_mount_opt = old_opts.s_mount_opt;
        sbi->s_mount_opt2 = old_opts.s_mount_opt2;
        sbi->s_commit_interval = old_opts.s_commit_interval;
        sbi->s_min_batch_time = old_opts.s_min_batch_time;
        sbi->s_max_batch_time = old_opts.s_max_batch_time;
 +      ext4_writepages_up_write(sb, alloc_ctx);
 +
        if (!test_opt(sb, BLOCK_VALIDITY) && sbi->s_system_blks)
                ext4_release_system_zone(sb);
  #ifdef CONFIG_QUOTA
@@@ -7144,7 -7134,7 +7151,7 @@@ static int ext4_quota_off(struct super_
        }
        EXT4_I(inode)->i_flags &= ~(EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL);
        inode_set_flags(inode, 0, S_NOATIME | S_IMMUTABLE);
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        err = ext4_mark_inode_dirty(handle, inode);
        ext4_journal_stop(handle);
  out_unlock:
@@@ -7317,12 -7307,12 +7324,12 @@@ static inline int ext3_feature_set_ok(s
  static void ext4_kill_sb(struct super_block *sb)
  {
        struct ext4_sb_info *sbi = EXT4_SB(sb);
 -      struct block_device *journal_bdev = sbi ? sbi->s_journal_bdev : NULL;
 +      struct bdev_handle *handle = sbi ? sbi->s_journal_bdev_handle : NULL;
  
        kill_block_super(sb);
  
 -      if (journal_bdev)
 -              blkdev_put(journal_bdev, sb);
 +      if (handle)
 +              bdev_release(handle);
  }
  
  static struct file_system_type ext4_fs_type = {
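The journal-device conversion above swaps blkdev_get_by_dev()/blkdev_put() for the handle-based API: the struct block_device now sits behind handle->bdev and teardown is a single bdev_release(). A hedged sketch (the helper names and the NULL holder ops are illustrative; the f2fs hunk below uses bdev_open_by_path() the same way):

#include <linux/blkdev.h>
#include <linux/printk.h>

static struct bdev_handle *example_open_rw(dev_t devt, void *holder)
{
	struct bdev_handle *handle;

	handle = bdev_open_by_dev(devt, BLK_OPEN_READ | BLK_OPEN_WRITE,
				  holder, NULL);
	if (IS_ERR(handle))
		return handle;

	pr_info("opened %pg\n", handle->bdev);	/* block_device via ->bdev */
	return handle;
}

static void example_close(struct bdev_handle *handle)
{
	bdev_release(handle);		/* replaces blkdev_put(bdev, holder) */
}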
diff --combined fs/f2fs/super.c
index be17d77513d56c60c55598c099a90256c4d1a804,fe25ff9cebbee8db45904923d8858ac63db3c84c..05f9f7b6ebf8c63a2f482bd1ba1a50836727cd40
@@@ -83,11 -83,26 +83,26 @@@ void f2fs_build_fault_attr(struct f2fs_
  #endif
  
  /* f2fs-wide shrinker description */
- static struct shrinker f2fs_shrinker_info = {
-       .scan_objects = f2fs_shrink_scan,
-       .count_objects = f2fs_shrink_count,
-       .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *f2fs_shrinker_info;
+ static int __init f2fs_init_shrinker(void)
+ {
+       f2fs_shrinker_info = shrinker_alloc(0, "f2fs-shrinker");
+       if (!f2fs_shrinker_info)
+               return -ENOMEM;
+       f2fs_shrinker_info->count_objects = f2fs_shrink_count;
+       f2fs_shrinker_info->scan_objects = f2fs_shrink_scan;
+       shrinker_register(f2fs_shrinker_info);
+       return 0;
+ }
+ static void f2fs_exit_shrinker(void)
+ {
+       shrinker_free(f2fs_shrinker_info);
+ }
  
  enum {
        Opt_gc_background,
@@@ -1562,7 -1577,7 +1577,7 @@@ static void destroy_device_list(struct 
  
        for (i = 0; i < sbi->s_ndevs; i++) {
                if (i > 0)
 -                      blkdev_put(FDEV(i).bdev, sbi->sb);
 +                      bdev_release(FDEV(i).bdev_handle);
  #ifdef CONFIG_BLK_DEV_ZONED
                kvfree(FDEV(i).blkz_seq);
  #endif
@@@ -2710,7 -2725,7 +2725,7 @@@ retry
  
        if (len == towrite)
                return err;
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        f2fs_mark_inode_dirty_sync(inode, false);
        return len - towrite;
  }
@@@ -3203,6 -3218,13 +3218,6 @@@ static bool f2fs_has_stable_inodes(stru
        return true;
  }
  
 -static void f2fs_get_ino_and_lblk_bits(struct super_block *sb,
 -                                     int *ino_bits_ret, int *lblk_bits_ret)
 -{
 -      *ino_bits_ret = 8 * sizeof(nid_t);
 -      *lblk_bits_ret = 8 * sizeof(block_t);
 -}
 -
  static struct block_device **f2fs_get_devices(struct super_block *sb,
                                              unsigned int *num_devs)
  {
  }
  
  static const struct fscrypt_operations f2fs_cryptops = {
 -      .key_prefix             = "f2fs:",
 +      .needs_bounce_pages     = 1,
 +      .has_32bit_inodes       = 1,
 +      .supports_subblock_data_units = 1,
 +      .legacy_key_prefix      = "f2fs:",
        .get_context            = f2fs_get_context,
        .set_context            = f2fs_set_context,
        .get_dummy_policy       = f2fs_get_dummy_policy,
        .empty_dir              = f2fs_empty_dir,
        .has_stable_inodes      = f2fs_has_stable_inodes,
 -      .get_ino_and_lblk_bits  = f2fs_get_ino_and_lblk_bits,
        .get_devices            = f2fs_get_devices,
  };
  #endif
@@@ -4193,7 -4213,7 +4208,7 @@@ static int f2fs_scan_devices(struct f2f
  
        for (i = 0; i < max_devices; i++) {
                if (i == 0)
 -                      FDEV(0).bdev = sbi->sb->s_bdev;
 +                      FDEV(0).bdev_handle = sbi->sb->s_bdev_handle;
                else if (!RDEV(i).path[0])
                        break;
  
                                FDEV(i).end_blk = FDEV(i).start_blk +
                                        (FDEV(i).total_segments <<
                                        sbi->log_blocks_per_seg) - 1;
 -                              FDEV(i).bdev = blkdev_get_by_path(FDEV(i).path,
 -                                      mode, sbi->sb, NULL);
 +                              FDEV(i).bdev_handle = bdev_open_by_path(
 +                                      FDEV(i).path, mode, sbi->sb, NULL);
                        }
                }
 -              if (IS_ERR(FDEV(i).bdev))
 -                      return PTR_ERR(FDEV(i).bdev);
 +              if (IS_ERR(FDEV(i).bdev_handle))
 +                      return PTR_ERR(FDEV(i).bdev_handle);
  
 +              FDEV(i).bdev = FDEV(i).bdev_handle->bdev;
                /* to release errored devices */
                sbi->s_ndevs = i + 1;
  
@@@ -4940,7 -4959,7 +4955,7 @@@ static int __init init_f2fs_fs(void
        err = f2fs_init_sysfs();
        if (err)
                goto free_garbage_collection_cache;
-       err = register_shrinker(&f2fs_shrinker_info, "f2fs-shrinker");
+       err = f2fs_init_shrinker();
        if (err)
                goto free_sysfs;
        err = register_filesystem(&f2fs_fs_type);
@@@ -4985,7 -5004,7 +5000,7 @@@ free_root_stats
        f2fs_destroy_root_stats();
        unregister_filesystem(&f2fs_fs_type);
  free_shrinker:
-       unregister_shrinker(&f2fs_shrinker_info);
+       f2fs_exit_shrinker();
  free_sysfs:
        f2fs_exit_sysfs();
  free_garbage_collection_cache:
@@@ -5017,7 -5036,7 +5032,7 @@@ static void __exit exit_f2fs_fs(void
        f2fs_destroy_post_read_processing();
        f2fs_destroy_root_stats();
        unregister_filesystem(&f2fs_fs_type);
-       unregister_shrinker(&f2fs_shrinker_info);
+       f2fs_exit_shrinker();
        f2fs_exit_sysfs();
        f2fs_destroy_garbage_collection_cache();
        f2fs_destroy_extent_cache();
diff --combined fs/gfs2/bmap.c
index 011cd992e0e6d23ca54d5740255ff12155464506,f1eee3f4704b61cf7c54d1eaba933325e8173c70..6eb6f1bd9e34b59c130cecd02a5a43c03bf9bfaf
@@@ -43,53 -43,51 +43,51 @@@ struct metapath 
  static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length);
  
  /**
-  * gfs2_unstuffer_page - unstuff a stuffed inode into a block cached by a page
+  * gfs2_unstuffer_folio - unstuff a stuffed inode into a block cached by a folio
   * @ip: the inode
   * @dibh: the dinode buffer
   * @block: the block number that was allocated
-  * @page: The (optional) page. This is looked up if @page is NULL
+  * @folio: The folio.
   *
   * Returns: errno
   */
- static int gfs2_unstuffer_page(struct gfs2_inode *ip, struct buffer_head *dibh,
-                              u64 block, struct page *page)
+ static int gfs2_unstuffer_folio(struct gfs2_inode *ip, struct buffer_head *dibh,
+                              u64 block, struct folio *folio)
  {
        struct inode *inode = &ip->i_inode;
  
-       if (!PageUptodate(page)) {
-               void *kaddr = kmap(page);
+       if (!folio_test_uptodate(folio)) {
+               void *kaddr = kmap_local_folio(folio, 0);
                u64 dsize = i_size_read(inode);
   
                memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize);
-               memset(kaddr + dsize, 0, PAGE_SIZE - dsize);
-               kunmap(page);
+               memset(kaddr + dsize, 0, folio_size(folio) - dsize);
+               kunmap_local(kaddr);
  
-               SetPageUptodate(page);
+               folio_mark_uptodate(folio);
        }
  
        if (gfs2_is_jdata(ip)) {
-               struct buffer_head *bh;
+               struct buffer_head *bh = folio_buffers(folio);
  
-               if (!page_has_buffers(page))
-                       create_empty_buffers(page, BIT(inode->i_blkbits),
-                                            BIT(BH_Uptodate));
+               if (!bh)
+                       bh = create_empty_buffers(folio,
+                               BIT(inode->i_blkbits), BIT(BH_Uptodate));
  
-               bh = page_buffers(page);
                if (!buffer_mapped(bh))
                        map_bh(bh, inode->i_sb, block);
  
                set_buffer_uptodate(bh);
                gfs2_trans_add_data(ip->i_gl, bh);
        } else {
-               set_page_dirty(page);
+               folio_mark_dirty(folio);
                gfs2_ordered_add_inode(ip);
        }
  
        return 0;
  }
  
- static int __gfs2_unstuff_inode(struct gfs2_inode *ip, struct page *page)
+ static int __gfs2_unstuff_inode(struct gfs2_inode *ip, struct folio *folio)
  {
        struct buffer_head *bh, *dibh;
        struct gfs2_dinode *di;
                                              dibh, sizeof(struct gfs2_dinode));
                        brelse(bh);
                } else {
-                       error = gfs2_unstuffer_page(ip, dibh, block, page);
+                       error = gfs2_unstuffer_folio(ip, dibh, block, folio);
                        if (error)
                                goto out_brelse;
                }
@@@ -157,17 -155,17 +155,17 @@@ out_brelse
  int gfs2_unstuff_dinode(struct gfs2_inode *ip)
  {
        struct inode *inode = &ip->i_inode;
-       struct page *page;
+       struct folio *folio;
        int error;
  
        down_write(&ip->i_rw_mutex);
-       page = grab_cache_page(inode->i_mapping, 0);
-       error = -ENOMEM;
-       if (!page)
+       folio = filemap_grab_folio(inode->i_mapping, 0);
+       error = PTR_ERR(folio);
+       if (IS_ERR(folio))
                goto out;
-       error = __gfs2_unstuff_inode(ip, page);
-       unlock_page(page);
-       put_page(page);
+       error = __gfs2_unstuff_inode(ip, folio);
+       folio_unlock(folio);
+       folio_put(folio);
  out:
        up_write(&ip->i_rw_mutex);
        return error;
@@@ -1386,7 -1384,7 +1384,7 @@@ static int trunc_start(struct inode *in
                ip->i_diskflags |= GFS2_DIF_TRUNC_IN_PROG;
  
        i_size_write(inode, newsize);
 -      ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
 +      inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
        gfs2_dinode_out(ip, dibh->b_data);
  
        if (journaled)
@@@ -1583,7 -1581,7 +1581,7 @@@ out_unlock
  
                        /* Every transaction boundary, we rewrite the dinode
                           to keep its di_blocks current in case of failure. */
 -                      ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
 +                      inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
                        gfs2_trans_add_meta(ip->i_gl, dibh);
                        gfs2_dinode_out(ip, dibh->b_data);
                        brelse(dibh);
@@@ -1949,7 -1947,7 +1947,7 @@@ static int punch_hole(struct gfs2_inod
                gfs2_statfs_change(sdp, 0, +btotal, 0);
                gfs2_quota_change(ip, -(s64)btotal, ip->i_inode.i_uid,
                                  ip->i_inode.i_gid);
 -              ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
 +              inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
                gfs2_trans_add_meta(ip->i_gl, dibh);
                gfs2_dinode_out(ip, dibh->b_data);
                up_write(&ip->i_rw_mutex);
@@@ -1992,7 -1990,7 +1990,7 @@@ static int trunc_end(struct gfs2_inode 
                gfs2_buffer_clear_tail(dibh, sizeof(struct gfs2_dinode));
                gfs2_ordered_del_inode(ip);
        }
 -      ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
 +      inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
        ip->i_diskflags &= ~GFS2_DIF_TRUNC_IN_PROG;
  
        gfs2_trans_add_meta(ip->i_gl, dibh);
@@@ -2093,7 -2091,7 +2091,7 @@@ static int do_grow(struct inode *inode
                goto do_end_trans;
  
        truncate_setsize(inode, size);
 -      ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
 +      inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
        gfs2_trans_add_meta(ip->i_gl, dibh);
        gfs2_dinode_out(ip, dibh->b_data);
        brelse(dibh);
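
The gfs2_unstuffer_folio() conversion above bundles several folio idioms that recur throughout this merge: filemap_grab_folio() reports failure as an ERR_PTR rather than NULL, create_empty_buffers() now returns the buffer head, and kmap()/kunmap() become kmap_local_folio()/kunmap_local(). A condensed, illustrative sketch under the same single-page-folio assumption as the gfs2 call site (the function itself is hypothetical, not gfs2 code):

#include <linux/pagemap.h>
#include <linux/buffer_head.h>
#include <linux/highmem.h>

static int example_zero_first_folio(struct address_space *mapping,
				    struct inode *inode)
{
	struct folio *folio = filemap_grab_folio(mapping, 0);
	struct buffer_head *bh;
	void *kaddr;

	if (IS_ERR(folio))			/* no more NULL check plus -ENOMEM */
		return PTR_ERR(folio);

	bh = folio_buffers(folio);
	if (!bh)				/* create_empty_buffers() hands back the head */
		bh = create_empty_buffers(folio, 1 << inode->i_blkbits, 0);
	set_buffer_uptodate(bh);

	kaddr = kmap_local_folio(folio, 0);	/* replaces kmap()/kunmap() */
	memset(kaddr, 0, folio_size(folio));	/* order-0 folio assumed, as in gfs2 */
	kunmap_local(kaddr);

	folio_mark_uptodate(folio);
	folio_mark_dirty(folio);
	folio_unlock(folio);
	folio_put(folio);
	return 0;
}
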
diff --combined fs/gfs2/glock.c
index 3772a5d9e85c055b4a6cfedae92c835ba97d8675,482291bb08d860ee13694b78c4104849ddd7694e..d5fa75eac0bfee8ea4c677c798313eb79e3175ea
@@@ -2041,11 -2041,7 +2041,7 @@@ static unsigned long gfs2_glock_shrink_
        return vfs_pressure_ratio(atomic_read(&lru_count));
  }
  
- static struct shrinker glock_shrinker = {
-       .seeks = DEFAULT_SEEKS,
-       .count_objects = gfs2_glock_shrink_count,
-       .scan_objects = gfs2_glock_shrink_scan,
- };
+ static struct shrinker *glock_shrinker;
  
  /**
   * glock_hash_walk - Call a function for glock in a hash bucket
@@@ -2465,13 -2461,18 +2461,18 @@@ int __init gfs2_glock_init(void
                return -ENOMEM;
        }
  
-       ret = register_shrinker(&glock_shrinker, "gfs2-glock");
-       if (ret) {
+       glock_shrinker = shrinker_alloc(0, "gfs2-glock");
+       if (!glock_shrinker) {
                destroy_workqueue(glock_workqueue);
                rhashtable_destroy(&gl_hash_table);
-               return ret;
+               return -ENOMEM;
        }
  
+       glock_shrinker->count_objects = gfs2_glock_shrink_count;
+       glock_shrinker->scan_objects = gfs2_glock_shrink_scan;
+       shrinker_register(glock_shrinker);
        for (i = 0; i < GLOCK_WAIT_TABLE_SIZE; i++)
                init_waitqueue_head(glock_wait_table + i);
  
  
  void gfs2_glock_exit(void)
  {
-       unregister_shrinker(&glock_shrinker);
+       shrinker_free(glock_shrinker);
        rhashtable_destroy(&gl_hash_table);
        destroy_workqueue(glock_workqueue);
  }
@@@ -2719,19 -2720,16 +2720,19 @@@ static struct file *gfs2_glockfd_next_f
        for(;; i->fd++) {
                struct inode *inode;
  
 -              i->file = task_lookup_next_fd_rcu(i->task, &i->fd);
 +              i->file = task_lookup_next_fdget_rcu(i->task, &i->fd);
                if (!i->file) {
                        i->fd = 0;
                        break;
                }
 +
                inode = file_inode(i->file);
 -              if (inode->i_sb != i->sb)
 -                      continue;
 -              if (get_file_rcu(i->file))
 +              if (inode->i_sb == i->sb)
                        break;
 +
 +              rcu_read_unlock();
 +              fput(i->file);
 +              rcu_read_lock();
        }
        rcu_read_unlock();
        return i->file;
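
The glock shrinker hunks above use the dynamically allocated shrinker API that most subsystems in this diff convert to: allocate, fill in the callbacks, register, and later free. A minimal sketch of that lifecycle with placeholder names (example_count()/example_scan() stand in for real count/scan callbacks):

#include <linux/shrinker.h>

static unsigned long example_count(struct shrinker *s, struct shrink_control *sc);
static unsigned long example_scan(struct shrinker *s, struct shrink_control *sc);

static struct shrinker *example_shrinker;

static int __init example_init(void)
{
	example_shrinker = shrinker_alloc(0, "example");	/* flags, name */
	if (!example_shrinker)
		return -ENOMEM;

	example_shrinker->count_objects = example_count;
	example_shrinker->scan_objects = example_scan;
	shrinker_register(example_shrinker);	/* visible to reclaim from here on */
	return 0;
}

static void __exit example_exit(void)
{
	shrinker_free(example_shrinker);	/* replaces unregister_shrinker() */
}

Callers that want a non-default ->seeks assign it before shrinker_register(), as the nfsd filecache conversion further down does, and per-instance state goes in ->private_data instead of container_of(), as in the nfsd-client shrinker.
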
diff --combined fs/gfs2/quota.c
index d9854aece15b5612441bb1d42762d3e95c597bff,2f1328af34f4d2e935b9dc3f7ec4414c4046a46b..5cbbc1a46a92bb46e77bc9aa690f5a035acd357c
@@@ -196,13 -196,26 +196,26 @@@ static unsigned long gfs2_qd_shrink_cou
        return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
  }
  
- struct shrinker gfs2_qd_shrinker = {
-       .count_objects = gfs2_qd_shrink_count,
-       .scan_objects = gfs2_qd_shrink_scan,
-       .seeks = DEFAULT_SEEKS,
-       .flags = SHRINKER_NUMA_AWARE,
- };
+ static struct shrinker *gfs2_qd_shrinker;
  
+ int __init gfs2_qd_shrinker_init(void)
+ {
+       gfs2_qd_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "gfs2-qd");
+       if (!gfs2_qd_shrinker)
+               return -ENOMEM;
+       gfs2_qd_shrinker->count_objects = gfs2_qd_shrink_count;
+       gfs2_qd_shrinker->scan_objects = gfs2_qd_shrink_scan;
+       shrinker_register(gfs2_qd_shrinker);
+       return 0;
+ }
+ void gfs2_qd_shrinker_exit(void)
+ {
+       shrinker_free(gfs2_qd_shrinker);
+ }
  
  static u64 qd2index(struct gfs2_quota_data *qd)
  {
@@@ -736,7 -749,7 +749,7 @@@ static int gfs2_write_buf_to_page(struc
        struct gfs2_inode *ip = GFS2_I(sdp->sd_quota_inode);
        struct inode *inode = &ip->i_inode;
        struct address_space *mapping = inode->i_mapping;
-       struct page *page;
+       struct folio *folio;
        struct buffer_head *bh;
        u64 blk;
        unsigned bsize = sdp->sd_sb.sb_bsize, bnum = 0, boff = 0;
        blk = index << (PAGE_SHIFT - sdp->sd_sb.sb_bsize_shift);
        boff = off % bsize;
  
-       page = grab_cache_page(mapping, index);
-       if (!page)
-               return -ENOMEM;
-       if (!page_has_buffers(page))
-               create_empty_buffers(page, bsize, 0);
+       folio = filemap_grab_folio(mapping, index);
+       if (IS_ERR(folio))
+               return PTR_ERR(folio);
+       bh = folio_buffers(folio);
+       if (!bh)
+               bh = create_empty_buffers(folio, bsize, 0);
  
-       bh = page_buffers(page);
-       for(;;) {
-               /* Find the beginning block within the page */
+       for (;;) {
+               /* Find the beginning block within the folio */
                if (pg_off >= ((bnum * bsize) + bsize)) {
                        bh = bh->b_this_page;
                        bnum++;
                                goto unlock_out;
                        /* If it's a newly allocated disk block, zero it */
                        if (buffer_new(bh))
-                               zero_user(page, bnum * bsize, bh->b_size);
+                               folio_zero_range(folio, bnum * bsize,
+                                               bh->b_size);
                }
-               if (PageUptodate(page))
+               if (folio_test_uptodate(folio))
                        set_buffer_uptodate(bh);
                if (bh_read(bh, REQ_META | REQ_PRIO) < 0)
                        goto unlock_out;
                break;
        }
  
-       /* Write to the page, now that we have setup the buffer(s) */
-       memcpy_to_page(page, off, buf, bytes);
-       flush_dcache_page(page);
-       unlock_page(page);
-       put_page(page);
+       /* Write to the folio, now that we have setup the buffer(s) */
+       memcpy_to_folio(folio, off, buf, bytes);
+       flush_dcache_folio(folio);
+       folio_unlock(folio);
+       folio_put(folio);
  
        return 0;
  
  unlock_out:
-       unlock_page(page);
-       put_page(page);
+       folio_unlock(folio);
+       folio_put(folio);
        return -EIO;
  }
  
@@@ -886,7 -900,7 +900,7 @@@ static int gfs2_adjust_quota(struct gfs
                size = loc + sizeof(struct gfs2_quota);
                if (size > inode->i_size)
                        i_size_write(inode, size);
 -              inode->i_mtime = inode_set_ctime_current(inode);
 +              inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
                mark_inode_dirty(inode);
                set_bit(QDF_REFRESH, &qd->qd_flags);
        }
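
gfs2_write_buf_to_page() above also picks up the folio data helpers. The core of that write path, shown in isolation (off, buf, bytes and block_off are whatever the caller supplies):

	if (buffer_new(bh))		/* freshly allocated block: zero it first */
		folio_zero_range(folio, block_off, bh->b_size);

	memcpy_to_folio(folio, off, buf, bytes);	/* replaces memcpy_to_page() */
	flush_dcache_folio(folio);
	folio_unlock(folio);
	folio_put(folio);
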
diff --combined fs/hugetlbfs/inode.c
index da217eaba10247b40032dd40b58b68cb2846087f,3ad5fc3cb8db40ba7853e73af4cc398ee390445a..54b3d489b6a7a52f7f876632dff55da6c5fc89c8
@@@ -83,29 -83,6 +83,6 @@@ static const struct fs_parameter_spec h
        {}
  };
  
- #ifdef CONFIG_NUMA
- static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-                                       struct inode *inode, pgoff_t index)
- {
-       vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
-                                                       index);
- }
- static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
- {
-       mpol_cond_put(vma->vm_policy);
- }
- #else
- static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-                                       struct inode *inode, pgoff_t index)
- {
- }
- static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
- {
- }
- #endif
  /*
   * Mask used when checking the page offset value passed in via system
   * calls.  This value will be converted to a loff_t which is signed.
@@@ -135,7 -112,7 +112,7 @@@ static int hugetlbfs_file_mmap(struct f
        vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
        vma->vm_ops = &hugetlb_vm_ops;
  
-       ret = seal_check_future_write(info->seals, vma);
+       ret = seal_check_write(info->seals, vma);
        if (ret)
                return ret;
  
@@@ -295,7 -272,7 +272,7 @@@ static size_t adjust_range_hwpoison(str
        size_t res = 0;
  
        /* First subpage to start the loop. */
-       page += offset / PAGE_SIZE;
+       page = nth_page(page, offset / PAGE_SIZE);
        offset %= PAGE_SIZE;
        while (1) {
                if (is_raw_hwpoison_page_in_hugepage(page))
                        break;
                offset += n;
                if (offset == PAGE_SIZE) {
-                       page++;
+                       page = nth_page(page, 1);
                        offset = 0;
                }
        }
@@@ -334,7 -311,7 +311,7 @@@ static ssize_t hugetlbfs_read_iter(stru
        ssize_t retval = 0;
  
        while (iov_iter_count(to)) {
-               struct page *page;
+               struct folio *folio;
                size_t nr, copied, want;
  
                /* nr is the maximum number of bytes to copy from this page */
                }
                nr = nr - offset;
  
-               /* Find the page */
-               page = find_lock_page(mapping, index);
-               if (unlikely(page == NULL)) {
+               /* Find the folio */
+               folio = filemap_lock_hugetlb_folio(h, mapping, index);
+               if (IS_ERR(folio)) {
                        /*
                         * We have a HOLE, zero out the user-buffer for the
                         * length of the hole or request.
                         */
                        copied = iov_iter_zero(nr, to);
                } else {
-                       unlock_page(page);
+                       folio_unlock(folio);
  
-                       if (!PageHWPoison(page))
+                       if (!folio_test_has_hwpoisoned(folio))
                                want = nr;
                        else {
                                /*
                                 * touching the 1st raw HWPOISON subpage after
                                 * offset.
                                 */
-                               want = adjust_range_hwpoison(page, offset, nr);
+                               want = adjust_range_hwpoison(&folio->page, offset, nr);
                                if (want == 0) {
-                                       put_page(page);
+                                       folio_put(folio);
                                        retval = -EIO;
                                        break;
                                }
                        }
  
                        /*
-                        * We have the page, copy it to user space buffer.
+                        * We have the folio, copy it to user space buffer.
                         */
-                       copied = copy_page_to_iter(page, offset, want, to);
-                       put_page(page);
+                       copied = copy_folio_to_iter(folio, offset, want, to);
+                       folio_put(folio);
                }
                offset += copied;
                retval += copied;
@@@ -661,21 -638,20 +638,20 @@@ static void remove_inode_hugepages(stru
  {
        struct hstate *h = hstate_inode(inode);
        struct address_space *mapping = &inode->i_data;
-       const pgoff_t start = lstart >> huge_page_shift(h);
-       const pgoff_t end = lend >> huge_page_shift(h);
+       const pgoff_t end = lend >> PAGE_SHIFT;
        struct folio_batch fbatch;
        pgoff_t next, index;
        int i, freed = 0;
        bool truncate_op = (lend == LLONG_MAX);
  
        folio_batch_init(&fbatch);
-       next = start;
+       next = lstart >> PAGE_SHIFT;
        while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
                for (i = 0; i < folio_batch_count(&fbatch); ++i) {
                        struct folio *folio = fbatch.folios[i];
                        u32 hash = 0;
  
-                       index = folio->index;
+                       index = folio->index >> huge_page_order(h);
                        hash = hugetlb_fault_mutex_hash(mapping, index);
                        mutex_lock(&hugetlb_fault_mutex_table[hash]);
  
        }
  
        if (truncate_op)
-               (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
+               (void)hugetlb_unreserve_pages(inode,
+                               lstart >> huge_page_shift(h),
+                               LONG_MAX, freed);
  }
  
  static void hugetlbfs_evict_inode(struct inode *inode)
@@@ -741,7 -719,7 +719,7 @@@ static void hugetlbfs_zero_partial_page
        pgoff_t idx = start >> huge_page_shift(h);
        struct folio *folio;
  
-       folio = filemap_lock_folio(mapping, idx);
+       folio = filemap_lock_hugetlb_folio(h, mapping, idx);
        if (IS_ERR(folio))
                return;
  
@@@ -852,8 -830,7 +830,7 @@@ static long hugetlbfs_fallocate(struct 
  
        /*
         * Initialize a pseudo vma as this is required by the huge page
-        * allocation routines.  If NUMA is configured, use page index
-        * as input to create an allocation policy.
+        * allocation routines.
         */
        vma_init(&pseudo_vma, mm);
        vm_flags_init(&pseudo_vma, VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
                mutex_lock(&hugetlb_fault_mutex_table[hash]);
  
                /* See if already present in mapping to avoid alloc/free */
-               folio = filemap_get_folio(mapping, index);
+               folio = filemap_get_folio(mapping, index << huge_page_order(h));
                if (!IS_ERR(folio)) {
                        folio_put(folio);
                        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
                 * folios in these areas, we need to consume the reserves
                 * to keep reservation accounting consistent.
                 */
-               hugetlb_set_vma_policy(&pseudo_vma, inode, index);
                folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
-               hugetlb_drop_vma_policy(&pseudo_vma);
                if (IS_ERR(folio)) {
                        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
                        error = PTR_ERR(folio);
@@@ -980,7 -955,7 +955,7 @@@ static struct inode *hugetlbfs_get_root
                inode->i_mode = S_IFDIR | ctx->mode;
                inode->i_uid = ctx->uid;
                inode->i_gid = ctx->gid;
 -              inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
 +              simple_inode_init_ts(inode);
                inode->i_op = &hugetlbfs_dir_inode_operations;
                inode->i_fop = &simple_dir_operations;
                /* directory inodes start off with i_nlink == 2 (for "." entry) */
@@@ -1024,7 -999,7 +999,7 @@@ static struct inode *hugetlbfs_get_inod
                lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
                                &hugetlbfs_i_mmap_rwsem_key);
                inode->i_mapping->a_ops = &hugetlbfs_aops;
 -              inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
 +              simple_inode_init_ts(inode);
                inode->i_mapping->private_data = resv_map;
                info->seals = F_SEAL_SEAL;
                switch (mode & S_IFMT) {
@@@ -1067,7 -1042,7 +1042,7 @@@ static int hugetlbfs_mknod(struct mnt_i
        inode = hugetlbfs_get_inode(dir->i_sb, dir, mode, dev);
        if (!inode)
                return -ENOSPC;
 -      dir->i_mtime = inode_set_ctime_current(dir);
 +      inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
        d_instantiate(dentry, inode);
        dget(dentry);/* Extra count - pin the dentry in core */
        return 0;
@@@ -1099,7 -1074,7 +1074,7 @@@ static int hugetlbfs_tmpfile(struct mnt
        inode = hugetlbfs_get_inode(dir->i_sb, dir, mode | S_IFREG, 0);
        if (!inode)
                return -ENOSPC;
 -      dir->i_mtime = inode_set_ctime_current(dir);
 +      inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
        d_tmpfile(file, inode);
        return finish_open_simple(file, 0);
  }
@@@ -1121,7 -1096,7 +1096,7 @@@ static int hugetlbfs_symlink(struct mnt
                } else
                        iput(inode);
        }
 -      dir->i_mtime = inode_set_ctime_current(dir);
 +      inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
  
        return error;
  }
@@@ -1282,18 -1257,6 +1257,6 @@@ static struct inode *hugetlbfs_alloc_in
                hugetlbfs_inc_free_inodes(sbinfo);
                return NULL;
        }
-       /*
-        * Any time after allocation, hugetlbfs_destroy_inode can be called
-        * for the inode.  mpol_free_shared_policy is unconditionally called
-        * as part of hugetlbfs_destroy_inode.  So, initialize policy here
-        * in case of a quick call to destroy.
-        *
-        * Note that the policy is initialized even if we are creating a
-        * private inode.  This simplifies hugetlbfs_destroy_inode.
-        */
-       mpol_shared_policy_init(&p->policy, NULL);
        return &p->vfs_inode;
  }
  
@@@ -1305,7 -1268,6 +1268,6 @@@ static void hugetlbfs_free_inode(struc
  static void hugetlbfs_destroy_inode(struct inode *inode)
  {
        hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-       mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
  }
  
  static const struct address_space_operations hugetlbfs_aops = {
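
The hugetlbfs read path above replaces bare pointer arithmetic on struct page with nth_page(), which remains correct even when the struct pages backing a large allocation are not virtually contiguous (sparsemem without vmemmap). A small illustrative walker, not taken from the kernel:

#include <linux/mm.h>

/* Count hwpoisoned subpages of a compound page without assuming contiguity. */
static unsigned int count_poisoned_subpages(struct page *head, unsigned int nr_pages)
{
	unsigned int i, bad = 0;

	for (i = 0; i < nr_pages; i++) {
		struct page *p = nth_page(head, i);	/* not "head + i" */

		if (PageHWPoison(p))
			bad++;
	}
	return bad;
}
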
diff --combined fs/iomap/buffered-io.c
index 2bc0aa23fde3b940427b9c32533fc54f6c075e4d,5d19a2b47b6a6c058e356a51f1c1b1697b37c6ae..f72df2babe561ada4e38039093dcedb958dea312
@@@ -29,9 -29,9 +29,9 @@@ typedef int (*iomap_punch_t)(struct ino
   * and I/O completions.
   */
  struct iomap_folio_state {
-       atomic_t                read_bytes_pending;
-       atomic_t                write_bytes_pending;
        spinlock_t              state_lock;
+       unsigned int            read_bytes_pending;
+       atomic_t                write_bytes_pending;
  
        /*
         * Each block has two bits in this bitmap:
@@@ -57,30 -57,32 +57,32 @@@ static inline bool ifs_block_is_uptodat
        return test_bit(block, ifs->state);
  }
  
- static void ifs_set_range_uptodate(struct folio *folio,
+ static bool ifs_set_range_uptodate(struct folio *folio,
                struct iomap_folio_state *ifs, size_t off, size_t len)
  {
        struct inode *inode = folio->mapping->host;
        unsigned int first_blk = off >> inode->i_blkbits;
        unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
        unsigned int nr_blks = last_blk - first_blk + 1;
-       unsigned long flags;
  
-       spin_lock_irqsave(&ifs->state_lock, flags);
        bitmap_set(ifs->state, first_blk, nr_blks);
-       if (ifs_is_fully_uptodate(folio, ifs))
-               folio_mark_uptodate(folio);
-       spin_unlock_irqrestore(&ifs->state_lock, flags);
+       return ifs_is_fully_uptodate(folio, ifs);
  }
  
  static void iomap_set_range_uptodate(struct folio *folio, size_t off,
                size_t len)
  {
        struct iomap_folio_state *ifs = folio->private;
+       unsigned long flags;
+       bool uptodate = true;
  
-       if (ifs)
-               ifs_set_range_uptodate(folio, ifs, off, len);
-       else
+       if (ifs) {
+               spin_lock_irqsave(&ifs->state_lock, flags);
+               uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+               spin_unlock_irqrestore(&ifs->state_lock, flags);
+       }
+       if (uptodate)
                folio_mark_uptodate(folio);
  }
  
@@@ -181,7 -183,7 +183,7 @@@ static void ifs_free(struct folio *foli
  
        if (!ifs)
                return;
-       WARN_ON_ONCE(atomic_read(&ifs->read_bytes_pending));
+       WARN_ON_ONCE(ifs->read_bytes_pending != 0);
        WARN_ON_ONCE(atomic_read(&ifs->write_bytes_pending));
        WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
                        folio_test_uptodate(folio));
@@@ -248,20 -250,28 +250,28 @@@ static void iomap_adjust_read_range(str
        *lenp = plen;
  }
  
- static void iomap_finish_folio_read(struct folio *folio, size_t offset,
+ static void iomap_finish_folio_read(struct folio *folio, size_t off,
                size_t len, int error)
  {
        struct iomap_folio_state *ifs = folio->private;
+       bool uptodate = !error;
+       bool finished = true;
  
-       if (unlikely(error)) {
-               folio_clear_uptodate(folio);
-               folio_set_error(folio);
-       } else {
-               iomap_set_range_uptodate(folio, offset, len);
+       if (ifs) {
+               unsigned long flags;
+               spin_lock_irqsave(&ifs->state_lock, flags);
+               if (!error)
+                       uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+               ifs->read_bytes_pending -= len;
+               finished = !ifs->read_bytes_pending;
+               spin_unlock_irqrestore(&ifs->state_lock, flags);
        }
  
-       if (!ifs || atomic_sub_and_test(len, &ifs->read_bytes_pending))
-               folio_unlock(folio);
+       if (error)
+               folio_set_error(folio);
+       if (finished)
+               folio_end_read(folio, uptodate);
  }
  
  static void iomap_read_end_io(struct bio *bio)
@@@ -358,8 -368,11 +368,11 @@@ static loff_t iomap_readpage_iter(cons
        }
  
        ctx->cur_folio_in_bio = true;
-       if (ifs)
-               atomic_add(plen, &ifs->read_bytes_pending);
+       if (ifs) {
+               spin_lock_irq(&ifs->state_lock);
+               ifs->read_bytes_pending += plen;
+               spin_unlock_irq(&ifs->state_lock);
+       }
  
        sector = iomap_sector(iomap, pos);
        if (!ctx->bio ||
@@@ -881,10 -894,8 +894,10 @@@ static loff_t iomap_write_iter(struct i
                size_t bytes;           /* Bytes to write to folio */
                size_t copied;          /* Bytes copied from user */
  
 +              bytes = iov_iter_count(i);
 +retry:
                offset = pos & (chunk - 1);
 -              bytes = min(chunk - offset, iov_iter_count(i));
 +              bytes = min(chunk - offset, bytes);
                status = balance_dirty_pages_ratelimited_flags(mapping,
                                                               bdp_flags);
                if (unlikely(status))
                         * halfway through, might be a race with munmap,
                         * might be severe memory pressure.
                         */
 -                      if (copied)
 -                              bytes = copied;
                        if (chunk > PAGE_SIZE)
                                chunk /= 2;
 +                      if (copied) {
 +                              bytes = copied;
 +                              goto retry;
 +                      }
                } else {
                        pos += status;
                        written += status;
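
In the iomap read-completion hunk further up, read_bytes_pending becomes a plain integer protected by the existing state_lock, and the final step goes through folio_end_read(), which marks the folio uptodate (or not) and unlocks it in one call. A stripped-down sketch of such a completion handler under the same locking assumptions, with names matching the diff:

	bool uptodate = !error;
	bool finished = true;

	if (ifs) {
		unsigned long flags;

		spin_lock_irqsave(&ifs->state_lock, flags);
		if (!error)
			uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
		ifs->read_bytes_pending -= len;		/* the lock provides the atomicity */
		finished = !ifs->read_bytes_pending;
		spin_unlock_irqrestore(&ifs->state_lock, flags);
	}

	if (finished)
		folio_end_read(folio, uptodate);
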
diff --combined fs/nfs/super.c
index 9b1cfca8112ae2e2e0b91d6634bffa01c1d1f0c7,09ded7f63acf6ecc7603c2f3852e1de520f55292..2667ab753d42747e0b20cd2cdfcbe0710884afc0
@@@ -129,11 -129,7 +129,7 @@@ static void nfs_ssc_unregister_ops(void
  }
  #endif /* CONFIG_NFS_V4_2 */
  
- static struct shrinker acl_shrinker = {
-       .count_objects  = nfs_access_cache_count,
-       .scan_objects   = nfs_access_cache_scan,
-       .seeks          = DEFAULT_SEEKS,
- };
+ static struct shrinker *acl_shrinker;
  
  /*
   * Register the NFS filesystems
@@@ -153,9 -149,18 +149,18 @@@ int __init register_nfs_fs(void
        ret = nfs_register_sysctl();
        if (ret < 0)
                goto error_2;
-       ret = register_shrinker(&acl_shrinker, "nfs-acl");
-       if (ret < 0)
+       acl_shrinker = shrinker_alloc(0, "nfs-acl");
+       if (!acl_shrinker) {
+               ret = -ENOMEM;
                goto error_3;
+       }
+       acl_shrinker->count_objects = nfs_access_cache_count;
+       acl_shrinker->scan_objects = nfs_access_cache_scan;
+       shrinker_register(acl_shrinker);
  #ifdef CONFIG_NFS_V4_2
        nfs_ssc_register_ops();
  #endif
@@@ -175,7 -180,7 +180,7 @@@ error_0
   */
  void __exit unregister_nfs_fs(void)
  {
-       unregister_shrinker(&acl_shrinker);
+       shrinker_free(acl_shrinker);
        nfs_unregister_sysctl();
        unregister_nfs4_fs();
  #ifdef CONFIG_NFS_V4_2
@@@ -1071,7 -1076,7 +1076,7 @@@ static void nfs_fill_super(struct super
                sb->s_export_op = &nfs_export_ops;
                break;
        case 4:
 -              sb->s_flags |= SB_POSIXACL;
 +              sb->s_iflags |= SB_I_NOUMASK;
                sb->s_time_gran = 1;
                sb->s_time_min = S64_MIN;
                sb->s_time_max = S64_MAX;
diff --combined fs/nfsd/filecache.c
index 07bf219f9ae482a352c9a3fb122f06769320241a,9c62b4502539a5864da67066d4ed659c8098e804..ef063f93fde9d831e825634f6a4b80976d671747
@@@ -521,11 -521,7 +521,7 @@@ nfsd_file_lru_scan(struct shrinker *s, 
        return ret;
  }
  
- static struct shrinker        nfsd_file_shrinker = {
-       .scan_objects = nfsd_file_lru_scan,
-       .count_objects = nfsd_file_lru_count,
-       .seeks = 1,
- };
+ static struct shrinker *nfsd_file_shrinker;
  
  /**
   * nfsd_file_cond_queue - conditionally unhash and queue a nfsd_file
@@@ -746,12 -742,19 +742,19 @@@ nfsd_file_cache_init(void
                goto out_err;
        }
  
-       ret = register_shrinker(&nfsd_file_shrinker, "nfsd-filecache");
-       if (ret) {
-               pr_err("nfsd: failed to register nfsd_file_shrinker: %d\n", ret);
+       nfsd_file_shrinker = shrinker_alloc(0, "nfsd-filecache");
+       if (!nfsd_file_shrinker) {
+               ret = -ENOMEM;
+               pr_err("nfsd: failed to allocate nfsd_file_shrinker\n");
                goto out_lru;
        }
  
+       nfsd_file_shrinker->count_objects = nfsd_file_lru_count;
+       nfsd_file_shrinker->scan_objects = nfsd_file_lru_scan;
+       nfsd_file_shrinker->seeks = 1;
+       shrinker_register(nfsd_file_shrinker);
        ret = lease_register_notifier(&nfsd_file_lease_notifier);
        if (ret) {
                pr_err("nfsd: unable to register lease notifier: %d\n", ret);
@@@ -774,7 -777,7 +777,7 @@@ out
  out_notifier:
        lease_unregister_notifier(&nfsd_file_lease_notifier);
  out_shrinker:
-       unregister_shrinker(&nfsd_file_shrinker);
+       shrinker_free(nfsd_file_shrinker);
  out_lru:
        list_lru_destroy(&nfsd_file_lru);
  out_err:
@@@ -891,7 -894,7 +894,7 @@@ nfsd_file_cache_shutdown(void
                return;
  
        lease_unregister_notifier(&nfsd_file_lease_notifier);
-       unregister_shrinker(&nfsd_file_shrinker);
+       shrinker_free(nfsd_file_shrinker);
        /*
         * make sure all callers of nfsd_file_lru_cb are done before
         * calling nfsd_file_cache_purge
@@@ -989,21 -992,22 +992,21 @@@ nfsd_file_do_acquire(struct svc_rqst *r
        unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
        struct net *net = SVC_NET(rqstp);
        struct nfsd_file *new, *nf;
 -      const struct cred *cred;
 +      bool stale_retry = true;
        bool open_retry = true;
        struct inode *inode;
        __be32 status;
        int ret;
  
 +retry:
        status = fh_verify(rqstp, fhp, S_IFREG,
                                may_flags|NFSD_MAY_OWNER_OVERRIDE);
        if (status != nfs_ok)
                return status;
        inode = d_inode(fhp->fh_dentry);
 -      cred = get_current_cred();
  
 -retry:
        rcu_read_lock();
 -      nf = nfsd_file_lookup_locked(net, cred, inode, need, want_gc);
 +      nf = nfsd_file_lookup_locked(net, current_cred(), inode, need, want_gc);
        rcu_read_unlock();
  
        if (nf) {
  
        rcu_read_lock();
        spin_lock(&inode->i_lock);
 -      nf = nfsd_file_lookup_locked(net, cred, inode, need, want_gc);
 +      nf = nfsd_file_lookup_locked(net, current_cred(), inode, need, want_gc);
        if (unlikely(nf)) {
                spin_unlock(&inode->i_lock);
                rcu_read_unlock();
@@@ -1057,7 -1061,6 +1060,7 @@@ wait_for_construction
                        goto construction_err;
                }
                open_retry = false;
 +              fh_put(fhp);
                goto retry;
        }
        this_cpu_inc(nfsd_file_cache_hits);
@@@ -1074,6 -1077,7 +1077,6 @@@ out
                nfsd_file_check_write_error(nf);
                *pnf = nf;
        }
 -      put_cred(cred);
        trace_nfsd_file_acquire(rqstp, inode, may_flags, nf, status);
        return status;
  
@@@ -1087,20 -1091,8 +1090,20 @@@ open_file
                        status = nfs_ok;
                        trace_nfsd_file_opened(nf, status);
                } else {
 -                      status = nfsd_open_verified(rqstp, fhp, may_flags,
 -                                                  &nf->nf_file);
 +                      ret = nfsd_open_verified(rqstp, fhp, may_flags,
 +                                               &nf->nf_file);
 +                      if (ret == -EOPENSTALE && stale_retry) {
 +                              stale_retry = false;
 +                              nfsd_file_unhash(nf);
 +                              clear_and_wake_up_bit(NFSD_FILE_PENDING,
 +                                                    &nf->nf_flags);
 +                              if (refcount_dec_and_test(&nf->nf_ref))
 +                                      nfsd_file_free(nf);
 +                              nf = NULL;
 +                              fh_put(fhp);
 +                              goto retry;
 +                      }
 +                      status = nfserrno(ret);
                        trace_nfsd_file_open(nf, status);
                }
        } else
diff --combined fs/nfsd/nfs4state.c
index 65fd5510323a3a76845308a8d5d21460dd78a578,23b3b38c8cda7d54f20db48ea4115cfb8b2d6302..4045c852a450e7ab172f26ecfcfc5e2805bfb6df
@@@ -59,7 -59,7 +59,7 @@@
  
  #define NFSDDBG_FACILITY                NFSDDBG_PROC
  
 -#define all_ones {{~0,~0},~0}
 +#define all_ones {{ ~0, ~0}, ~0}
  static const stateid_t one_stateid = {
        .si_generation = ~0,
        .si_opaque = all_ones,
@@@ -127,7 -127,6 +127,7 @@@ static void free_session(struct nfsd4_s
  
  static const struct nfsd4_callback_ops nfsd4_cb_recall_ops;
  static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops;
 +static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops;
  
  static struct workqueue_struct *laundry_wq;
  
@@@ -298,7 -297,7 +298,7 @@@ find_or_allocate_block(struct nfs4_lock
  
        nbl = find_blocked_lock(lo, fh, nn);
        if (!nbl) {
 -              nbl= kmalloc(sizeof(*nbl), GFP_KERNEL);
 +              nbl = kmalloc(sizeof(*nbl), GFP_KERNEL);
                if (nbl) {
                        INIT_LIST_HEAD(&nbl->nbl_list);
                        INIT_LIST_HEAD(&nbl->nbl_lru);
@@@ -1160,7 -1159,6 +1160,7 @@@ alloc_init_deleg(struct nfs4_client *cl
                 struct nfs4_clnt_odstate *odstate, u32 dl_type)
  {
        struct nfs4_delegation *dp;
 +      struct nfs4_stid *stid;
        long n;
  
        dprintk("NFSD alloc_init_deleg\n");
                goto out_dec;
        if (delegation_blocked(&fp->fi_fhandle))
                goto out_dec;
 -      dp = delegstateid(nfs4_alloc_stid(clp, deleg_slab, nfs4_free_deleg));
 -      if (dp == NULL)
 +      stid = nfs4_alloc_stid(clp, deleg_slab, nfs4_free_deleg);
 +      if (stid == NULL)
                goto out_dec;
 +      dp = delegstateid(stid);
  
        /*
         * delegation seqid's are never incremented.  The 4.1 special
        dp->dl_recalled = false;
        nfsd4_init_cb(&dp->dl_recall, dp->dl_stid.sc_client,
                      &nfsd4_cb_recall_ops, NFSPROC4_CLNT_CB_RECALL);
 +      nfsd4_init_cb(&dp->dl_cb_fattr.ncf_getattr, dp->dl_stid.sc_client,
 +                      &nfsd4_cb_getattr_ops, NFSPROC4_CLNT_CB_GETATTR);
 +      dp->dl_cb_fattr.ncf_file_modified = false;
 +      dp->dl_cb_fattr.ncf_cb_bmap[0] = FATTR4_WORD0_CHANGE | FATTR4_WORD0_SIZE;
        get_nfs4_file(fp);
        dp->dl_stid.sc_file = fp;
        return dp;
@@@ -2901,56 -2894,11 +2901,56 @@@ nfsd4_cb_recall_any_release(struct nfsd
        spin_unlock(&nn->client_lock);
  }
  
 +static int
 +nfsd4_cb_getattr_done(struct nfsd4_callback *cb, struct rpc_task *task)
 +{
 +      struct nfs4_cb_fattr *ncf =
 +                      container_of(cb, struct nfs4_cb_fattr, ncf_getattr);
 +
 +      ncf->ncf_cb_status = task->tk_status;
 +      switch (task->tk_status) {
 +      case -NFS4ERR_DELAY:
 +              rpc_delay(task, 2 * HZ);
 +              return 0;
 +      default:
 +              return 1;
 +      }
 +}
 +
 +static void
 +nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
 +{
 +      struct nfs4_cb_fattr *ncf =
 +                      container_of(cb, struct nfs4_cb_fattr, ncf_getattr);
 +      struct nfs4_delegation *dp =
 +                      container_of(ncf, struct nfs4_delegation, dl_cb_fattr);
 +
 +      nfs4_put_stid(&dp->dl_stid);
 +      clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags);
 +      wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY);
 +}
 +
  static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
        .done           = nfsd4_cb_recall_any_done,
        .release        = nfsd4_cb_recall_any_release,
  };
  
 +static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops = {
 +      .done           = nfsd4_cb_getattr_done,
 +      .release        = nfsd4_cb_getattr_release,
 +};
 +
 +void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
 +{
 +      struct nfs4_delegation *dp =
 +                      container_of(ncf, struct nfs4_delegation, dl_cb_fattr);
 +
 +      if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
 +              return;
 +      refcount_inc(&dp->dl_stid.sc_count);
 +      nfsd4_run_cb(&ncf->ncf_getattr);
 +}
 +
  static struct nfs4_client *create_client(struct xdr_netobj name,
                struct svc_rqst *rqstp, nfs4_verifier *verf)
  {
@@@ -4452,8 -4400,7 +4452,7 @@@ static unsigned lon
  nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
  {
        int count;
-       struct nfsd_net *nn = container_of(shrink,
-                       struct nfsd_net, nfsd_client_shrinker);
+       struct nfsd_net *nn = shrink->private_data;
  
        count = atomic_read(&nn->nfsd_courtesy_clients);
        if (!count)
@@@ -5686,15 -5633,13 +5685,15 @@@ nfs4_open_delegation(struct nfsd4_open 
        struct svc_fh *parent = NULL;
        int cb_up;
        int status = 0;
 +      struct kstat stat;
 +      struct path path;
  
        cb_up = nfsd4_cb_channel_good(oo->oo_owner.so_client);
 -      open->op_recall = 0;
 +      open->op_recall = false;
        switch (open->op_claim_type) {
                case NFS4_OPEN_CLAIM_PREVIOUS:
                        if (!cb_up)
 -                              open->op_recall = 1;
 +                              open->op_recall = true;
                        break;
                case NFS4_OPEN_CLAIM_NULL:
                        parent = currentfh;
        if (open->op_share_access & NFS4_SHARE_ACCESS_WRITE) {
                open->op_delegate_type = NFS4_OPEN_DELEGATE_WRITE;
                trace_nfsd_deleg_write(&dp->dl_stid.sc_stateid);
 +              path.mnt = currentfh->fh_export->ex_path.mnt;
 +              path.dentry = currentfh->fh_dentry;
 +              if (vfs_getattr(&path, &stat,
 +                              (STATX_SIZE | STATX_CTIME | STATX_CHANGE_COOKIE),
 +                              AT_STATX_SYNC_AS_STAT)) {
 +                      nfs4_put_stid(&dp->dl_stid);
 +                      destroy_delegation(dp);
 +                      goto out_no_deleg;
 +              }
 +              dp->dl_cb_fattr.ncf_cur_fsize = stat.size;
 +              dp->dl_cb_fattr.ncf_initial_cinfo =
 +                      nfsd4_change_attribute(&stat, d_inode(currentfh->fh_dentry));
        } else {
                open->op_delegate_type = NFS4_OPEN_DELEGATE_READ;
                trace_nfsd_deleg_read(&dp->dl_stid.sc_stateid);
@@@ -5748,7 -5681,7 +5747,7 @@@ out_no_deleg
        if (open->op_claim_type == NFS4_OPEN_CLAIM_PREVIOUS &&
            open->op_delegate_type != NFS4_OPEN_DELEGATE_NONE) {
                dprintk("NFSD: WARNING: refusing delegation reclaim\n");
 -              open->op_recall = 1;
 +              open->op_recall = true;
        }
  
        /* 4.1 client asking for a delegation? */
@@@ -7553,7 -7486,6 +7552,7 @@@ nfsd4_lock(struct svc_rqst *rqstp, stru
        struct nfsd4_blocked_lock *nbl = NULL;
        struct file_lock *file_lock = NULL;
        struct file_lock *conflock = NULL;
 +      struct super_block *sb;
        __be32 status = 0;
        int lkflg;
        int err;
                dprintk("NFSD: nfsd4_lock: permission denied!\n");
                return status;
        }
 +      sb = cstate->current_fh.fh_dentry->d_sb;
  
        if (lock->lk_is_new) {
                if (nfsd4_has_session(cstate))
        fp = lock_stp->st_stid.sc_file;
        switch (lock->lk_type) {
                case NFS4_READW_LT:
 -                      if (nfsd4_has_session(cstate))
 +                      if (nfsd4_has_session(cstate) ||
 +                          exportfs_lock_op_is_async(sb->s_export_op))
                                fl_flags |= FL_SLEEP;
                        fallthrough;
                case NFS4_READ_LT:
                        fl_type = F_RDLCK;
                        break;
                case NFS4_WRITEW_LT:
 -                      if (nfsd4_has_session(cstate))
 +                      if (nfsd4_has_session(cstate) ||
 +                          exportfs_lock_op_is_async(sb->s_export_op))
                                fl_flags |= FL_SLEEP;
                        fallthrough;
                case NFS4_WRITE_LT:
         * for file locks), so don't attempt blocking lock notifications
         * on those filesystems:
         */
 -      if (nf->nf_file->f_op->lock)
 +      if (!exportfs_lock_op_is_async(sb->s_export_op))
                fl_flags &= ~FL_SLEEP;
  
        nbl = find_or_allocate_block(lock_sop, &fp->fi_fhandle, nn);
        return status;
  }
  
 +void nfsd4_lock_release(union nfsd4_op_u *u)
 +{
 +      struct nfsd4_lock *lock = &u->lock;
 +      struct nfsd4_lock_denied *deny = &lock->lk_denied;
 +
 +      kfree(deny->ld_owner.data);
 +}
 +
  /*
   * The NFSv4 spec allows a client to do a LOCKT without holding an OPEN,
   * so we do a temporary open here just to get an open file to pass to
        return status;
  }
  
 +void nfsd4_lockt_release(union nfsd4_op_u *u)
 +{
 +      struct nfsd4_lockt *lockt = &u->lockt;
 +      struct nfsd4_lock_denied *deny = &lockt->lt_denied;
 +
 +      kfree(deny->ld_owner.data);
 +}
 +
  __be32
  nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
            union nfsd4_op_u *u)
@@@ -8235,12 -8148,16 +8234,16 @@@ static int nfs4_state_create_net(struc
        INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
        get_net(net);
  
-       nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
-       nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
-       nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
-       if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
+       nn->nfsd_client_shrinker = shrinker_alloc(0, "nfsd-client");
+       if (!nn->nfsd_client_shrinker)
                goto err_shrinker;
+       nn->nfsd_client_shrinker->scan_objects = nfsd4_state_shrinker_scan;
+       nn->nfsd_client_shrinker->count_objects = nfsd4_state_shrinker_count;
+       nn->nfsd_client_shrinker->private_data = nn;
+       shrinker_register(nn->nfsd_client_shrinker);
        return 0;
  
  err_shrinker:
@@@ -8338,7 -8255,7 +8341,7 @@@ nfs4_state_shutdown_net(struct net *net
        struct list_head *pos, *next, reaplist;
        struct nfsd_net *nn = net_generic(net, nfsd_net_id);
  
-       unregister_shrinker(&nn->nfsd_client_shrinker);
+       shrinker_free(nn->nfsd_client_shrinker);
        cancel_work(&nn->nfsd_shrinker_work);
        cancel_delayed_work_sync(&nn->laundromat_work);
        locks_end_grace(&nn->nfsd4_manager);
@@@ -8489,8 -8406,6 +8492,8 @@@ nfsd4_get_writestateid(struct nfsd4_com
   * nfsd4_deleg_getattr_conflict - Recall if GETATTR causes conflict
   * @rqstp: RPC transaction context
   * @inode: file to be checked for a conflict
 + * @modified: return true if file was modified
 + * @size: new size of file if modified is true
   *
   * This function is called when there is a conflict between a write
   * delegation and a change/size GETATTR from another client. The server
   * delegation before replying to the GETATTR. See RFC 8881 section
   * 18.7.4.
   *
 - * The current implementation does not support CB_GETATTR yet. However
 - * this can avoid recalling the delegation could be added in follow up
 - * work.
 - *
   * Returns 0 if there is no conflict; otherwise an nfs_stat
   * code is returned.
   */
  __be32
 -nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode)
 +nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
 +                           bool *modified, u64 *size)
  {
 -      __be32 status;
        struct file_lock_context *ctx;
 -      struct file_lock *fl;
        struct nfs4_delegation *dp;
 +      struct nfs4_cb_fattr *ncf;
 +      struct file_lock *fl;
 +      struct iattr attrs;
 +      __be32 status;
  
 +      might_sleep();
 +
 +      *modified = false;
        ctx = locks_inode_context(inode);
        if (!ctx)
                return 0;
  break_lease:
                        spin_unlock(&ctx->flc_lock);
                        nfsd_stats_wdeleg_getattr_inc();
 -                      status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
 -                      if (status != nfserr_jukebox ||
 -                                      !nfsd_wait_for_delegreturn(rqstp, inode))
 -                              return status;
 +
 +                      dp = fl->fl_owner;
 +                      ncf = &dp->dl_cb_fattr;
 +                      nfs4_cb_getattr(&dp->dl_cb_fattr);
 +                      wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
 +                      if (ncf->ncf_cb_status) {
 +                              status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
 +                              if (status != nfserr_jukebox ||
 +                                              !nfsd_wait_for_delegreturn(rqstp, inode))
 +                                      return status;
 +                      }
 +                      if (!ncf->ncf_file_modified &&
 +                                      (ncf->ncf_initial_cinfo != ncf->ncf_cb_change ||
 +                                      ncf->ncf_cur_fsize != ncf->ncf_cb_fsize))
 +                              ncf->ncf_file_modified = true;
 +                      if (ncf->ncf_file_modified) {
 +                              /*
 +                               * The server would not update the file's metadata
 +                               * with the client's modified size.
 +                               */
 +                              attrs.ia_mtime = attrs.ia_ctime = current_time(inode);
 +                              attrs.ia_valid = ATTR_MTIME | ATTR_CTIME;
 +                              setattr_copy(&nop_mnt_idmap, inode, &attrs);
 +                              mark_inode_dirty(inode);
 +                              ncf->ncf_cur_fsize = ncf->ncf_cb_fsize;
 +                              *size = ncf->ncf_cur_fsize;
 +                              *modified = true;
 +                      }
                        return 0;
                }
                break;
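
The CB_GETATTR plumbing above serializes callbacks with a busy bit: the sender takes the bit, the GETATTR conflict path sleeps on it, and the callback's release handler clears it and wakes any waiter. A generic sketch of that bit-based handshake (EXAMPLE_BUSY, obj and send_callback() are illustrative names, not nfsd symbols):

	/* submit side: at most one callback in flight per object */
	if (test_and_set_bit(EXAMPLE_BUSY, &obj->flags))
		return;				/* one is already pending */
	send_callback(obj);

	/* waiter: block until the callback completes */
	wait_on_bit(&obj->flags, EXAMPLE_BUSY, TASK_INTERRUPTIBLE);

	/* completion/release side: drop the bit and wake waiters */
	clear_and_wake_up_bit(EXAMPLE_BUSY, &obj->flags);
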
diff --combined fs/ntfs3/file.c
index ad4a70b5d4321f1ae39dc1c49cbc44add8f56fbd,66fd4ac28395f5e311a8a07fb829405294169929..a5a30a24ce5dfa70d670826d1b5ac16a668d06be
@@@ -187,7 -187,7 +187,7 @@@ static int ntfs_zero_range(struct inod
        struct buffer_head *head, *bh;
        u32 bh_next, bh_off, to;
        sector_t iblock;
-       struct page *page;
+       struct folio *folio;
  
        for (; idx < idx_end; idx += 1, from = 0) {
                page_off = (loff_t)idx << PAGE_SHIFT;
                                                       PAGE_SIZE;
                iblock = page_off >> inode->i_blkbits;
  
-               page = find_or_create_page(mapping, idx,
-                                          mapping_gfp_constraint(mapping,
-                                                                 ~__GFP_FS));
-               if (!page)
-                       return -ENOMEM;
+               folio = __filemap_get_folio(mapping, idx,
+                               FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+                               mapping_gfp_constraint(mapping, ~__GFP_FS));
+               if (IS_ERR(folio))
+                       return PTR_ERR(folio);
  
-               if (!page_has_buffers(page))
-                       create_empty_buffers(page, blocksize, 0);
+               head = folio_buffers(folio);
+               if (!head)
+                       head = create_empty_buffers(folio, blocksize, 0);
  
-               bh = head = page_buffers(page);
+               bh = head;
                bh_off = 0;
                do {
                        bh_next = bh_off + blocksize;
                        }
  
                        /* Ok, it's mapped. Make sure it's up-to-date. */
-                       if (PageUptodate(page))
+                       if (folio_test_uptodate(folio))
                                set_buffer_uptodate(bh);
  
                        if (!buffer_uptodate(bh)) {
                                err = bh_read(bh, 0);
                                if (err < 0) {
-                                       unlock_page(page);
-                                       put_page(page);
+                                       folio_unlock(folio);
+                                       folio_put(folio);
                                        goto out;
                                }
                        }
                } while (bh_off = bh_next, iblock += 1,
                         head != (bh = bh->b_this_page));
  
-               zero_user_segment(page, from, to);
+               folio_zero_segment(folio, from, to);
  
-               unlock_page(page);
-               put_page(page);
+               folio_unlock(folio);
+               folio_put(folio);
                cond_resched();
        }
  out:
@@@ -342,7 -343,7 +343,7 @@@ static int ntfs_extend(struct inode *in
                err = 0;
        }
  
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        mark_inode_dirty(inode);
  
        if (IS_SYNC(inode)) {
@@@ -400,7 -401,7 +401,7 @@@ static int ntfs_truncate(struct inode *
        ni_unlock(ni);
  
        ni->std_fa |= FILE_ATTRIBUTE_ARCHIVE;
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        if (!IS_DIRSYNC(inode)) {
                dirty = 1;
        } else {
@@@ -642,7 -643,7 +643,7 @@@ out
                filemap_invalidate_unlock(mapping);
  
        if (!err) {
 -              inode->i_mtime = inode_set_ctime_current(inode);
 +              inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
                mark_inode_dirty(inode);
        }
  
@@@ -745,8 -746,8 +746,8 @@@ static ssize_t ntfs_file_read_iter(stru
  }
  
  static ssize_t ntfs_file_splice_read(struct file *in, loff_t *ppos,
 -                                   struct pipe_inode_info *pipe,
 -                                   size_t len, unsigned int flags)
 +                                   struct pipe_inode_info *pipe, size_t len,
 +                                   unsigned int flags)
  {
        struct inode *inode = in->f_mapping->host;
        struct ntfs_inode *ni = ntfs_i(inode);
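
ntfs_zero_range() above swaps find_or_create_page() for __filemap_get_folio(), so the lookup-or-create step takes explicit FGP flags and reports failure as an ERR_PTR. The call shape in isolation:

	folio = __filemap_get_folio(mapping, index,
			FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
			mapping_gfp_constraint(mapping, ~__GFP_FS));
	if (IS_ERR(folio))
		return PTR_ERR(folio);		/* was a NULL check returning -ENOMEM */
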
diff --combined fs/ocfs2/aops.c
index 6ab03494fc6e7d9063edbb0664176d1f2576c258,a6405dd5df09d80c909ae736072697de53252633..ba790219d528e1997c84e92ff3e297b6756fa69f
@@@ -568,10 -568,10 +568,10 @@@ static void ocfs2_clear_page_regions(st
   * read-in the blocks at the tail of our file. Avoid reading them by
   * testing i_size against each block offset.
   */
- static int ocfs2_should_read_blk(struct inode *inode, struct page *page,
+ static int ocfs2_should_read_blk(struct inode *inode, struct folio *folio,
                                 unsigned int block_start)
  {
-       u64 offset = page_offset(page) + block_start;
+       u64 offset = folio_pos(folio) + block_start;
  
        if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
                return 1;
@@@ -593,15 -593,16 +593,16 @@@ int ocfs2_map_page_blocks(struct page *
                          struct inode *inode, unsigned int from,
                          unsigned int to, int new)
  {
+       struct folio *folio = page_folio(page);
        int ret = 0;
        struct buffer_head *head, *bh, *wait[2], **wait_bh = wait;
        unsigned int block_end, block_start;
        unsigned int bsize = i_blocksize(inode);
  
-       if (!page_has_buffers(page))
-               create_empty_buffers(page, bsize, 0);
+       head = folio_buffers(folio);
+       if (!head)
+               head = create_empty_buffers(folio, bsize, 0);
  
-       head = page_buffers(page);
        for (bh = head, block_start = 0; bh != head || !block_start;
             bh = bh->b_this_page, block_start += bsize) {
                block_end = block_start + bsize;
                 * they may belong to unallocated clusters.
                 */
                if (block_start >= to || block_end <= from) {
-                       if (PageUptodate(page))
+                       if (folio_test_uptodate(folio))
                                set_buffer_uptodate(bh);
                        continue;
                }
                        clean_bdev_bh_alias(bh);
                }
  
-               if (PageUptodate(page)) {
+               if (folio_test_uptodate(folio)) {
                        set_buffer_uptodate(bh);
                } else if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
                           !buffer_new(bh) &&
-                          ocfs2_should_read_blk(inode, page, block_start) &&
+                          ocfs2_should_read_blk(inode, folio, block_start) &&
                           (block_start < from || block_end > to)) {
                        bh_read_nowait(bh, 0);
                        *wait_bh++=bh;
                if (block_start >= to)
                        break;
  
-               zero_user(page, block_start, bh->b_size);
+               folio_zero_range(folio, block_start, bh->b_size);
                set_buffer_uptodate(bh);
                mark_buffer_dirty(bh);
  
@@@ -2048,9 -2049,9 +2049,9 @@@ out_write_size
                }
                inode->i_blocks = ocfs2_inode_sector_count(inode);
                di->i_size = cpu_to_le64((u64)i_size_read(inode));
 -              inode->i_mtime = inode_set_ctime_current(inode);
 -              di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec);
 -              di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec);
 +              inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
 +              di->i_mtime = di->i_ctime = cpu_to_le64(inode_get_mtime_sec(inode));
 +              di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode_get_mtime_nsec(inode));
                if (handle)
                        ocfs2_update_inode_fsync_trans(handle, inode, 1);
        }
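
The ocfs2 hunk above, like most filesystems in this diff, stops writing inode->i_mtime directly and goes through the timestamp accessors. The idiom on its own:

	/* set ctime to "now" and mirror the same timestamp into mtime */
	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));

Reads likewise use the accessors, e.g. inode_get_mtime_sec() and inode_get_mtime_nsec() when copying the timestamp into on-disk fields.
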
diff --combined fs/proc/task_mmu.c
index 1593940ca01ee5ca20d84cc2da74e7b13aea512d,66ae1c265da34b5e534885d41e04426cf5e0ef19..4abd51053f76d92d0998a4e07f5e021a263c1026
@@@ -20,6 -20,8 +20,8 @@@
  #include <linux/shmem_fs.h>
  #include <linux/uaccess.h>
  #include <linux/pkeys.h>
+ #include <linux/minmax.h>
+ #include <linux/overflow.h>
  
  #include <asm/elf.h>
  #include <asm/tlb.h>
@@@ -296,7 -298,7 +298,7 @@@ show_map_vma(struct seq_file *m, struc
                if (anon_name)
                        seq_printf(m, "[anon_shmem:%s]", anon_name->name);
                else
 -                      seq_file_path(m, file, "\n");
 +                      seq_path(m, file_user_path(file), "\n");
                goto done;
        }
  
@@@ -1761,11 -1763,737 +1763,737 @@@ static int pagemap_release(struct inod
        return 0;
  }
  
+ #define PM_SCAN_CATEGORIES    (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN |  \
+                                PAGE_IS_FILE | PAGE_IS_PRESENT |       \
+                                PAGE_IS_SWAPPED | PAGE_IS_PFNZERO |    \
+                                PAGE_IS_HUGE)
+ #define PM_SCAN_FLAGS         (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
+ struct pagemap_scan_private {
+       struct pm_scan_arg arg;
+       unsigned long masks_of_interest, cur_vma_category;
+       struct page_region *vec_buf;
+       unsigned long vec_buf_len, vec_buf_index, found_pages;
+       struct page_region __user *vec_out;
+ };
+ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
+                                          struct vm_area_struct *vma,
+                                          unsigned long addr, pte_t pte)
+ {
+       unsigned long categories = 0;
+       if (pte_present(pte)) {
+               struct page *page;
+               categories |= PAGE_IS_PRESENT;
+               if (!pte_uffd_wp(pte))
+                       categories |= PAGE_IS_WRITTEN;
+               if (p->masks_of_interest & PAGE_IS_FILE) {
+                       page = vm_normal_page(vma, addr, pte);
+                       if (page && !PageAnon(page))
+                               categories |= PAGE_IS_FILE;
+               }
+               if (is_zero_pfn(pte_pfn(pte)))
+                       categories |= PAGE_IS_PFNZERO;
+       } else if (is_swap_pte(pte)) {
+               swp_entry_t swp;
+               categories |= PAGE_IS_SWAPPED;
+               if (!pte_swp_uffd_wp_any(pte))
+                       categories |= PAGE_IS_WRITTEN;
+               if (p->masks_of_interest & PAGE_IS_FILE) {
+                       swp = pte_to_swp_entry(pte);
+                       if (is_pfn_swap_entry(swp) &&
+                           !PageAnon(pfn_swap_entry_to_page(swp)))
+                               categories |= PAGE_IS_FILE;
+               }
+       }
+       return categories;
+ }
+ static void make_uffd_wp_pte(struct vm_area_struct *vma,
+                            unsigned long addr, pte_t *pte)
+ {
+       pte_t ptent = ptep_get(pte);
+       if (pte_present(ptent)) {
+               pte_t old_pte;
+               old_pte = ptep_modify_prot_start(vma, addr, pte);
+               ptent = pte_mkuffd_wp(ptent);
+               ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+       } else if (is_swap_pte(ptent)) {
+               ptent = pte_swp_mkuffd_wp(ptent);
+               set_pte_at(vma->vm_mm, addr, pte, ptent);
+       } else {
+               set_pte_at(vma->vm_mm, addr, pte,
+                          make_pte_marker(PTE_MARKER_UFFD_WP));
+       }
+ }
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
+                                         struct vm_area_struct *vma,
+                                         unsigned long addr, pmd_t pmd)
+ {
+       unsigned long categories = PAGE_IS_HUGE;
+       if (pmd_present(pmd)) {
+               struct page *page;
+               categories |= PAGE_IS_PRESENT;
+               if (!pmd_uffd_wp(pmd))
+                       categories |= PAGE_IS_WRITTEN;
+               if (p->masks_of_interest & PAGE_IS_FILE) {
+                       page = vm_normal_page_pmd(vma, addr, pmd);
+                       if (page && !PageAnon(page))
+                               categories |= PAGE_IS_FILE;
+               }
+               if (is_zero_pfn(pmd_pfn(pmd)))
+                       categories |= PAGE_IS_PFNZERO;
+       } else if (is_swap_pmd(pmd)) {
+               swp_entry_t swp;
+               categories |= PAGE_IS_SWAPPED;
+               if (!pmd_swp_uffd_wp(pmd))
+                       categories |= PAGE_IS_WRITTEN;
+               if (p->masks_of_interest & PAGE_IS_FILE) {
+                       swp = pmd_to_swp_entry(pmd);
+                       if (is_pfn_swap_entry(swp) &&
+                           !PageAnon(pfn_swap_entry_to_page(swp)))
+                               categories |= PAGE_IS_FILE;
+               }
+       }
+       return categories;
+ }
+ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
+                            unsigned long addr, pmd_t *pmdp)
+ {
+       pmd_t old, pmd = *pmdp;
+       if (pmd_present(pmd)) {
+               old = pmdp_invalidate_ad(vma, addr, pmdp);
+               pmd = pmd_mkuffd_wp(old);
+               set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+       } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+               pmd = pmd_swp_mkuffd_wp(pmd);
+               set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+       }
+ }
+ #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+ #ifdef CONFIG_HUGETLB_PAGE
+ static unsigned long pagemap_hugetlb_category(pte_t pte)
+ {
+       unsigned long categories = PAGE_IS_HUGE;
+       /*
+        * According to pagemap_hugetlb_range(), a file-backed HugeTLB
+        * page cannot be swapped, so PAGE_IS_FILE is not checked for
+        * swapped pages.
+        */
+       if (pte_present(pte)) {
+               categories |= PAGE_IS_PRESENT;
+               if (!huge_pte_uffd_wp(pte))
+                       categories |= PAGE_IS_WRITTEN;
+               if (!PageAnon(pte_page(pte)))
+                       categories |= PAGE_IS_FILE;
+               if (is_zero_pfn(pte_pfn(pte)))
+                       categories |= PAGE_IS_PFNZERO;
+       } else if (is_swap_pte(pte)) {
+               categories |= PAGE_IS_SWAPPED;
+               if (!pte_swp_uffd_wp_any(pte))
+                       categories |= PAGE_IS_WRITTEN;
+       }
+       return categories;
+ }
+ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+                                 unsigned long addr, pte_t *ptep,
+                                 pte_t ptent)
+ {
+       unsigned long psize;
+       if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent))
+               return;
+       psize = huge_page_size(hstate_vma(vma));
+       if (is_hugetlb_entry_migration(ptent))
+               set_huge_pte_at(vma->vm_mm, addr, ptep,
+                               pte_swp_mkuffd_wp(ptent), psize);
+       else if (!huge_pte_none(ptent))
+               huge_ptep_modify_prot_commit(vma, addr, ptep, ptent,
+                                            huge_pte_mkuffd_wp(ptent));
+       else
+               set_huge_pte_at(vma->vm_mm, addr, ptep,
+                               make_pte_marker(PTE_MARKER_UFFD_WP), psize);
+ }
+ #endif /* CONFIG_HUGETLB_PAGE */
+ #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
+ static void pagemap_scan_backout_range(struct pagemap_scan_private *p,
+                                      unsigned long addr, unsigned long end)
+ {
+       struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
+       if (cur_buf->start != addr)
+               cur_buf->end = addr;
+       else
+               cur_buf->start = cur_buf->end = 0;
+       p->found_pages -= (end - addr) / PAGE_SIZE;
+ }
+ #endif
+ static bool pagemap_scan_is_interesting_page(unsigned long categories,
+                                            const struct pagemap_scan_private *p)
+ {
+       categories ^= p->arg.category_inverted;
+       if ((categories & p->arg.category_mask) != p->arg.category_mask)
+               return false;
+       if (p->arg.category_anyof_mask && !(categories & p->arg.category_anyof_mask))
+               return false;
+       return true;
+ }
+ static bool pagemap_scan_is_interesting_vma(unsigned long categories,
+                                           const struct pagemap_scan_private *p)
+ {
+       unsigned long required = p->arg.category_mask & PAGE_IS_WPALLOWED;
+       categories ^= p->arg.category_inverted;
+       if ((categories & required) != required)
+               return false;
+       return true;
+ }
+ static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+                                 struct mm_walk *walk)
+ {
+       struct pagemap_scan_private *p = walk->private;
+       struct vm_area_struct *vma = walk->vma;
+       unsigned long vma_category = 0;
+       if (userfaultfd_wp_async(vma) && userfaultfd_wp_use_markers(vma))
+               vma_category |= PAGE_IS_WPALLOWED;
+       else if (p->arg.flags & PM_SCAN_CHECK_WPASYNC)
+               return -EPERM;
+       if (vma->vm_flags & VM_PFNMAP)
+               return 1;
+       if (!pagemap_scan_is_interesting_vma(vma_category, p))
+               return 1;
+       p->cur_vma_category = vma_category;
+       return 0;
+ }
+ static bool pagemap_scan_push_range(unsigned long categories,
+                                   struct pagemap_scan_private *p,
+                                   unsigned long addr, unsigned long end)
+ {
+       struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
+       /*
+        * When there is no output buffer provided at all, the sentinel values
+        * won't match here, since `cur_buf->end` can only be non-zero when
+        * the entry is non-empty.
+        */
+       if (addr == cur_buf->end && categories == cur_buf->categories) {
+               cur_buf->end = end;
+               return true;
+       }
+       if (cur_buf->end) {
+               if (p->vec_buf_index >= p->vec_buf_len - 1)
+                       return false;
+               cur_buf = &p->vec_buf[++p->vec_buf_index];
+       }
+       cur_buf->start = addr;
+       cur_buf->end = end;
+       cur_buf->categories = categories;
+       return true;
+ }
+ static int pagemap_scan_output(unsigned long categories,
+                              struct pagemap_scan_private *p,
+                              unsigned long addr, unsigned long *end)
+ {
+       unsigned long n_pages, total_pages;
+       int ret = 0;
+       if (!p->vec_buf)
+               return 0;
+       categories &= p->arg.return_mask;
+       n_pages = (*end - addr) / PAGE_SIZE;
+       if (check_add_overflow(p->found_pages, n_pages, &total_pages) ||
+           total_pages > p->arg.max_pages) {
+               size_t n_too_much = total_pages - p->arg.max_pages;
+               *end -= n_too_much * PAGE_SIZE;
+               n_pages -= n_too_much;
+               ret = -ENOSPC;
+       }
+       if (!pagemap_scan_push_range(categories, p, addr, *end)) {
+               *end = addr;
+               n_pages = 0;
+               ret = -ENOSPC;
+       }
+       p->found_pages += n_pages;
+       if (ret)
+               p->arg.walk_end = *end;
+       return ret;
+ }
+ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
+                                 unsigned long end, struct mm_walk *walk)
+ {
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+       struct pagemap_scan_private *p = walk->private;
+       struct vm_area_struct *vma = walk->vma;
+       unsigned long categories;
+       spinlock_t *ptl;
+       int ret = 0;
+       ptl = pmd_trans_huge_lock(pmd, vma);
+       if (!ptl)
+               return -ENOENT;
+       categories = p->cur_vma_category |
+                    pagemap_thp_category(p, vma, start, *pmd);
+       if (!pagemap_scan_is_interesting_page(categories, p))
+               goto out_unlock;
+       ret = pagemap_scan_output(categories, p, start, &end);
+       if (start == end)
+               goto out_unlock;
+       if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+               goto out_unlock;
+       if (~categories & PAGE_IS_WRITTEN)
+               goto out_unlock;
+       /*
+        * Break huge page into small pages if the WP operation
+        * needs to be performed on a portion of the huge page.
+        */
+       if (end != start + HPAGE_SIZE) {
+               spin_unlock(ptl);
+               split_huge_pmd(vma, pmd, start);
+               pagemap_scan_backout_range(p, start, end);
+               /* Report as if there was no THP */
+               return -ENOENT;
+       }
+       make_uffd_wp_pmd(vma, start, pmd);
+       flush_tlb_range(vma, start, end);
+ out_unlock:
+       spin_unlock(ptl);
+       return ret;
+ #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+       return -ENOENT;
+ #endif
+ }
+ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+                                 unsigned long end, struct mm_walk *walk)
+ {
+       struct pagemap_scan_private *p = walk->private;
+       struct vm_area_struct *vma = walk->vma;
+       unsigned long addr, flush_end = 0;
+       pte_t *pte, *start_pte;
+       spinlock_t *ptl;
+       int ret;
+       arch_enter_lazy_mmu_mode();
+       ret = pagemap_scan_thp_entry(pmd, start, end, walk);
+       if (ret != -ENOENT) {
+               arch_leave_lazy_mmu_mode();
+               return ret;
+       }
+       ret = 0;
+       start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+       if (!pte) {
+               arch_leave_lazy_mmu_mode();
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
+       if (!p->vec_out) {
+               /* Fast path for performing exclusive WP */
+               for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
+                       if (pte_uffd_wp(ptep_get(pte)))
+                               continue;
+                       make_uffd_wp_pte(vma, addr, pte);
+                       if (!flush_end)
+                               start = addr;
+                       flush_end = addr + PAGE_SIZE;
+               }
+               goto flush_and_return;
+       }
+       if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
+           p->arg.category_mask == PAGE_IS_WRITTEN &&
+           p->arg.return_mask == PAGE_IS_WRITTEN) {
+               for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
+                       unsigned long next = addr + PAGE_SIZE;
+                       if (pte_uffd_wp(ptep_get(pte)))
+                               continue;
+                       ret = pagemap_scan_output(p->cur_vma_category | PAGE_IS_WRITTEN,
+                                                 p, addr, &next);
+                       if (next == addr)
+                               break;
+                       if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+                               continue;
+                       make_uffd_wp_pte(vma, addr, pte);
+                       if (!flush_end)
+                               start = addr;
+                       flush_end = next;
+               }
+               goto flush_and_return;
+       }
+       for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
+               unsigned long categories = p->cur_vma_category |
+                                          pagemap_page_category(p, vma, addr, ptep_get(pte));
+               unsigned long next = addr + PAGE_SIZE;
+               if (!pagemap_scan_is_interesting_page(categories, p))
+                       continue;
+               ret = pagemap_scan_output(categories, p, addr, &next);
+               if (next == addr)
+                       break;
+               if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+                       continue;
+               if (~categories & PAGE_IS_WRITTEN)
+                       continue;
+               make_uffd_wp_pte(vma, addr, pte);
+               if (!flush_end)
+                       start = addr;
+               flush_end = next;
+       }
+ flush_and_return:
+       if (flush_end)
+               flush_tlb_range(vma, start, addr);
+       pte_unmap_unlock(start_pte, ptl);
+       arch_leave_lazy_mmu_mode();
+       cond_resched();
+       return ret;
+ }
+ #ifdef CONFIG_HUGETLB_PAGE
+ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+                                     unsigned long start, unsigned long end,
+                                     struct mm_walk *walk)
+ {
+       struct pagemap_scan_private *p = walk->private;
+       struct vm_area_struct *vma = walk->vma;
+       unsigned long categories;
+       spinlock_t *ptl;
+       int ret = 0;
+       pte_t pte;
+       if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
+               /* Go the short route when not write-protecting pages. */
+               pte = huge_ptep_get(ptep);
+               categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+               if (!pagemap_scan_is_interesting_page(categories, p))
+                       return 0;
+               return pagemap_scan_output(categories, p, start, &end);
+       }
+       i_mmap_lock_write(vma->vm_file->f_mapping);
+       ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
+       pte = huge_ptep_get(ptep);
+       categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+       if (!pagemap_scan_is_interesting_page(categories, p))
+               goto out_unlock;
+       ret = pagemap_scan_output(categories, p, start, &end);
+       if (start == end)
+               goto out_unlock;
+       if (~categories & PAGE_IS_WRITTEN)
+               goto out_unlock;
+       if (end != start + HPAGE_SIZE) {
+               /* Partial HugeTLB page WP isn't possible. */
+               pagemap_scan_backout_range(p, start, end);
+               p->arg.walk_end = start;
+               ret = 0;
+               goto out_unlock;
+       }
+       make_uffd_wp_huge_pte(vma, start, ptep, pte);
+       flush_hugetlb_tlb_range(vma, start, end);
+ out_unlock:
+       spin_unlock(ptl);
+       i_mmap_unlock_write(vma->vm_file->f_mapping);
+       return ret;
+ }
+ #else
+ #define pagemap_scan_hugetlb_entry NULL
+ #endif
+ static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+                                int depth, struct mm_walk *walk)
+ {
+       struct pagemap_scan_private *p = walk->private;
+       struct vm_area_struct *vma = walk->vma;
+       int ret, err;
+       if (!vma || !pagemap_scan_is_interesting_page(p->cur_vma_category, p))
+               return 0;
+       ret = pagemap_scan_output(p->cur_vma_category, p, addr, &end);
+       if (addr == end)
+               return ret;
+       if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+               return ret;
+       err = uffd_wp_range(vma, addr, end - addr, true);
+       if (err < 0)
+               ret = err;
+       return ret;
+ }
+ static const struct mm_walk_ops pagemap_scan_ops = {
+       .test_walk = pagemap_scan_test_walk,
+       .pmd_entry = pagemap_scan_pmd_entry,
+       .pte_hole = pagemap_scan_pte_hole,
+       .hugetlb_entry = pagemap_scan_hugetlb_entry,
+ };
+ static int pagemap_scan_get_args(struct pm_scan_arg *arg,
+                                unsigned long uarg)
+ {
+       if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
+               return -EFAULT;
+       if (arg->size != sizeof(struct pm_scan_arg))
+               return -EINVAL;
+       /* Validate requested features */
+       if (arg->flags & ~PM_SCAN_FLAGS)
+               return -EINVAL;
+       if ((arg->category_inverted | arg->category_mask |
+            arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
+               return -EINVAL;
+       arg->start = untagged_addr((unsigned long)arg->start);
+       arg->end = untagged_addr((unsigned long)arg->end);
+       arg->vec = untagged_addr((unsigned long)arg->vec);
+       /* Validate memory pointers */
+       if (!IS_ALIGNED(arg->start, PAGE_SIZE))
+               return -EINVAL;
+       if (!access_ok((void __user *)(long)arg->start, arg->end - arg->start))
+               return -EFAULT;
+       if (!arg->vec && arg->vec_len)
+               return -EINVAL;
+       if (arg->vec && !access_ok((void __user *)(long)arg->vec,
+                             arg->vec_len * sizeof(struct page_region)))
+               return -EFAULT;
+       /* Fixup default values */
+       arg->end = ALIGN(arg->end, PAGE_SIZE);
+       arg->walk_end = 0;
+       if (!arg->max_pages)
+               arg->max_pages = ULONG_MAX;
+       return 0;
+ }
+ static int pagemap_scan_writeback_args(struct pm_scan_arg *arg,
+                                      unsigned long uargl)
+ {
+       struct pm_scan_arg __user *uarg = (void __user *)uargl;
+       if (copy_to_user(&uarg->walk_end, &arg->walk_end, sizeof(arg->walk_end)))
+               return -EFAULT;
+       return 0;
+ }
+ static int pagemap_scan_init_bounce_buffer(struct pagemap_scan_private *p)
+ {
+       if (!p->arg.vec_len)
+               return 0;
+       p->vec_buf_len = min_t(size_t, PAGEMAP_WALK_SIZE >> PAGE_SHIFT,
+                              p->arg.vec_len);
+       p->vec_buf = kmalloc_array(p->vec_buf_len, sizeof(*p->vec_buf),
+                                  GFP_KERNEL);
+       if (!p->vec_buf)
+               return -ENOMEM;
+       p->vec_buf->start = p->vec_buf->end = 0;
+       p->vec_out = (struct page_region __user *)(long)p->arg.vec;
+       return 0;
+ }
+ static long pagemap_scan_flush_buffer(struct pagemap_scan_private *p)
+ {
+       const struct page_region *buf = p->vec_buf;
+       long n = p->vec_buf_index;
+       if (!p->vec_buf)
+               return 0;
+       if (buf[n].end != buf[n].start)
+               n++;
+       if (!n)
+               return 0;
+       if (copy_to_user(p->vec_out, buf, n * sizeof(*buf)))
+               return -EFAULT;
+       p->arg.vec_len -= n;
+       p->vec_out += n;
+       p->vec_buf_index = 0;
+       p->vec_buf_len = min_t(size_t, p->vec_buf_len, p->arg.vec_len);
+       p->vec_buf->start = p->vec_buf->end = 0;
+       return n;
+ }
+ static long do_pagemap_scan(struct mm_struct *mm, unsigned long uarg)
+ {
+       struct mmu_notifier_range range;
+       struct pagemap_scan_private p = {0};
+       unsigned long walk_start;
+       size_t n_ranges_out = 0;
+       int ret;
+       ret = pagemap_scan_get_args(&p.arg, uarg);
+       if (ret)
+               return ret;
+       p.masks_of_interest = p.arg.category_mask | p.arg.category_anyof_mask |
+                             p.arg.return_mask;
+       ret = pagemap_scan_init_bounce_buffer(&p);
+       if (ret)
+               return ret;
+       /* Protection change for the range is going to happen. */
+       if (p.arg.flags & PM_SCAN_WP_MATCHING) {
+               mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+                                       mm, p.arg.start, p.arg.end);
+               mmu_notifier_invalidate_range_start(&range);
+       }
+       for (walk_start = p.arg.start; walk_start < p.arg.end;
+                       walk_start = p.arg.walk_end) {
+               long n_out;
+               if (fatal_signal_pending(current)) {
+                       ret = -EINTR;
+                       break;
+               }
+               ret = mmap_read_lock_killable(mm);
+               if (ret)
+                       break;
+               ret = walk_page_range(mm, walk_start, p.arg.end,
+                                     &pagemap_scan_ops, &p);
+               mmap_read_unlock(mm);
+               n_out = pagemap_scan_flush_buffer(&p);
+               if (n_out < 0)
+                       ret = n_out;
+               else
+                       n_ranges_out += n_out;
+               if (ret != -ENOSPC)
+                       break;
+               if (p.arg.vec_len == 0 || p.found_pages == p.arg.max_pages)
+                       break;
+       }
+       /* ENOSPC signifies early stop (buffer full) from the walk. */
+       if (!ret || ret == -ENOSPC)
+               ret = n_ranges_out;
+       /* The walk_end isn't set when ret is zero */
+       if (!p.arg.walk_end)
+               p.arg.walk_end = p.arg.end;
+       if (pagemap_scan_writeback_args(&p.arg, uarg))
+               ret = -EFAULT;
+       if (p.arg.flags & PM_SCAN_WP_MATCHING)
+               mmu_notifier_invalidate_range_end(&range);
+       kfree(p.vec_buf);
+       return ret;
+ }
+ static long do_pagemap_cmd(struct file *file, unsigned int cmd,
+                          unsigned long arg)
+ {
+       struct mm_struct *mm = file->private_data;
+       switch (cmd) {
+       case PAGEMAP_SCAN:
+               return do_pagemap_scan(mm, arg);
+       default:
+               return -EINVAL;
+       }
+ }
  const struct file_operations proc_pagemap_operations = {
        .llseek         = mem_lseek, /* borrow this */
        .read           = pagemap_read,
        .open           = pagemap_open,
        .release        = pagemap_release,
+       .unlocked_ioctl = do_pagemap_cmd,
+       .compat_ioctl   = do_pagemap_cmd,
  };
  #endif /* CONFIG_PROC_PAGE_MONITOR */
  
@@@ -1945,8 -2673,9 +2673,9 @@@ static int show_numa_map(struct seq_fil
        struct numa_maps *md = &numa_priv->md;
        struct file *file = vma->vm_file;
        struct mm_struct *mm = vma->vm_mm;
-       struct mempolicy *pol;
        char buffer[64];
+       struct mempolicy *pol;
+       pgoff_t ilx;
        int nid;
  
        if (!mm)
        /* Ensure we start with an empty set of numa_maps statistics. */
        memset(md, 0, sizeof(*md));
  
-       pol = __get_vma_policy(vma, vma->vm_start);
+       pol = __get_vma_policy(vma, vma->vm_start, &ilx);
        if (pol) {
                mpol_to_str(buffer, sizeof(buffer), pol);
                mpol_cond_put(pol);
  
        if (file) {
                seq_puts(m, " file=");
 -              seq_file_path(m, file, "\n\t= ");
 +              seq_path(m, file_user_path(file), "\n\t= ");
        } else if (vma_is_initial_heap(vma)) {
                seq_puts(m, " heap");
        } else if (vma_is_initial_stack(vma)) {
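
The PAGEMAP_SCAN implementation added above is driven from userspace through
an ioctl on /proc/<pid>/pagemap. The sketch below is illustrative only and
assumes the uapi declarations that accompany this series (struct pm_scan_arg,
struct page_region, the PAGE_IS_* category bits and the PAGEMAP_SCAN ioctl
number in <linux/fs.h>), none of which appear in this excerpt:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* assumed: pm_scan_arg, page_region, PAGEMAP_SCAN, PAGE_IS_* */

/* Report the ranges of [start, end) in the current process whose pages the
 * interface classifies as written (PAGE_IS_WRITTEN), without write-protecting
 * anything. start must be page aligned; end is rounded up by the kernel. */
static int report_written(unsigned long start, unsigned long end)
{
        struct page_region vec[64];
        struct pm_scan_arg arg;
        int fd, n, i;

        fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
                return -1;

        memset(&arg, 0, sizeof(arg));
        arg.size = sizeof(arg);                 /* checked by pagemap_scan_get_args() */
        arg.start = start;
        arg.end = end;
        arg.vec = (unsigned long)vec;           /* output buffer of page_region */
        arg.vec_len = 64;
        arg.category_mask = PAGE_IS_WRITTEN;    /* only written pages are of interest */
        arg.return_mask = PAGE_IS_WRITTEN;      /* and only that bit is reported back */

        n = ioctl(fd, PAGEMAP_SCAN, &arg);      /* >= 0: number of ranges filled in */
        for (i = 0; i < n; i++)
                printf("%llx-%llx categories %llx\n",
                       (unsigned long long)vec[i].start,
                       (unsigned long long)vec[i].end,
                       (unsigned long long)vec[i].categories);

        close(fd);
        return n < 0 ? -1 : 0;
}

As pagemap_scan_get_args() and do_pagemap_scan() above show, the walk stops
early once vec or max_pages fills up and reports its progress back in
walk_end. Adding PM_SCAN_WP_MATCHING to arg.flags additionally
write-protects the matched pages, which relies on the VMA being set up for
async userfaultfd write-protection (see pagemap_scan_test_walk()); likewise,
PAGE_IS_WRITTEN here means "not currently uffd write-protected", so the
result is most meaningful on ranges tracked that way.
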
diff --combined fs/quota/dquot.c
index 023b91b4e1f0ab2e25ad9291e09e65edbfb6ba76,15030b0cd1c8461a22c87b64e23ae2dc48ecd7c2..58b5de081b5714e89f212f893caa59527f025f5d
@@@ -233,18 -233,19 +233,18 @@@ static void put_quota_format(struct quo
   * All dquots are placed to the end of inuse_list when first created, and this
   * list is used for invalidate operation, which must look at every dquot.
   *
 - * When the last reference of a dquot will be dropped, the dquot will be
 - * added to releasing_dquots. We'd then queue work item which would call
 + * When the last reference of a dquot is dropped, the dquot is added to
 + * releasing_dquots. We'll then queue a work item which will call
   * synchronize_srcu() and after that perform the final cleanup of all the
 - * dquots on the list. Both releasing_dquots and free_dquots use the
 - * dq_free list_head in the dquot struct. When a dquot is removed from
 - * releasing_dquots, a reference count is always subtracted, and if
 - * dq_count == 0 at that point, the dquot will be added to the free_dquots.
 + * dquots on the list. Each cleaned up dquot is moved to free_dquots list.
 + * Both releasing_dquots and free_dquots use the dq_free list_head in the dquot
 + * struct.
   *
 - * Unused dquots (dq_count == 0) are added to the free_dquots list when freed,
 - * and this list is searched whenever we need an available dquot.  Dquots are
 - * removed from the list as soon as they are used again, and
 - * dqstats.free_dquots gives the number of dquots on the list. When
 - * dquot is invalidated it's completely released from memory.
 + * Unused and cleaned up dquots are in the free_dquots list and this list is
 + * searched whenever we need an available dquot. Dquots are removed from the
 + * list as soon as they are used again and dqstats.free_dquots gives the number
 + * of dquots on the list. When a dquot is invalidated it's completely
 + * released from memory.
   *
   * Dirty dquots are added to the dqi_dirty_list of quota_info when mark
   * dirtied, and this list is searched when writing dirty dquots back to
@@@ -320,7 -321,6 +320,7 @@@ static inline void put_dquot_last(struc
  static inline void put_releasing_dquots(struct dquot *dquot)
  {
        list_add_tail(&dquot->dq_free, &releasing_dquots);
 +      set_bit(DQ_RELEASING_B, &dquot->dq_flags);
  }
  
  static inline void remove_free_dquot(struct dquot *dquot)
        if (list_empty(&dquot->dq_free))
                return;
        list_del_init(&dquot->dq_free);
 -      if (!atomic_read(&dquot->dq_count))
 +      if (!test_bit(DQ_RELEASING_B, &dquot->dq_flags))
                dqstats_dec(DQST_FREE_DQUOTS);
 +      else
 +              clear_bit(DQ_RELEASING_B, &dquot->dq_flags);
  }
  
  static inline void put_inuse(struct dquot *dquot)
@@@ -583,6 -581,12 +583,6 @@@ restart
                        continue;
                /* Wait for dquot users */
                if (atomic_read(&dquot->dq_count)) {
 -                      /* dquot in releasing_dquots, flush and retry */
 -                      if (!list_empty(&dquot->dq_free)) {
 -                              spin_unlock(&dq_list_lock);
 -                              goto restart;
 -                      }
 -
                        atomic_inc(&dquot->dq_count);
                        spin_unlock(&dq_list_lock);
                        /*
                         * restart. */
                        goto restart;
                }
 +              /*
 +               * The last user already dropped its reference but dquot didn't
 +               * get fully cleaned up yet. Restart the scan which flushes the
 +               * work cleaning up released dquots.
 +               */
 +              if (test_bit(DQ_RELEASING_B, &dquot->dq_flags)) {
 +                      spin_unlock(&dq_list_lock);
 +                      goto restart;
 +              }
                /*
                 * Quota now has no users and it has been written on last
                 * dqput()
@@@ -701,13 -696,6 +701,13 @@@ int dquot_writeback_dquots(struct super
                                                 dq_dirty);
  
                        WARN_ON(!dquot_active(dquot));
 +                      /* If the dquot is being released we should not touch it */
 +                      if (test_bit(DQ_RELEASING_B, &dquot->dq_flags)) {
 +                              spin_unlock(&dq_list_lock);
 +                              flush_delayed_work(&quota_release_work);
 +                              spin_lock(&dq_list_lock);
 +                              continue;
 +                      }
  
                        /* Now we have active dquot from which someone is
                         * holding reference so we can safely just increase
@@@ -803,12 -791,6 +803,6 @@@ dqcache_shrink_count(struct shrinker *s
        percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
  }
  
- static struct shrinker dqcache_shrinker = {
-       .count_objects = dqcache_shrink_count,
-       .scan_objects = dqcache_shrink_scan,
-       .seeks = DEFAULT_SEEKS,
- };
  /*
   * Safely release dquot and put reference to dquot.
   */
@@@ -821,18 -803,18 +815,18 @@@ static void quota_release_workfn(struc
        /* Exchange the list head to avoid livelock. */
        list_replace_init(&releasing_dquots, &rls_head);
        spin_unlock(&dq_list_lock);
 +      synchronize_srcu(&dquot_srcu);
  
  restart:
 -      synchronize_srcu(&dquot_srcu);
        spin_lock(&dq_list_lock);
        while (!list_empty(&rls_head)) {
                dquot = list_first_entry(&rls_head, struct dquot, dq_free);
 -              /* Dquot got used again? */
 -              if (atomic_read(&dquot->dq_count) > 1) {
 -                      remove_free_dquot(dquot);
 -                      atomic_dec(&dquot->dq_count);
 -                      continue;
 -              }
 +              WARN_ON_ONCE(atomic_read(&dquot->dq_count));
 +              /*
 +               * Note that DQ_RELEASING_B protects us from racing with
 +               * invalidate_dquots() calls so we are safe to work with the
 +               * dquot even after we drop dq_list_lock.
 +               */
                if (dquot_dirty(dquot)) {
                        spin_unlock(&dq_list_lock);
                        /* Commit dquot before releasing */
                }
                /* Dquot is inactive and clean, now move it to free list */
                remove_free_dquot(dquot);
 -              atomic_dec(&dquot->dq_count);
                put_dquot_last(dquot);
        }
        spin_unlock(&dq_list_lock);
@@@ -886,7 -869,6 +880,7 @@@ void dqput(struct dquot *dquot
        BUG_ON(!list_empty(&dquot->dq_free));
  #endif
        put_releasing_dquots(dquot);
 +      atomic_dec(&dquot->dq_count);
        spin_unlock(&dq_list_lock);
        queue_delayed_work(system_unbound_wq, &quota_release_work, 1);
  }
@@@ -975,7 -957,7 +969,7 @@@ we_slept
                dqstats_inc(DQST_LOOKUPS);
        }
        /* Wait for dq_lock - after this we know that either dquot_release() is
 -       * already finished or it will be canceled due to dq_count > 1 test */
 +       * already finished or it will be canceled due to dq_count > 0 test */
        wait_on_dquot(dquot);
        /* Read the dquot / allocate space in quota file */
        if (!dquot_active(dquot)) {
@@@ -2351,20 -2333,6 +2345,20 @@@ static int vfs_setup_quota_inode(struc
        if (sb_has_quota_loaded(sb, type))
                return -EBUSY;
  
 +      /*
 +       * Quota files should never be encrypted.  They should be thought of as
 +       * filesystem metadata, not user data.  New-style internal quota files
 +       * cannot be encrypted by users anyway, but old-style external quota
 +       * files could potentially be incorrectly created in an encrypted
 +       * directory, hence this explicit check.  Some reasons why encrypted
 +       * quota files don't work include: (1) some filesystems that support
 +       * encryption don't handle it in their quota_read and quota_write, and
 +       * (2) cleaning up encrypted quota files at unmount would need special
 +       * consideration, as quota files are cleaned up later than user files.
 +       */
 +      if (IS_ENCRYPTED(inode))
 +              return -EINVAL;
 +
        dqopt->files[type] = igrab(inode);
        if (!dqopt->files[type])
                return -EIO;
@@@ -2982,6 -2950,7 +2976,7 @@@ static int __init dquot_init(void
  {
        int i, ret;
        unsigned long nr_hash, order;
+       struct shrinker *dqcache_shrinker;
  
        printk(KERN_NOTICE "VFS: Disk quotas %s\n", __DQUOT_VERSION__);
  
        pr_info("VFS: Dquot-cache hash table entries: %ld (order %ld,"
                " %ld bytes)\n", nr_hash, order, (PAGE_SIZE << order));
  
-       if (register_shrinker(&dqcache_shrinker, "dquota-cache"))
-               panic("Cannot register dquot shrinker");
+       dqcache_shrinker = shrinker_alloc(0, "dquota-cache");
+       if (!dqcache_shrinker)
+               panic("Cannot allocate dquot shrinker");
+       dqcache_shrinker->count_objects = dqcache_shrink_count;
+       dqcache_shrinker->scan_objects = dqcache_shrink_scan;
+       shrinker_register(dqcache_shrinker);
  
        return 0;
  }
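
The dquot_init() change above is one instance of the dynamically allocated
shrinker API this merge converts callers to: rather than embedding a
struct shrinker and registering it with register_shrinker(), callers allocate
one, fill in the callbacks, and register it; shrinker_free() later
unregisters and frees it. A condensed sketch of the pattern (the my_cache_*
names are illustrative, not from this diff):

#include <linux/init.h>
#include <linux/shrinker.h>

/* The usual shrinker callbacks; bodies are cache specific (illustrative). */
static unsigned long my_cache_count(struct shrinker *s, struct shrink_control *sc);
static unsigned long my_cache_scan(struct shrinker *s, struct shrink_control *sc);

static struct shrinker *my_shrinker;

static int __init my_cache_init(void)
{
        /* The name shows up in debugfs (see shrinker_debugfs_rename() below). */
        my_shrinker = shrinker_alloc(0, "my-cache");
        if (!my_shrinker)
                return -ENOMEM;

        my_shrinker->count_objects = my_cache_count;    /* how much is reclaimable */
        my_shrinker->scan_objects = my_cache_scan;      /* do the actual reclaim */
        shrinker_register(my_shrinker);                 /* now visible to reclaim */
        return 0;
}

static void __exit my_cache_exit(void)
{
        /* Unregisters (if registered) and frees in one call. */
        shrinker_free(my_shrinker);
}

Per-instance state can be attached through ->private_data and NUMA/memcg
awareness requested via the flags argument, as the fs/super.c conversion
further down does with SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE and
->private_data = s.
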
diff --combined fs/reiserfs/inode.c
index c8572346556fcbd76afa7e1f28d7219665c36262,a9075c4843ed9982764b5d49e39378bc552cfe8a..1d825459ee6e62ee5723022833e40e28ccd1adba
@@@ -1257,9 -1257,11 +1257,9 @@@ static void init_inode(struct inode *in
                i_uid_write(inode, sd_v1_uid(sd));
                i_gid_write(inode, sd_v1_gid(sd));
                inode->i_size = sd_v1_size(sd);
 -              inode->i_atime.tv_sec = sd_v1_atime(sd);
 -              inode->i_mtime.tv_sec = sd_v1_mtime(sd);
 +              inode_set_atime(inode, sd_v1_atime(sd), 0);
 +              inode_set_mtime(inode, sd_v1_mtime(sd), 0);
                inode_set_ctime(inode, sd_v1_ctime(sd), 0);
 -              inode->i_atime.tv_nsec = 0;
 -              inode->i_mtime.tv_nsec = 0;
  
                inode->i_blocks = sd_v1_blocks(sd);
                inode->i_generation = le32_to_cpu(INODE_PKEY(inode)->k_dir_id);
                i_uid_write(inode, sd_v2_uid(sd));
                inode->i_size = sd_v2_size(sd);
                i_gid_write(inode, sd_v2_gid(sd));
 -              inode->i_mtime.tv_sec = sd_v2_mtime(sd);
 -              inode->i_atime.tv_sec = sd_v2_atime(sd);
 +              inode_set_mtime(inode, sd_v2_mtime(sd), 0);
 +              inode_set_atime(inode, sd_v2_atime(sd), 0);
                inode_set_ctime(inode, sd_v2_ctime(sd), 0);
 -              inode->i_mtime.tv_nsec = 0;
 -              inode->i_atime.tv_nsec = 0;
                inode->i_blocks = sd_v2_blocks(sd);
                rdev = sd_v2_rdev(sd);
                if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
@@@ -1366,9 -1370,9 +1366,9 @@@ static void inode2sd(void *sd, struct i
        set_sd_v2_uid(sd_v2, i_uid_read(inode));
        set_sd_v2_size(sd_v2, size);
        set_sd_v2_gid(sd_v2, i_gid_read(inode));
 -      set_sd_v2_mtime(sd_v2, inode->i_mtime.tv_sec);
 -      set_sd_v2_atime(sd_v2, inode->i_atime.tv_sec);
 -      set_sd_v2_ctime(sd_v2, inode_get_ctime(inode).tv_sec);
 +      set_sd_v2_mtime(sd_v2, inode_get_mtime_sec(inode));
 +      set_sd_v2_atime(sd_v2, inode_get_atime_sec(inode));
 +      set_sd_v2_ctime(sd_v2, inode_get_ctime_sec(inode));
        set_sd_v2_blocks(sd_v2, to_fake_used_blocks(inode, SD_V2_SIZE));
        if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
                set_sd_v2_rdev(sd_v2, new_encode_dev(inode->i_rdev));
@@@ -1387,9 -1391,9 +1387,9 @@@ static void inode2sd_v1(void *sd, struc
        set_sd_v1_gid(sd_v1, i_gid_read(inode));
        set_sd_v1_nlink(sd_v1, inode->i_nlink);
        set_sd_v1_size(sd_v1, size);
 -      set_sd_v1_atime(sd_v1, inode->i_atime.tv_sec);
 -      set_sd_v1_ctime(sd_v1, inode_get_ctime(inode).tv_sec);
 -      set_sd_v1_mtime(sd_v1, inode->i_mtime.tv_sec);
 +      set_sd_v1_atime(sd_v1, inode_get_atime_sec(inode));
 +      set_sd_v1_ctime(sd_v1, inode_get_ctime_sec(inode));
 +      set_sd_v1_mtime(sd_v1, inode_get_mtime_sec(inode));
  
        if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
                set_sd_v1_rdev(sd_v1, new_encode_dev(inode->i_rdev));
@@@ -1980,7 -1984,7 +1980,7 @@@ int reiserfs_new_inode(struct reiserfs_
  
        /* uid and gid must already be set by the caller for quota init */
  
 -      inode->i_mtime = inode->i_atime = inode_set_ctime_current(inode);
 +      simple_inode_init_ts(inode);
        inode->i_size = i_size;
        inode->i_blocks = 0;
        inode->i_bytes = 0;
   * start/recovery path as __block_write_full_folio, along with special
   * code to handle reiserfs tails.
   */
- static int reiserfs_write_full_page(struct page *page,
+ static int reiserfs_write_full_folio(struct folio *folio,
                                    struct writeback_control *wbc)
  {
-       struct inode *inode = page->mapping->host;
+       struct inode *inode = folio->mapping->host;
        unsigned long end_index = inode->i_size >> PAGE_SHIFT;
        int error = 0;
        unsigned long block;
        struct buffer_head *head, *bh;
        int partial = 0;
        int nr = 0;
-       int checked = PageChecked(page);
+       int checked = folio_test_checked(folio);
        struct reiserfs_transaction_handle th;
        struct super_block *s = inode->i_sb;
        int bh_per_page = PAGE_SIZE / s->s_blocksize;
  
        /* no logging allowed when nonblocking or from PF_MEMALLOC */
        if (checked && (current->flags & PF_MEMALLOC)) {
-               redirty_page_for_writepage(wbc, page);
-               unlock_page(page);
+               folio_redirty_for_writepage(wbc, folio);
+               folio_unlock(folio);
                return 0;
        }
  
        /*
-        * The page dirty bit is cleared before writepage is called, which
+        * The folio dirty bit is cleared before writepage is called, which
         * means we have to tell create_empty_buffers to make dirty buffers
-        * The page really should be up to date at this point, so tossing
+        * The folio really should be up to date at this point, so tossing
         * in the BH_Uptodate is just a sanity check.
         */
-       if (!page_has_buffers(page)) {
-               create_empty_buffers(page, s->s_blocksize,
+       head = folio_buffers(folio);
+       if (!head)
+               head = create_empty_buffers(folio, s->s_blocksize,
                                     (1 << BH_Dirty) | (1 << BH_Uptodate));
-       }
-       head = page_buffers(page);
  
        /*
-        * last page in the file, zero out any contents past the
+        * last folio in the file, zero out any contents past the
         * last byte in the file
         */
-       if (page->index >= end_index) {
+       if (folio->index >= end_index) {
                unsigned last_offset;
  
                last_offset = inode->i_size & (PAGE_SIZE - 1);
-               /* no file contents in this page */
-               if (page->index >= end_index + 1 || !last_offset) {
-                       unlock_page(page);
+               /* no file contents in this folio */
+               if (folio->index >= end_index + 1 || !last_offset) {
+                       folio_unlock(folio);
                        return 0;
                }
-               zero_user_segment(page, last_offset, PAGE_SIZE);
+               folio_zero_segment(folio, last_offset, folio_size(folio));
        }
        bh = head;
-       block = page->index << (PAGE_SHIFT - s->s_blocksize_bits);
+       block = folio->index << (PAGE_SHIFT - s->s_blocksize_bits);
        last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
        /* first map all the buffers, logging any direct items we find */
        do {
                if (block > last_block) {
                        /*
                         * This can happen when the block size is less than
-                        * the page size.  The corresponding bytes in the page
+                        * the folio size.  The corresponding bytes in the folio
                         * were zero filled above
                         */
                        clear_buffer_dirty(bh);
         * blocks we're going to log
         */
        if (checked) {
-               ClearPageChecked(page);
+               folio_clear_checked(folio);
                reiserfs_write_lock(s);
                error = journal_begin(&th, s, bh_per_page + 1);
                if (error) {
                }
                reiserfs_update_inode_transaction(inode);
        }
-       /* now go through and lock any dirty buffers on the page */
+       /* now go through and lock any dirty buffers on the folio */
        do {
                get_bh(bh);
                if (!buffer_mapped(bh))
                        lock_buffer(bh);
                } else {
                        if (!trylock_buffer(bh)) {
-                               redirty_page_for_writepage(wbc, page);
+                               folio_redirty_for_writepage(wbc, folio);
                                continue;
                        }
                }
                if (error)
                        goto fail;
        }
-       BUG_ON(PageWriteback(page));
-       set_page_writeback(page);
-       unlock_page(page);
+       BUG_ON(folio_test_writeback(folio));
+       folio_start_writeback(folio);
+       folio_unlock(folio);
  
        /*
-        * since any buffer might be the only dirty buffer on the page,
-        * the first submit_bh can bring the page out of writeback.
+        * since any buffer might be the only dirty buffer on the folio,
+        * the first submit_bh can bring the folio out of writeback.
         * be careful with the buffers.
         */
        do {
  done:
        if (nr == 0) {
                /*
-                * if this page only had a direct item, it is very possible for
+                * if this folio only had a direct item, it is very possible for
                 * no io to be required without there being an error.  Or,
                 * someone else could have locked them and sent them down the
-                * pipe without locking the page
+                * pipe without locking the folio
                 */
                bh = head;
                do {
                        bh = bh->b_this_page;
                } while (bh != head);
                if (!partial)
-                       SetPageUptodate(page);
-               end_page_writeback(page);
+                       folio_mark_uptodate(folio);
+               folio_end_writeback(folio);
        }
        return error;
  
  fail:
        /*
         * catches various errors, we need to make sure any valid dirty blocks
-        * get to the media.  The page is currently locked and not marked for
+        * get to the media.  The folio is currently locked and not marked for
         * writeback
         */
-       ClearPageUptodate(page);
+       folio_clear_uptodate(folio);
        bh = head;
        do {
                get_bh(bh);
                } else {
                        /*
                         * clear any dirty bits that might have come from
-                        * getting attached to a dirty page
+                        * getting attached to a dirty folio
                         */
                        clear_buffer_dirty(bh);
                }
                bh = bh->b_this_page;
        } while (bh != head);
-       SetPageError(page);
-       BUG_ON(PageWriteback(page));
-       set_page_writeback(page);
-       unlock_page(page);
+       folio_set_error(folio);
+       BUG_ON(folio_test_writeback(folio));
+       folio_start_writeback(folio);
+       folio_unlock(folio);
        do {
                struct buffer_head *next = bh->b_this_page;
                if (buffer_async_write(bh)) {
@@@ -2724,9 -2727,10 +2723,10 @@@ static int reiserfs_read_folio(struct f
  
  static int reiserfs_writepage(struct page *page, struct writeback_control *wbc)
  {
-       struct inode *inode = page->mapping->host;
+       struct folio *folio = page_folio(page);
+       struct inode *inode = folio->mapping->host;
        reiserfs_wait_on_write_block(inode->i_sb);
-       return reiserfs_write_full_page(page, wbc);
+       return reiserfs_write_full_folio(folio, wbc);
  }
  
  static void reiserfs_truncate_failed_write(struct inode *inode)
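
The reiserfs hunks above, like the ocfs2, ubifs and ufs ones elsewhere in
this diff, are part of the conversion away from touching inode->i_atime and
inode->i_mtime directly and toward accessor helpers. A compressed sketch of
the mapping (mtime_sec and on_disk_mtime are illustrative placeholders):

        /* Old style: poke the timespec64 fields directly. */
        inode->i_mtime.tv_sec = mtime_sec;
        inode->i_mtime.tv_nsec = 0;
        on_disk_mtime = inode->i_mtime.tv_sec;

        /* New style: go through the helpers. */
        inode_set_mtime(inode, mtime_sec, 0);
        on_disk_mtime = inode_get_mtime_sec(inode);

        /* New inode: set atime, mtime and ctime to the current time in one call. */
        simple_inode_init_ts(inode);
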
diff --combined fs/super.c
index c7b452e12e4c882a1ad599dae3b58433d8204b3a,816a22a5cad1be3aecccaf85897d35953dbfc4a2..77faad6627392629f41d141643dee7e07e539470
@@@ -178,7 -178,7 +178,7 @@@ static void super_wake(struct super_blo
   * One thing we have to be careful of with a per-sb shrinker is that we don't
   * drop the last active reference to the superblock from within the shrinker.
   * If that happens we could trigger unregistering the shrinker from within the
-  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+  * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
   * take a passive reference to the superblock to avoid this from occurring.
   */
  static unsigned long super_cache_scan(struct shrinker *shrink,
        long    dentries;
        long    inodes;
  
-       sb = container_of(shrink, struct super_block, s_shrink);
+       sb = shrink->private_data;
  
        /*
         * Deadlock avoidance.  We may hold various FS locks, and we don't want
@@@ -244,7 -244,7 +244,7 @@@ static unsigned long super_cache_count(
        struct super_block *sb;
        long    total_objects = 0;
  
-       sb = container_of(shrink, struct super_block, s_shrink);
+       sb = shrink->private_data;
  
        /*
         * We don't call super_trylock_shared() here as it is a scalability
@@@ -306,7 -306,7 +306,7 @@@ static void destroy_unused_super(struc
        security_sb_free(s);
        put_user_ns(s->s_user_ns);
        kfree(s->s_subtype);
-       free_prealloced_shrinker(&s->s_shrink);
+       shrinker_free(s->s_shrink);
        /* no delays needed */
        destroy_super_work(&s->destroy_work);
  }
@@@ -383,16 -383,19 +383,19 @@@ static struct super_block *alloc_super(
        s->s_time_min = TIME64_MIN;
        s->s_time_max = TIME64_MAX;
  
-       s->s_shrink.seeks = DEFAULT_SEEKS;
-       s->s_shrink.scan_objects = super_cache_scan;
-       s->s_shrink.count_objects = super_cache_count;
-       s->s_shrink.batch = 1024;
-       s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
-       if (prealloc_shrinker(&s->s_shrink, "sb-%s", type->name))
+       s->s_shrink = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
+                                    "sb-%s", type->name);
+       if (!s->s_shrink)
                goto fail;
-       if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
+       s->s_shrink->scan_objects = super_cache_scan;
+       s->s_shrink->count_objects = super_cache_count;
+       s->s_shrink->batch = 1024;
+       s->s_shrink->private_data = s;
+       if (list_lru_init_memcg(&s->s_dentry_lru, s->s_shrink))
                goto fail;
-       if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
+       if (list_lru_init_memcg(&s->s_inode_lru, s->s_shrink))
                goto fail;
        return s;
  
@@@ -477,7 -480,7 +480,7 @@@ void deactivate_locked_super(struct sup
  {
        struct file_system_type *fs = s->s_type;
        if (atomic_dec_and_test(&s->s_active)) {
-               unregister_shrinker(&s->s_shrink);
+               shrinker_free(s->s_shrink);
                fs->kill_sb(s);
  
                kill_super_notify(s);
@@@ -818,7 -821,7 +821,7 @@@ retry
        hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
        spin_unlock(&sb_lock);
        get_filesystem(s->s_type);
-       register_shrinker_prepared(&s->s_shrink);
+       shrinker_register(s->s_shrink);
        return s;
  
  share_extant_sb:
@@@ -901,7 -904,7 +904,7 @@@ retry
        hlist_add_head(&s->s_instances, &type->fs_supers);
        spin_unlock(&sb_lock);
        get_filesystem(type);
-       register_shrinker_prepared(&s->s_shrink);
+       shrinker_register(s->s_shrink);
        return s;
  }
  EXPORT_SYMBOL(sget);
@@@ -1419,48 -1422,32 +1422,48 @@@ EXPORT_SYMBOL(sget_dev)
  
  #ifdef CONFIG_BLOCK
  /*
 - * Lock a super block that the callers holds a reference to.
 + * Lock the superblock that is the holder of the bdev. Returns the superblock
 + * pointer if we successfully locked the superblock and it is alive. Otherwise
 + * we return NULL and just unlock bdev->bd_holder_lock.
   *
 - * The caller needs to ensure that the super_block isn't being freed while
 - * calling this function, e.g. by holding a lock over the call to this function
 - * and the place that clears the pointer to the superblock used by this function
 - * before freeing the superblock.
 + * The function must be called with bdev->bd_holder_lock and releases it.
   */
 -static bool super_lock_shared_active(struct super_block *sb)
 +static struct super_block *bdev_super_lock_shared(struct block_device *bdev)
 +      __releases(&bdev->bd_holder_lock)
  {
 -      bool born = super_lock_shared(sb);
 +      struct super_block *sb = bdev->bd_holder;
 +      bool born;
 +
 +      lockdep_assert_held(&bdev->bd_holder_lock);
 +      lockdep_assert_not_held(&sb->s_umount);
 +      lockdep_assert_not_held(&bdev->bd_disk->open_mutex);
 +
 +      /* Make sure sb doesn't go away from under us */
 +      spin_lock(&sb_lock);
 +      sb->s_count++;
 +      spin_unlock(&sb_lock);
 +      mutex_unlock(&bdev->bd_holder_lock);
  
 +      born = super_lock_shared(sb);
        if (!born || !sb->s_root || !(sb->s_flags & SB_ACTIVE)) {
                super_unlock_shared(sb);
 -              return false;
 +              put_super(sb);
 +              return NULL;
        }
 -      return true;
 +      /*
 +       * The superblock is active and we hold s_umount, we can drop our
 +       * temporary reference now.
 +       */
 +      put_super(sb);
 +      return sb;
  }
  
  static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
  {
 -      struct super_block *sb = bdev->bd_holder;
 -
 -      /* bd_holder_lock ensures that the sb isn't freed */
 -      lockdep_assert_held(&bdev->bd_holder_lock);
 +      struct super_block *sb;
  
 -      if (!super_lock_shared_active(sb))
 +      sb = bdev_super_lock_shared(bdev);
 +      if (!sb)
                return;
  
        if (!surprise)
  
  static void fs_bdev_sync(struct block_device *bdev)
  {
 -      struct super_block *sb = bdev->bd_holder;
 -
 -      lockdep_assert_held(&bdev->bd_holder_lock);
 +      struct super_block *sb;
  
 -      if (!super_lock_shared_active(sb))
 +      sb = bdev_super_lock_shared(bdev);
 +      if (!sb)
                return;
        sync_filesystem(sb);
        super_unlock_shared(sb);
@@@ -1494,16 -1482,14 +1497,16 @@@ int setup_bdev_super(struct super_bloc
                struct fs_context *fc)
  {
        blk_mode_t mode = sb_open_mode(sb_flags);
 +      struct bdev_handle *bdev_handle;
        struct block_device *bdev;
  
 -      bdev = blkdev_get_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
 -      if (IS_ERR(bdev)) {
 +      bdev_handle = bdev_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
 +      if (IS_ERR(bdev_handle)) {
                if (fc)
                        errorf(fc, "%s: Can't open blockdev", fc->source);
 -              return PTR_ERR(bdev);
 +              return PTR_ERR(bdev_handle);
        }
 +      bdev = bdev_handle->bdev;
  
        /*
         * This really should be in blkdev_get_by_dev, but right now can't due
         * writable from userspace even for a read-only block device.
         */
        if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
 -              blkdev_put(bdev, sb);
 +              bdev_release(bdev_handle);
                return -EACCES;
        }
  
                mutex_unlock(&bdev->bd_fsfreeze_mutex);
                if (fc)
                        warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
 -              blkdev_put(bdev, sb);
 +              bdev_release(bdev_handle);
                return -EBUSY;
        }
        spin_lock(&sb_lock);
 +      sb->s_bdev_handle = bdev_handle;
        sb->s_bdev = bdev;
        sb->s_bdi = bdi_get(bdev->bd_disk->bdi);
        if (bdev_stable_writes(bdev))
        mutex_unlock(&bdev->bd_fsfreeze_mutex);
  
        snprintf(sb->s_id, sizeof(sb->s_id), "%pg", bdev);
-       shrinker_debugfs_rename(&sb->s_shrink, "sb-%s:%s", sb->s_type->name,
+       shrinker_debugfs_rename(sb->s_shrink, "sb-%s:%s", sb->s_type->name,
                                sb->s_id);
        sb_set_blocksize(sb, block_size(bdev));
        return 0;
@@@ -1664,7 -1649,7 +1667,7 @@@ void kill_block_super(struct super_bloc
        generic_shutdown_super(sb);
        if (bdev) {
                sync_blockdev(bdev);
 -              blkdev_put(bdev, sb);
 +              bdev_release(sb->s_bdev_handle);
        }
  }
  
diff --combined fs/ubifs/super.c
index 366941d4a18a40bad66beecab9e48ac188c409ce,96f6a9118207a37e0380bc4d35cef52957d96128..0d0478815d4dbcf595699c0e8bb6ad7a5d21236f
@@@ -54,11 -54,7 +54,7 @@@ module_param_cb(default_version, &ubifs
  static struct kmem_cache *ubifs_inode_slab;
  
  /* UBIFS TNC shrinker description */
- static struct shrinker ubifs_shrinker_info = {
-       .scan_objects = ubifs_shrink_scan,
-       .count_objects = ubifs_shrink_count,
-       .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *ubifs_shrinker_info;
  
  /**
   * validate_inode - validate inode.
@@@ -142,10 -138,10 +138,10 @@@ struct inode *ubifs_iget(struct super_b
        set_nlink(inode, le32_to_cpu(ino->nlink));
        i_uid_write(inode, le32_to_cpu(ino->uid));
        i_gid_write(inode, le32_to_cpu(ino->gid));
 -      inode->i_atime.tv_sec  = (int64_t)le64_to_cpu(ino->atime_sec);
 -      inode->i_atime.tv_nsec = le32_to_cpu(ino->atime_nsec);
 -      inode->i_mtime.tv_sec  = (int64_t)le64_to_cpu(ino->mtime_sec);
 -      inode->i_mtime.tv_nsec = le32_to_cpu(ino->mtime_nsec);
 +      inode_set_atime(inode, (int64_t)le64_to_cpu(ino->atime_sec),
 +                      le32_to_cpu(ino->atime_nsec));
 +      inode_set_mtime(inode, (int64_t)le64_to_cpu(ino->mtime_sec),
 +                      le32_to_cpu(ino->mtime_nsec));
        inode_set_ctime(inode, (int64_t)le64_to_cpu(ino->ctime_sec),
                        le32_to_cpu(ino->ctime_nsec));
        inode->i_mode = le32_to_cpu(ino->mode);
@@@ -2373,7 -2369,7 +2369,7 @@@ static void inode_slab_ctor(void *obj
  
  static int __init ubifs_init(void)
  {
-       int err;
+       int err = -ENOMEM;
  
        BUILD_BUG_ON(sizeof(struct ubifs_ch) != 24);
  
        if (!ubifs_inode_slab)
                return -ENOMEM;
  
-       err = register_shrinker(&ubifs_shrinker_info, "ubifs-slab");
-       if (err)
+       ubifs_shrinker_info = shrinker_alloc(0, "ubifs-slab");
+       if (!ubifs_shrinker_info)
                goto out_slab;
  
+       ubifs_shrinker_info->count_objects = ubifs_shrink_count;
+       ubifs_shrinker_info->scan_objects = ubifs_shrink_scan;
+       shrinker_register(ubifs_shrinker_info);
        err = ubifs_compressors_init();
        if (err)
                goto out_shrinker;
@@@ -2467,7 -2468,7 +2468,7 @@@ out_dbg
        dbg_debugfs_exit();
        ubifs_compressors_exit();
  out_shrinker:
-       unregister_shrinker(&ubifs_shrinker_info);
+       shrinker_free(ubifs_shrinker_info);
  out_slab:
        kmem_cache_destroy(ubifs_inode_slab);
        return err;
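
The UBIFS hunks above show the shape of the new shrinker API from the lockless slab shrink series: shrinker_alloc() replaces the static struct shrinker, callbacks are filled in before shrinker_register(), and shrinker_free() does the combined unregister-and-free on teardown. A condensed sketch of that lifecycle; the example_* names are placeholders, not taken from this diff.

    #include <linux/shrinker.h>

    static struct shrinker *example_shrinker;

    static unsigned long example_shrink_count(struct shrinker *s,
                                              struct shrink_control *sc)
    {
            return SHRINK_EMPTY;            /* placeholder: nothing cached */
    }

    static unsigned long example_shrink_scan(struct shrinker *s,
                                             struct shrink_control *sc)
    {
            return SHRINK_STOP;             /* placeholder: nothing freed */
    }

    static int example_shrinker_init(void)
    {
            example_shrinker = shrinker_alloc(0, "example-cache");
            if (!example_shrinker)
                    return -ENOMEM;

            example_shrinker->count_objects = example_shrink_count;
            example_shrinker->scan_objects  = example_shrink_scan;
            shrinker_register(example_shrinker);    /* now visible to reclaim */
            return 0;
    }

    static void example_shrinker_exit(void)
    {
            shrinker_free(example_shrinker);        /* unregister and free */
    }
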
@@@ -2483,7 -2484,7 +2484,7 @@@ static void __exit ubifs_exit(void
        dbg_debugfs_exit();
        ubifs_sysfs_exit();
        ubifs_compressors_exit();
-       unregister_shrinker(&ubifs_shrinker_info);
+       shrinker_free(ubifs_shrinker_info);
  
        /*
         * Make sure all delayed rcu free inodes are flushed before we
diff --combined fs/ufs/inode.c
index 338e4b97312f62c4e6acf73a84c4a5e34e6e162a,5b289374d4fd0e1f0a9b8a8c9b787b7d173f8a93..ebce93b082817b76d065e54c82b0636b013cf411
@@@ -579,15 -579,13 +579,15 @@@ static int ufs1_read_inode(struct inod
        i_gid_write(inode, ufs_get_inode_gid(sb, ufs_inode));
  
        inode->i_size = fs64_to_cpu(sb, ufs_inode->ui_size);
 -      inode->i_atime.tv_sec = (signed)fs32_to_cpu(sb, ufs_inode->ui_atime.tv_sec);
 +      inode_set_atime(inode,
 +                      (signed)fs32_to_cpu(sb, ufs_inode->ui_atime.tv_sec),
 +                      0);
        inode_set_ctime(inode,
                        (signed)fs32_to_cpu(sb, ufs_inode->ui_ctime.tv_sec),
                        0);
 -      inode->i_mtime.tv_sec = (signed)fs32_to_cpu(sb, ufs_inode->ui_mtime.tv_sec);
 -      inode->i_mtime.tv_nsec = 0;
 -      inode->i_atime.tv_nsec = 0;
 +      inode_set_mtime(inode,
 +                      (signed)fs32_to_cpu(sb, ufs_inode->ui_mtime.tv_sec),
 +                      0);
        inode->i_blocks = fs32_to_cpu(sb, ufs_inode->ui_blocks);
        inode->i_generation = fs32_to_cpu(sb, ufs_inode->ui_gen);
        ufsi->i_flags = fs32_to_cpu(sb, ufs_inode->ui_flags);
@@@ -628,12 -626,12 +628,12 @@@ static int ufs2_read_inode(struct inod
        i_gid_write(inode, fs32_to_cpu(sb, ufs2_inode->ui_gid));
  
        inode->i_size = fs64_to_cpu(sb, ufs2_inode->ui_size);
 -      inode->i_atime.tv_sec = fs64_to_cpu(sb, ufs2_inode->ui_atime);
 +      inode_set_atime(inode, fs64_to_cpu(sb, ufs2_inode->ui_atime),
 +                      fs32_to_cpu(sb, ufs2_inode->ui_atimensec));
        inode_set_ctime(inode, fs64_to_cpu(sb, ufs2_inode->ui_ctime),
                        fs32_to_cpu(sb, ufs2_inode->ui_ctimensec));
 -      inode->i_mtime.tv_sec = fs64_to_cpu(sb, ufs2_inode->ui_mtime);
 -      inode->i_atime.tv_nsec = fs32_to_cpu(sb, ufs2_inode->ui_atimensec);
 -      inode->i_mtime.tv_nsec = fs32_to_cpu(sb, ufs2_inode->ui_mtimensec);
 +      inode_set_mtime(inode, fs64_to_cpu(sb, ufs2_inode->ui_mtime),
 +                      fs32_to_cpu(sb, ufs2_inode->ui_mtimensec));
        inode->i_blocks = fs64_to_cpu(sb, ufs2_inode->ui_blocks);
        inode->i_generation = fs32_to_cpu(sb, ufs2_inode->ui_gen);
        ufsi->i_flags = fs32_to_cpu(sb, ufs2_inode->ui_flags);
@@@ -727,14 -725,12 +727,14 @@@ static void ufs1_update_inode(struct in
        ufs_set_inode_gid(sb, ufs_inode, i_gid_read(inode));
  
        ufs_inode->ui_size = cpu_to_fs64(sb, inode->i_size);
 -      ufs_inode->ui_atime.tv_sec = cpu_to_fs32(sb, inode->i_atime.tv_sec);
 +      ufs_inode->ui_atime.tv_sec = cpu_to_fs32(sb,
 +                                               inode_get_atime_sec(inode));
        ufs_inode->ui_atime.tv_usec = 0;
        ufs_inode->ui_ctime.tv_sec = cpu_to_fs32(sb,
 -                                               inode_get_ctime(inode).tv_sec);
 +                                               inode_get_ctime_sec(inode));
        ufs_inode->ui_ctime.tv_usec = 0;
 -      ufs_inode->ui_mtime.tv_sec = cpu_to_fs32(sb, inode->i_mtime.tv_sec);
 +      ufs_inode->ui_mtime.tv_sec = cpu_to_fs32(sb,
 +                                               inode_get_mtime_sec(inode));
        ufs_inode->ui_mtime.tv_usec = 0;
        ufs_inode->ui_blocks = cpu_to_fs32(sb, inode->i_blocks);
        ufs_inode->ui_flags = cpu_to_fs32(sb, ufsi->i_flags);
@@@ -774,15 -770,13 +774,15 @@@ static void ufs2_update_inode(struct in
        ufs_inode->ui_gid = cpu_to_fs32(sb, i_gid_read(inode));
  
        ufs_inode->ui_size = cpu_to_fs64(sb, inode->i_size);
 -      ufs_inode->ui_atime = cpu_to_fs64(sb, inode->i_atime.tv_sec);
 -      ufs_inode->ui_atimensec = cpu_to_fs32(sb, inode->i_atime.tv_nsec);
 -      ufs_inode->ui_ctime = cpu_to_fs64(sb, inode_get_ctime(inode).tv_sec);
 +      ufs_inode->ui_atime = cpu_to_fs64(sb, inode_get_atime_sec(inode));
 +      ufs_inode->ui_atimensec = cpu_to_fs32(sb,
 +                                            inode_get_atime_nsec(inode));
 +      ufs_inode->ui_ctime = cpu_to_fs64(sb, inode_get_ctime_sec(inode));
        ufs_inode->ui_ctimensec = cpu_to_fs32(sb,
 -                                            inode_get_ctime(inode).tv_nsec);
 -      ufs_inode->ui_mtime = cpu_to_fs64(sb, inode->i_mtime.tv_sec);
 -      ufs_inode->ui_mtimensec = cpu_to_fs32(sb, inode->i_mtime.tv_nsec);
 +                                            inode_get_ctime_nsec(inode));
 +      ufs_inode->ui_mtime = cpu_to_fs64(sb, inode_get_mtime_sec(inode));
 +      ufs_inode->ui_mtimensec = cpu_to_fs32(sb,
 +                                            inode_get_mtime_nsec(inode));
  
        ufs_inode->ui_blocks = cpu_to_fs64(sb, inode->i_blocks);
        ufs_inode->ui_flags = cpu_to_fs32(sb, ufsi->i_flags);
@@@ -1063,7 -1057,7 +1063,7 @@@ static int ufs_alloc_lastblock(struct i
        struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi;
        unsigned i, end;
        sector_t lastfrag;
-       struct page *lastpage;
+       struct folio *folio;
        struct buffer_head *bh;
        u64 phys64;
  
  
        lastfrag--;
  
-       lastpage = ufs_get_locked_page(mapping, lastfrag >>
+       folio = ufs_get_locked_folio(mapping, lastfrag >>
                                       (PAGE_SHIFT - inode->i_blkbits));
-        if (IS_ERR(lastpage)) {
-                err = -EIO;
-                goto out;
-        }
-        end = lastfrag & ((1 << (PAGE_SHIFT - inode->i_blkbits)) - 1);
-        bh = page_buffers(lastpage);
-        for (i = 0; i < end; ++i)
-                bh = bh->b_this_page;
+       if (IS_ERR(folio)) {
+               err = -EIO;
+               goto out;
+       }
  
+       end = lastfrag & ((1 << (PAGE_SHIFT - inode->i_blkbits)) - 1);
+       bh = folio_buffers(folio);
+       for (i = 0; i < end; ++i)
+               bh = bh->b_this_page;
  
         err = ufs_getfrag_block(inode, lastfrag, bh, 1);
  
                */
               set_buffer_uptodate(bh);
               mark_buffer_dirty(bh);
-              set_page_dirty(lastpage);
+               folio_mark_dirty(folio);
         }
  
         if (lastfrag >= UFS_IND_FRAGMENT) {
               }
         }
  out_unlock:
-        ufs_put_locked_page(lastpage);
+        ufs_put_locked_folio(folio);
  out:
         return err;
  }
@@@ -1214,7 -1207,7 +1213,7 @@@ static int ufs_truncate(struct inode *i
        truncate_setsize(inode, size);
  
        ufs_truncate_blocks(inode);
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        mark_inode_dirty(inode);
  out:
        UFSD("EXIT: err %d\n", err);
diff --combined fs/xfs/xfs_buf.c
index 003e157241da1e39a888f351f70d33b94f25e8b2,9e7ba04572db0526db853e4922605077fbefa129..545c7991b9b584cb576d33d55cdd67826e0fa753
@@@ -1913,8 -1913,7 +1913,7 @@@ xfs_buftarg_shrink_scan
        struct shrinker         *shrink,
        struct shrink_control   *sc)
  {
-       struct xfs_buftarg      *btp = container_of(shrink,
-                                       struct xfs_buftarg, bt_shrinker);
+       struct xfs_buftarg      *btp = shrink->private_data;
        LIST_HEAD(dispose);
        unsigned long           freed;
  
@@@ -1936,8 -1935,7 +1935,7 @@@ xfs_buftarg_shrink_count
        struct shrinker         *shrink,
        struct shrink_control   *sc)
  {
-       struct xfs_buftarg      *btp = container_of(shrink,
-                                       struct xfs_buftarg, bt_shrinker);
+       struct xfs_buftarg      *btp = shrink->private_data;
        return list_lru_shrink_count(&btp->bt_lru, sc);
  }
  
  xfs_free_buftarg(
        struct xfs_buftarg      *btp)
  {
-       unregister_shrinker(&btp->bt_shrinker);
 -      struct block_device     *bdev = btp->bt_bdev;
 -
+       shrinker_free(btp->bt_shrinker);
        ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
        percpu_counter_destroy(&btp->bt_io_count);
        list_lru_destroy(&btp->bt_lru);
  
        fs_put_dax(btp->bt_daxdev, btp->bt_mount);
        /* the main block device is closed by kill_block_super */
 -      if (bdev != btp->bt_mount->m_super->s_bdev)
 -              blkdev_put(bdev, btp->bt_mount->m_super);
 +      if (btp->bt_bdev != btp->bt_mount->m_super->s_bdev)
 +              bdev_release(btp->bt_bdev_handle);
  
        kmem_free(btp);
  }
@@@ -1988,15 -1988,16 +1986,15 @@@ xfs_setsize_buftarg
   */
  STATIC int
  xfs_setsize_buftarg_early(
 -      xfs_buftarg_t           *btp,
 -      struct block_device     *bdev)
 +      xfs_buftarg_t           *btp)
  {
 -      return xfs_setsize_buftarg(btp, bdev_logical_block_size(bdev));
 +      return xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev));
  }
  
  struct xfs_buftarg *
  xfs_alloc_buftarg(
        struct xfs_mount        *mp,
 -      struct block_device     *bdev)
 +      struct bdev_handle      *bdev_handle)
  {
        xfs_buftarg_t           *btp;
        const struct dax_holder_operations *ops = NULL;
        btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
  
        btp->bt_mount = mp;
 -      btp->bt_dev =  bdev->bd_dev;
 -      btp->bt_bdev = bdev;
 -      btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
 +      btp->bt_bdev_handle = bdev_handle;
 +      btp->bt_dev = bdev_handle->bdev->bd_dev;
 +      btp->bt_bdev = bdev_handle->bdev;
 +      btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
                                            mp, ops);
  
        /*
        ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
                             DEFAULT_RATELIMIT_BURST);
  
 -      if (xfs_setsize_buftarg_early(btp, bdev))
 +      if (xfs_setsize_buftarg_early(btp))
                goto error_free;
  
        if (list_lru_init(&btp->bt_lru))
        if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
                goto error_lru;
  
-       btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
-       btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
-       btp->bt_shrinker.seeks = DEFAULT_SEEKS;
-       btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
-       if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s",
-                             mp->m_super->s_id))
+       btp->bt_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "xfs-buf:%s",
+                                         mp->m_super->s_id);
+       if (!btp->bt_shrinker)
                goto error_pcpu;
+       btp->bt_shrinker->count_objects = xfs_buftarg_shrink_count;
+       btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
+       btp->bt_shrinker->private_data = btp;
+       shrinker_register(btp->bt_shrinker);
        return btp;
  
  error_pcpu:
diff --combined fs/xfs/xfs_buf.h
index ada9d310b7d3a0b6810d364b57faeb2fa02e3d53,702e7d9ea2ac9ced09af7035e55c03096e0fc0ea..c86e16419656875e6ea395c5ea45c7c338eb0ea4
@@@ -98,7 -98,6 +98,7 @@@ typedef unsigned int xfs_buf_flags_t
   */
  typedef struct xfs_buftarg {
        dev_t                   bt_dev;
 +      struct bdev_handle      *bt_bdev_handle;
        struct block_device     *bt_bdev;
        struct dax_device       *bt_daxdev;
        u64                     bt_dax_part_off;
        size_t                  bt_logical_sectormask;
  
        /* LRU control structures */
-       struct shrinker         bt_shrinker;
+       struct shrinker         *bt_shrinker;
        struct list_lru         bt_lru;
  
        struct percpu_counter   bt_io_count;
@@@ -365,7 -364,7 +365,7 @@@ xfs_buf_update_cksum(struct xfs_buf *bp
   *    Handling of buftargs.
   */
  struct xfs_buftarg *xfs_alloc_buftarg(struct xfs_mount *mp,
 -              struct block_device *bdev);
 +              struct bdev_handle *bdev_handle);
  extern void xfs_free_buftarg(struct xfs_buftarg *);
  extern void xfs_buftarg_wait(struct xfs_buftarg *);
  extern void xfs_buftarg_drain(struct xfs_buftarg *);
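
The XFS buftarg hunks illustrate the other half of the shrinker conversion: with struct shrinker no longer embedded in struct xfs_buftarg, the container_of() lookups in the count/scan callbacks give way to the private_data pointer set at allocation time. A hedged sketch of that callback pattern, with struct example_cache standing in for the buftarg:

    struct example_cache {
            struct list_lru lru;
            struct shrinker *shrinker;      /* was: struct shrinker shrinker; */
    };

    static unsigned long example_cache_count(struct shrinker *shrink,
                                             struct shrink_control *sc)
    {
            /* replaces: container_of(shrink, struct example_cache, shrinker) */
            struct example_cache *cache = shrink->private_data;

            return list_lru_shrink_count(&cache->lru, sc);
    }
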
diff --combined include/linux/cgroup-defs.h
index 265da00a1a8b1b34c2405a6ae1e31d18cadb389f,8641f4320c9803b8c3aab4a54f9d16bf833fa069..4a6b6b77ccb6c346f112e8c83ea7a451c8e1d300
@@@ -115,6 -115,11 +115,11 @@@ enum 
         * Enable recursive subtree protection
         */
        CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
+       /*
+        * Enable hugetlb accounting for the memory controller.
+        */
+        CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
  };
  
  /* cftype->flags */
@@@ -238,7 -243,7 +243,7 @@@ struct css_set 
         * Lists running through all tasks using this cgroup group.
         * mg_tasks lists tasks which belong to this cset but are in the
         * process of being migrated out or in.  Protected by
 -       * css_set_rwsem, but, during migration, once tasks are moved to
 +       * css_set_lock, but, during migration, once tasks are moved to
         * mg_tasks, it can be read safely while holding cgroup_mutex.
         */
        struct list_head tasks;
diff --combined include/linux/fs.h
index c27c324ba58ac0a90aecc945b25bc14d992d4625,5265186da0e2022d7c355e4d12ced667e40f638c..98b7a7a8c42e36cc140d14797cf8eb00f7c36f76
@@@ -67,7 -67,7 +67,7 @@@ struct swap_info_struct
  struct seq_file;
  struct workqueue_struct;
  struct iov_iter;
 -struct fscrypt_info;
 +struct fscrypt_inode_info;
  struct fscrypt_operations;
  struct fsverity_info;
  struct fsverity_operations;
@@@ -454,7 -454,7 +454,7 @@@ extern const struct address_space_opera
   *   It is also used to block modification of page cache contents through
   *   memory mappings.
   * @gfp_mask: Memory allocation flags to use for allocating pages.
-  * @i_mmap_writable: Number of VM_SHARED mappings.
+  * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
   * @i_mmap: Tree of private and shared mappings.
   * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
@@@ -557,7 -557,7 +557,7 @@@ static inline int mapping_mapped(struc
  
  /*
   * Might pages of this file have been modified in userspace?
-  * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap
+  * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
   * marks vma as VM_SHARED if it is shared, and the file was opened for
   * writing i.e. vma may be mprotected writable even if now readonly.
   *
@@@ -671,8 -671,8 +671,8 @@@ struct inode 
        };
        dev_t                   i_rdev;
        loff_t                  i_size;
 -      struct timespec64       i_atime;
 -      struct timespec64       i_mtime;
 +      struct timespec64       __i_atime;
 +      struct timespec64       __i_mtime;
        struct timespec64       __i_ctime; /* use inode_*_ctime accessors! */
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        unsigned short          i_bytes;
  #endif
  
  #ifdef CONFIG_FS_ENCRYPTION
 -      struct fscrypt_info     *i_crypt_info;
 +      struct fscrypt_inode_info       *i_crypt_info;
  #endif
  
  #ifdef CONFIG_FS_VERITY
@@@ -1042,10 -1042,7 +1042,10 @@@ static inline struct file *get_file(str
        atomic_long_inc(&f->f_count);
        return f;
  }
 -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
 +
 +struct file *get_file_rcu(struct file __rcu **f);
 +struct file *get_file_active(struct file **f);
 +
  #define file_count(x) atomic_long_read(&(x)->f_count)
  
  #define       MAX_NON_LFS     ((1UL<<31) - 1)
@@@ -1122,7 -1119,7 +1122,7 @@@ extern int send_sigurg(struct fown_stru
  #define SB_NOATIME      BIT(10)       /* Do not update access times. */
  #define SB_NODIRATIME   BIT(11)       /* Do not update directory access times */
  #define SB_SILENT       BIT(15)
 -#define SB_POSIXACL     BIT(16)       /* VFS does not apply the umask */
 +#define SB_POSIXACL     BIT(16)       /* Supports POSIX ACLs */
  #define SB_INLINECRYPT  BIT(17)       /* Use blk-crypto for encrypted files */
  #define SB_KERNMOUNT    BIT(22)       /* this is a kern_mount call */
  #define SB_I_VERSION    BIT(23)       /* Update inode I_version field */
  #define SB_I_PERSB_BDI        0x00000200      /* has a per-sb bdi */
  #define SB_I_TS_EXPIRY_WARNED 0x00000400 /* warned about timestamp range expiry */
  #define SB_I_RETIRED  0x00000800      /* superblock shouldn't be reused */
 +#define SB_I_NOUMASK  0x00001000      /* VFS does not apply umask */
  
  /* Possible states of 'frozen' field */
  enum {
@@@ -1210,7 -1206,7 +1210,7 @@@ struct super_block 
  #ifdef CONFIG_SECURITY
        void                    *s_security;
  #endif
 -      const struct xattr_handler **s_xattr;
 +      const struct xattr_handler * const *s_xattr;
  #ifdef CONFIG_FS_ENCRYPTION
        const struct fscrypt_operations *s_cop;
        struct fscrypt_keyring  *s_master_keys; /* master crypto keys in use */
        struct hlist_bl_head    s_roots;        /* alternate root dentries for NFS */
        struct list_head        s_mounts;       /* list of mounts; _not_ for fs use */
        struct block_device     *s_bdev;
 +      struct bdev_handle      *s_bdev_handle;
        struct backing_dev_info *s_bdi;
        struct mtd_info         *s_mtd;
        struct hlist_node       s_instances;
  
        const struct dentry_operations *s_d_op; /* default d_op for dentries */
  
-       struct shrinker s_shrink;       /* per-sb shrinker handle */
+       struct shrinker *s_shrink;      /* per-sb shrinker handle */
  
        /* Number of inodes with nlink == 0 but still referenced */
        atomic_long_t s_remove_count;
@@@ -1516,81 -1511,24 +1516,81 @@@ static inline bool fsuidgid_has_mapping
  struct timespec64 current_time(struct inode *inode);
  struct timespec64 inode_set_ctime_current(struct inode *inode);
  
 -/**
 - * inode_get_ctime - fetch the current ctime from the inode
 - * @inode: inode from which to fetch ctime
 - *
 - * Grab the current ctime from the inode and return it.
 - */
 +static inline time64_t inode_get_atime_sec(const struct inode *inode)
 +{
 +      return inode->__i_atime.tv_sec;
 +}
 +
 +static inline long inode_get_atime_nsec(const struct inode *inode)
 +{
 +      return inode->__i_atime.tv_nsec;
 +}
 +
 +static inline struct timespec64 inode_get_atime(const struct inode *inode)
 +{
 +      return inode->__i_atime;
 +}
 +
 +static inline struct timespec64 inode_set_atime_to_ts(struct inode *inode,
 +                                                    struct timespec64 ts)
 +{
 +      inode->__i_atime = ts;
 +      return ts;
 +}
 +
 +static inline struct timespec64 inode_set_atime(struct inode *inode,
 +                                              time64_t sec, long nsec)
 +{
 +      struct timespec64 ts = { .tv_sec  = sec,
 +                               .tv_nsec = nsec };
 +      return inode_set_atime_to_ts(inode, ts);
 +}
 +
 +static inline time64_t inode_get_mtime_sec(const struct inode *inode)
 +{
 +      return inode->__i_mtime.tv_sec;
 +}
 +
 +static inline long inode_get_mtime_nsec(const struct inode *inode)
 +{
 +      return inode->__i_mtime.tv_nsec;
 +}
 +
 +static inline struct timespec64 inode_get_mtime(const struct inode *inode)
 +{
 +      return inode->__i_mtime;
 +}
 +
 +static inline struct timespec64 inode_set_mtime_to_ts(struct inode *inode,
 +                                                    struct timespec64 ts)
 +{
 +      inode->__i_mtime = ts;
 +      return ts;
 +}
 +
 +static inline struct timespec64 inode_set_mtime(struct inode *inode,
 +                                              time64_t sec, long nsec)
 +{
 +      struct timespec64 ts = { .tv_sec  = sec,
 +                               .tv_nsec = nsec };
 +      return inode_set_mtime_to_ts(inode, ts);
 +}
 +
 +static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 +{
 +      return inode->__i_ctime.tv_sec;
 +}
 +
 +static inline long inode_get_ctime_nsec(const struct inode *inode)
 +{
 +      return inode->__i_ctime.tv_nsec;
 +}
 +
  static inline struct timespec64 inode_get_ctime(const struct inode *inode)
  {
        return inode->__i_ctime;
  }
  
 -/**
 - * inode_set_ctime_to_ts - set the ctime in the inode
 - * @inode: inode in which to set the ctime
 - * @ts: value to set in the ctime field
 - *
 - * Set the ctime in @inode to @ts
 - */
  static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
                                                      struct timespec64 ts)
  {
@@@ -1615,8 -1553,6 +1615,8 @@@ static inline struct timespec64 inode_s
        return inode_set_ctime_to_ts(inode, ts);
  }
  
 +struct timespec64 simple_inode_init_ts(struct inode *inode);
 +
  /*
   * Snapshotting support.
   */
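
The accessor block above completes the timestamp API that the ufs and ubifs hunks earlier in this diff are converted to: i_atime and i_mtime become __i_atime and __i_mtime, so filesystems must go through inode_get/set_atime*() and inode_get/set_mtime*() just as they already do for ctime. A small sketch of the intended usage (the helper name is illustrative):

    /* Sketch: stamp mtime together with ctime and mirror it into atime,
     * using only the accessors introduced above. */
    static void example_touch(struct inode *inode)
    {
            struct timespec64 now = inode_set_ctime_current(inode);

            inode_set_mtime_to_ts(inode, now);              /* mtime = ctime */
            inode_set_atime(inode, now.tv_sec, now.tv_nsec);

            pr_debug("mtime %lld.%09ld\n",
                     (long long)inode_get_mtime_sec(inode),
                     inode_get_mtime_nsec(inode));
    }
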
@@@ -2145,12 -2081,7 +2145,12 @@@ static inline bool sb_rdonly(const stru
  #define IS_NOQUOTA(inode)     ((inode)->i_flags & S_NOQUOTA)
  #define IS_APPEND(inode)      ((inode)->i_flags & S_APPEND)
  #define IS_IMMUTABLE(inode)   ((inode)->i_flags & S_IMMUTABLE)
 +
 +#ifdef CONFIG_FS_POSIX_ACL
  #define IS_POSIXACL(inode)    __IS_FLG(inode, SB_POSIXACL)
 +#else
 +#define IS_POSIXACL(inode)    0
 +#endif
  
  #define IS_DEADDIR(inode)     ((inode)->i_flags & S_DEAD)
  #define IS_NOCMTIME(inode)    ((inode)->i_flags & S_NOCMTIME)
@@@ -2472,13 -2403,13 +2472,13 @@@ struct audit_names
  struct filename {
        const char              *name;  /* pointer to actual string */
        const __user char       *uptr;  /* original userland pointer */
 -      int                     refcnt;
 +      atomic_t                refcnt;
        struct audit_names      *aname;
        const char              iname[];
  };
  static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
  
 -static inline struct mnt_idmap *file_mnt_idmap(struct file *file)
 +static inline struct mnt_idmap *file_mnt_idmap(const struct file *file)
  {
        return mnt_idmap(file->f_path.mnt);
  }
@@@ -2517,24 -2448,24 +2517,24 @@@ struct file *dentry_open(const struct p
                         const struct cred *creds);
  struct file *dentry_create(const struct path *path, int flags, umode_t mode,
                           const struct cred *cred);
 -struct file *backing_file_open(const struct path *path, int flags,
 +struct file *backing_file_open(const struct path *user_path, int flags,
                               const struct path *real_path,
                               const struct cred *cred);
 -struct path *backing_file_real_path(struct file *f);
 +struct path *backing_file_user_path(struct file *f);
  
  /*
 - * file_real_path - get the path corresponding to f_inode
 + * file_user_path - get the path to display for memory mapped file
   *
 - * When opening a backing file for a stackable filesystem (e.g.,
 - * overlayfs) f_path may be on the stackable filesystem and f_inode on
 - * the underlying filesystem.  When the path associated with f_inode is
 - * needed, this helper should be used instead of accessing f_path
 - * directly.
 -*/
 -static inline const struct path *file_real_path(struct file *f)
 + * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
 + * stored in ->vm_file is a backing file whose f_inode is on the underlying
 + * filesystem.  When the mapped file path is displayed to user (e.g. via
 + * /proc/<pid>/maps), this helper should be used to get the path to display
 + * to the user, which is the path of the fd that user has requested to map.
 + */
 +static inline const struct path *file_user_path(struct file *f)
  {
        if (unlikely(f->f_mode & FMODE_BACKING))
 -              return backing_file_real_path(f);
 +              return backing_file_user_path(f);
        return &f->f_path;
  }
  
diff --combined include/linux/mm.h
index ba896e946651d7d51966a8992867477ae50e01d0,14d5aaff96d0f1b7c4398b1a693ea7724eb155d9..418d26608ece70d12a5608dff42f0f4d04af5aea
@@@ -362,6 -362,8 +362,6 @@@ extern unsigned int kobjsize(const voi
  # define VM_SAO               VM_ARCH_1       /* Strong Access Ordering (powerpc) */
  #elif defined(CONFIG_PARISC)
  # define VM_GROWSUP   VM_ARCH_1
 -#elif defined(CONFIG_IA64)
 -# define VM_GROWSUP   VM_ARCH_1
  #elif defined(CONFIG_SPARC64)
  # define VM_SPARC_ADI VM_ARCH_1       /* Uses ADI tag for access control */
  # define VM_ARCH_CLEAR        VM_SPARC_ADI
@@@ -617,7 -619,7 +617,7 @@@ struct vm_operations_struct 
         * policy.
         */
        struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
-                                       unsigned long addr);
+                                       unsigned long addr, pgoff_t *ilx);
  #endif
        /*
         * Called by vm_normal_page() for special PTEs to find the
@@@ -935,6 -937,17 +935,17 @@@ static inline bool vma_is_accessible(st
        return vma->vm_flags & VM_ACCESS_FLAGS;
  }
  
+ static inline bool is_shared_maywrite(vm_flags_t vm_flags)
+ {
+       return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
+               (VM_SHARED | VM_MAYWRITE);
+ }
+ static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
+ {
+       return is_shared_maywrite(vma->vm_flags);
+ }
  static inline
  struct vm_area_struct *vma_find(struct vma_iterator *vmi, unsigned long max)
  {
@@@ -1335,7 -1348,6 +1346,6 @@@ void set_pte_range(struct vm_fault *vmf
                struct page *page, unsigned int nr, unsigned long addr);
  
  vm_fault_t finish_fault(struct vm_fault *vmf);
- vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
  #endif
  
  /*
@@@ -1684,26 -1696,26 +1694,26 @@@ static inline bool __cpupid_match_pid(p
  
  #define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
- static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
+ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
  {
-       return xchg(&page->_last_cpupid, cpupid & LAST_CPUPID_MASK);
+       return xchg(&folio->_last_cpupid, cpupid & LAST_CPUPID_MASK);
  }
  
- static inline int page_cpupid_last(struct page *page)
+ static inline int folio_last_cpupid(struct folio *folio)
  {
-       return page->_last_cpupid;
+       return folio->_last_cpupid;
  }
  static inline void page_cpupid_reset_last(struct page *page)
  {
        page->_last_cpupid = -1 & LAST_CPUPID_MASK;
  }
  #else
- static inline int page_cpupid_last(struct page *page)
+ static inline int folio_last_cpupid(struct folio *folio)
  {
-       return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+       return (folio->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
  }
  
- extern int page_cpupid_xchg_last(struct page *page, int cpupid);
+ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
  
  static inline void page_cpupid_reset_last(struct page *page)
  {
  }
  #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
  
- static inline int xchg_page_access_time(struct page *page, int time)
+ static inline int folio_xchg_access_time(struct folio *folio, int time)
  {
        int last_time;
  
-       last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
+       last_time = folio_xchg_last_cpupid(folio,
+                                          time >> PAGE_ACCESS_TIME_BUCKETS);
        return last_time << PAGE_ACCESS_TIME_BUCKETS;
  }
  
@@@ -1724,24 -1737,24 +1735,24 @@@ static inline void vma_set_access_pid_b
        unsigned int pid_bit;
  
        pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
 -      if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
 -              __set_bit(pid_bit, &vma->numab_state->access_pids[1]);
 +      if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
 +              __set_bit(pid_bit, &vma->numab_state->pids_active[1]);
        }
  }
  #else /* !CONFIG_NUMA_BALANCING */
- static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
+ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
  {
-       return page_to_nid(page); /* XXX */
+       return folio_nid(folio); /* XXX */
  }
  
- static inline int xchg_page_access_time(struct page *page, int time)
+ static inline int folio_xchg_access_time(struct folio *folio, int time)
  {
        return 0;
  }
  
- static inline int page_cpupid_last(struct page *page)
+ static inline int folio_last_cpupid(struct folio *folio)
  {
-       return page_to_nid(page); /* XXX */
+       return folio_nid(folio); /* XXX */
  }
  
  static inline int cpupid_to_nid(int cpupid)
@@@ -2325,6 -2338,8 +2336,8 @@@ struct folio *vm_normal_folio(struct vm
                             pte_t pte);
  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
                             pte_t pte);
+ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
+                                 unsigned long addr, pmd_t pmd);
  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
                                pmd_t pmd);
  
@@@ -2411,8 -2426,6 +2424,6 @@@ extern int access_process_vm(struct tas
                void *buf, int len, unsigned int gup_flags);
  extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
                void *buf, int len, unsigned int gup_flags);
- extern int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
-                             void *buf, int len, unsigned int gup_flags);
  
  long get_user_pages_remote(struct mm_struct *mm,
                           unsigned long start, unsigned long nr_pages,
@@@ -2423,6 -2436,9 +2434,9 @@@ long pin_user_pages_remote(struct mm_st
                           unsigned int gup_flags, struct page **pages,
                           int *locked);
  
+ /*
+  * Retrieves a single page alongside its VMA. Does not support FOLL_NOWAIT.
+  */
  static inline struct page *get_user_page_vma_remote(struct mm_struct *mm,
                                                    unsigned long addr,
                                                    int gup_flags,
  {
        struct page *page;
        struct vm_area_struct *vma;
-       int got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);
+       int got;
+       if (WARN_ON_ONCE(unlikely(gup_flags & FOLL_NOWAIT)))
+               return ERR_PTR(-EINVAL);
+       got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);
  
        if (got < 0)
                return ERR_PTR(got);
-       if (got == 0)
-               return NULL;
  
        vma = vma_lookup(mm, addr);
        if (WARN_ON_ONCE(!vma)) {
@@@ -2478,7 -2497,7 +2495,7 @@@ int get_cmdline(struct task_struct *tas
  extern unsigned long move_page_tables(struct vm_area_struct *vma,
                unsigned long old_addr, struct vm_area_struct *new_vma,
                unsigned long new_addr, unsigned long len,
-               bool need_rmap_locks);
+               bool need_rmap_locks, bool for_stack);
  
  /*
   * Flags used by change_protection().  For now we make it a bitmap so
@@@ -2626,14 -2645,6 +2643,6 @@@ static inline void setmax_mm_hiwater_rs
                *maxrss = hiwater_rss;
  }
  
- #if defined(SPLIT_RSS_COUNTING)
- void sync_mm_rss(struct mm_struct *mm);
- #else
- static inline void sync_mm_rss(struct mm_struct *mm)
- {
- }
- #endif
  #ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
  static inline int pte_special(pte_t pte)
  {
@@@ -3055,6 -3066,22 +3064,22 @@@ static inline spinlock_t *pud_lock(stru
        return ptl;
  }
  
+ static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
+ {
+       struct folio *folio = ptdesc_folio(ptdesc);
+       __folio_set_pgtable(folio);
+       lruvec_stat_add_folio(folio, NR_PAGETABLE);
+ }
+ static inline void pagetable_pud_dtor(struct ptdesc *ptdesc)
+ {
+       struct folio *folio = ptdesc_folio(ptdesc);
+       __folio_clear_pgtable(folio);
+       lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+ }
  extern void __init pagecache_init(void);
  extern void free_initmem(void);
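
pagetable_pud_ctor()/pagetable_pud_dtor() above bring the PUD level into the same ptdesc-based page-table accounting (NR_PAGETABLE plus the pgtable page flag) used by the lower levels. A hedged sketch of where an architecture's PUD allocation helpers would call them; pagetable_alloc(), ptdesc_address(), virt_to_ptdesc() and GFP_PGTABLE_USER are assumed from the existing ptdesc API rather than shown in this diff:

    static pud_t *example_pud_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_USER, 0);

            if (!ptdesc)
                    return NULL;
            pagetable_pud_ctor(ptdesc);     /* mark as page table and account */
            return ptdesc_address(ptdesc);
    }

    static void example_pud_free_one(pud_t *pud)
    {
            struct ptdesc *ptdesc = virt_to_ptdesc(pud);

            pagetable_pud_dtor(ptdesc);     /* undo accounting */
            pagetable_free(ptdesc);
    }
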
  
@@@ -3219,22 -3246,73 +3244,73 @@@ extern int vma_expand(struct vma_iterat
                      struct vm_area_struct *next);
  extern int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
                       unsigned long start, unsigned long end, pgoff_t pgoff);
- extern struct vm_area_struct *vma_merge(struct vma_iterator *vmi,
-       struct mm_struct *, struct vm_area_struct *prev, unsigned long addr,
-       unsigned long end, unsigned long vm_flags, struct anon_vma *,
-       struct file *, pgoff_t, struct mempolicy *, struct vm_userfaultfd_ctx,
-       struct anon_vma_name *);
  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
- extern int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *,
-                      unsigned long addr, int new_below);
- extern int split_vma(struct vma_iterator *vmi, struct vm_area_struct *,
-                        unsigned long addr, int new_below);
  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
  extern void unlink_file_vma(struct vm_area_struct *);
  extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
        unsigned long addr, unsigned long len, pgoff_t pgoff,
        bool *need_rmap_locks);
  extern void exit_mmap(struct mm_struct *);
+ struct vm_area_struct *vma_modify(struct vma_iterator *vmi,
+                                 struct vm_area_struct *prev,
+                                 struct vm_area_struct *vma,
+                                 unsigned long start, unsigned long end,
+                                 unsigned long vm_flags,
+                                 struct mempolicy *policy,
+                                 struct vm_userfaultfd_ctx uffd_ctx,
+                                 struct anon_vma_name *anon_name);
+ /* We are about to modify the VMA's flags. */
+ static inline struct vm_area_struct
+ *vma_modify_flags(struct vma_iterator *vmi,
+                 struct vm_area_struct *prev,
+                 struct vm_area_struct *vma,
+                 unsigned long start, unsigned long end,
+                 unsigned long new_flags)
+ {
+       return vma_modify(vmi, prev, vma, start, end, new_flags,
+                         vma_policy(vma), vma->vm_userfaultfd_ctx,
+                         anon_vma_name(vma));
+ }
+ /* We are about to modify the VMA's flags and/or anon_name. */
+ static inline struct vm_area_struct
+ *vma_modify_flags_name(struct vma_iterator *vmi,
+                      struct vm_area_struct *prev,
+                      struct vm_area_struct *vma,
+                      unsigned long start,
+                      unsigned long end,
+                      unsigned long new_flags,
+                      struct anon_vma_name *new_name)
+ {
+       return vma_modify(vmi, prev, vma, start, end, new_flags,
+                         vma_policy(vma), vma->vm_userfaultfd_ctx, new_name);
+ }
+ /* We are about to modify the VMA's memory policy. */
+ static inline struct vm_area_struct
+ *vma_modify_policy(struct vma_iterator *vmi,
+                  struct vm_area_struct *prev,
+                  struct vm_area_struct *vma,
+                  unsigned long start, unsigned long end,
+                  struct mempolicy *new_pol)
+ {
+       return vma_modify(vmi, prev, vma, start, end, vma->vm_flags,
+                         new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+ }
+ /* We are about to modify the VMA's flags and/or uffd context. */
+ static inline struct vm_area_struct
+ *vma_modify_flags_uffd(struct vma_iterator *vmi,
+                      struct vm_area_struct *prev,
+                      struct vm_area_struct *vma,
+                      unsigned long start, unsigned long end,
+                      unsigned long new_flags,
+                      struct vm_userfaultfd_ctx new_ctx)
+ {
+       return vma_modify(vmi, prev, vma, start, end, new_flags,
+                         vma_policy(vma), new_ctx, anon_vma_name(vma));
+ }
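
vma_modify() and the wrappers above replace the open-coded vma_merge() calls that attribute-changing paths used to make: the caller states which attribute is about to change and gets back a VMA that covers exactly the requested range, split or merged as needed. A hedged sketch of the calling convention; the IS_ERR() return convention is inferred from the callers converted elsewhere in this series:

    /* Sketch: apply new vm_flags to [start, end) of @vma. */
    static int example_change_flags(struct vma_iterator *vmi,
                                    struct vm_area_struct *prev,
                                    struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    unsigned long newflags)
    {
            vma = vma_modify_flags(vmi, prev, vma, start, end, newflags);
            if (IS_ERR(vma))
                    return PTR_ERR(vma);    /* e.g. -ENOMEM from a failed split */

            /* vma now spans exactly [start, end); commit the new flags. */
            vm_flags_reset(vma, newflags);
            return 0;
    }
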
  
  static inline int check_data_rlimit(unsigned long rlim,
                                    unsigned long new,
@@@ -3306,7 -3384,8 +3382,7 @@@ static inline void mm_populate(unsigne
  static inline void mm_populate(unsigned long addr, unsigned long len) {}
  #endif
  
 -/* These take the mm semaphore themselves */
 -extern int __must_check vm_brk(unsigned long, unsigned long);
 +/* This takes the mm semaphore itself */
  extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
  extern int vm_munmap(unsigned long, size_t);
  extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
@@@ -3997,25 -4076,26 +4073,26 @@@ static inline void mem_dump_obj(void *o
  #endif
  
  /**
-  * seal_check_future_write - Check for F_SEAL_FUTURE_WRITE flag and handle it
+  * seal_check_write - Check for F_SEAL_WRITE or F_SEAL_FUTURE_WRITE flags and
+  *                    handle them.
   * @seals: the seals to check
   * @vma: the vma to operate on
   *
-  * Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on
-  * the vma flags.  Return 0 if check pass, or <0 for errors.
+  * Check whether F_SEAL_WRITE or F_SEAL_FUTURE_WRITE are set; if so, do proper
+  * check/handling on the vma flags.  Return 0 if check pass, or <0 for errors.
   */
- static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
+ static inline int seal_check_write(int seals, struct vm_area_struct *vma)
  {
-       if (seals & F_SEAL_FUTURE_WRITE) {
+       if (seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
                /*
                 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
-                * "future write" seal active.
+                * write seals are active.
                 */
                if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
                        return -EPERM;
  
                /*
-                * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
+                * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as
                 * MAP_SHARED and read-only, take care to not allow mprotect to
                 * revert protections on such mappings. Do this only for shared
                 * mappings. For private mappings, don't need to mask
@@@ -4059,4 -4139,11 +4136,11 @@@ static inline void accept_memory(phys_a
  
  #endif
  
+ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
+ {
+       phys_addr_t paddr = pfn << PAGE_SHIFT;
+       return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
+ }
  #endif /* _LINUX_MM_H */
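
Also in this file, seal_check_future_write() is generalised to seal_check_write() so that F_SEAL_WRITE is enforced on the mmap() path as well as F_SEAL_FUTURE_WRITE. A hedged sketch of the call-site shape in a memfd-style ->mmap handler; example_get_seals() is a placeholder for however the filesystem looks up its seal bits:

    static int example_mmap(struct file *file, struct vm_area_struct *vma)
    {
            int ret;

            /* Rejects shared-writable mappings of write-sealed files and
             * masks VM_MAYWRITE on shared read-only ones, per the helper. */
            ret = seal_check_write(example_get_seals(file), vma);
            if (ret)
                    return ret;

            file_accessed(file);
            vma->vm_ops = &generic_file_vm_ops;
            return 0;
    }
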
diff --combined include/linux/mm_types.h
index 4be8e310b189361fbf7044fd28d604304536d88e,692e41213cd3596954f1ce9ff87183b64973d66e..957ce38768b2a88d754112e72f923f7947ab0854
@@@ -125,7 -125,18 +125,7 @@@ struct page 
                        struct page_pool *pp;
                        unsigned long _pp_mapping_pad;
                        unsigned long dma_addr;
 -                      union {
 -                              /**
 -                               * dma_addr_upper: might require a 64-bit
 -                               * value on 32-bit architectures.
 -                               */
 -                              unsigned long dma_addr_upper;
 -                              /**
 -                               * For frag page support, not supported in
 -                               * 32-bit architectures with 64-bit DMA.
 -                               */
 -                              atomic_long_t pp_frag_count;
 -                      };
 +                      atomic_long_t pp_frag_count;
                };
                struct {        /* Tail pages of compound page */
                        unsigned long compound_head;    /* Bit zero is set */
                                           not kmapped, ie. highmem) */
  #endif /* WANT_PAGE_VIRTUAL */
  
+ #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+       int _last_cpupid;
+ #endif
  #ifdef CONFIG_KMSAN
        /*
         * KMSAN metadata for this page:
        struct page *kmsan_shadow;
        struct page *kmsan_origin;
  #endif
- #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
-       int _last_cpupid;
- #endif
  } _struct_page_alignment;
  
  /*
@@@ -261,6 -272,8 +261,8 @@@ typedef struct 
   * @_refcount: Do not access this member directly.  Use folio_ref_count()
   *    to find how many references there are to this folio.
   * @memcg_data: Memory Control Group data.
+  * @virtual: Virtual address in the kernel direct map.
+  * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
   * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
   * @_nr_pages_mapped: Do not use directly, call folio_mapcount().
   * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
@@@ -306,6 -319,12 +308,12 @@@ struct folio 
                        atomic_t _refcount;
  #ifdef CONFIG_MEMCG
                        unsigned long memcg_data;
+ #endif
+ #if defined(WANT_PAGE_VIRTUAL)
+                       void *virtual;
+ #endif
+ #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+                       int _last_cpupid;
  #endif
        /* private: the union with struct page is transitional */
                };
@@@ -362,6 -381,12 +370,12 @@@ FOLIO_MATCH(_refcount, _refcount)
  #ifdef CONFIG_MEMCG
  FOLIO_MATCH(memcg_data, memcg_data);
  #endif
+ #if defined(WANT_PAGE_VIRTUAL)
+ FOLIO_MATCH(virtual, virtual);
+ #endif
+ #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+ FOLIO_MATCH(_last_cpupid, _last_cpupid);
+ #endif
  #undef FOLIO_MATCH
  #define FOLIO_MATCH(pg, fl)                                           \
        static_assert(offsetof(struct folio, fl) ==                     \
@@@ -535,41 -560,35 +549,62 @@@ struct anon_vma_name 
        char name[];
  };
  
+ #ifdef CONFIG_ANON_VMA_NAME
+ /*
+  * mmap_lock should be read-locked when calling anon_vma_name(). Caller should
+  * either keep holding the lock while using the returned pointer or it should
+  * raise anon_vma_name refcount before releasing the lock.
+  */
+ struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma);
+ struct anon_vma_name *anon_vma_name_alloc(const char *name);
+ void anon_vma_name_free(struct kref *kref);
+ #else /* CONFIG_ANON_VMA_NAME */
+ static inline struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
+ {
+       return NULL;
+ }
+ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
+ {
+       return NULL;
+ }
+ #endif
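
The comment above states the locking rule for anon_vma_name(): call it with mmap_lock held for read, and either keep holding the lock while the pointer is in use or take a reference before dropping it. A minimal sketch of the lock-held case:

    static void example_print_vma_name(struct mm_struct *mm,
                                       struct vm_area_struct *vma)
    {
            struct anon_vma_name *anon_name;

            mmap_read_lock(mm);
            anon_name = anon_vma_name(vma);         /* NULL if unnamed */
            if (anon_name)
                    pr_info("vma name: %s\n", anon_name->name);
            mmap_read_unlock(mm);
    }
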
  struct vma_lock {
        struct rw_semaphore lock;
  };
  
  struct vma_numab_state {
 +      /*
 +       * Initialised as time in 'jiffies' after which VMA
 +       * should be scanned.  Delays first scan of new VMA by at
 +       * least sysctl_numa_balancing_scan_delay:
 +       */
        unsigned long next_scan;
 -      unsigned long next_pid_reset;
 -      unsigned long access_pids[2];
 +
 +      /*
 +       * Time in jiffies when pids_active[] is reset to
 +       * detect phase change behaviour:
 +       */
 +      unsigned long pids_active_reset;
 +
 +      /*
 +       * Approximate tracking of PIDs that trapped a NUMA hinting
 +       * fault. May produce false positives due to hash collisions.
 +       *
 +       *   [0] Previous PID tracking
 +       *   [1] Current PID tracking
 +       *
 +       * Window moves after next_pid_reset has expired approximately
 +       * every VMA_PID_RESET_PERIOD jiffies:
 +       */
 +      unsigned long pids_active[2];
 +
 +      /*
 +       * MM scan sequence ID when the VMA was last completely scanned.
 +       * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
 +       */
 +      int prev_scan_seq;
  };
  
  /*
@@@ -678,6 -697,12 +713,12 @@@ struct vm_area_struct 
        struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
  } __randomize_layout;
  
+ #ifdef CONFIG_NUMA
+ #define vma_policy(vma) ((vma)->vm_policy)
+ #else
+ #define vma_policy(vma) NULL
+ #endif
  #ifdef CONFIG_SCHED_MM_CID
  struct mm_cid {
        u64 time;
diff --combined include/linux/sched.h
index 12ec109ce8c9b3ea17d7b237c298be1511ac6a21,60de42715b5680ba5481822e648b1943f7732bf4..b49ca40f633550b191dd60e33d86fbf097f17a97
@@@ -63,6 -63,7 +63,6 @@@ struct robust_list_head
  struct root_domain;
  struct rq;
  struct sched_attr;
 -struct sched_param;
  struct seq_file;
  struct sighand_struct;
  struct signal_struct;
@@@ -369,10 -370,6 +369,10 @@@ extern struct root_domain def_root_doma
  extern struct mutex sched_domains_mutex;
  #endif
  
 +struct sched_param {
 +      int sched_priority;
 +};
 +
  struct sched_info {
  #ifdef CONFIG_SCHED_INFO
        /* Cumulative counters: */
@@@ -753,8 -750,10 +753,8 @@@ struct task_struct 
  #endif
        unsigned int                    __state;
  
 -#ifdef CONFIG_PREEMPT_RT
        /* saved state for "spinlock sleepers" */
        unsigned int                    saved_state;
 -#endif
  
        /*
         * This begins the randomizable portion of task_struct. Only
  
        struct mm_struct                *mm;
        struct mm_struct                *active_mm;
 +      struct address_space            *faults_disabled_mapping;
  
        int                             exit_state;
        int                             exit_code;
         * ->sched_remote_wakeup gets used, so it can be in this word.
         */
        unsigned                        sched_remote_wakeup:1;
 +#ifdef CONFIG_RT_MUTEXES
 +      unsigned                        sched_rt_mutex:1;
 +#endif
  
        /* Bit to tell LSMs we're in execve(): */
        unsigned                        in_execve:1;
        struct mem_cgroup               *active_memcg;
  #endif
  
+ #ifdef CONFIG_MEMCG_KMEM
+       struct obj_cgroup               *objcg;
+ #endif
  #ifdef CONFIG_BLK_CGROUP
        struct gendisk                  *throttle_disk;
  #endif
index b69afb8630db4a98a41bb6dbf90761369a510aa1,06a9d35650f07da2dc19aeda9f927153d6f7853d..52b22c5c396db0984d62d53e59ef3fd31711190c
  #define TNF_FAULT_LOCAL       0x08
  #define TNF_MIGRATE_FAIL 0x10
  
 +enum numa_vmaskip_reason {
 +      NUMAB_SKIP_UNSUITABLE,
 +      NUMAB_SKIP_SHARED_RO,
 +      NUMAB_SKIP_INACCESSIBLE,
 +      NUMAB_SKIP_SCAN_DELAY,
 +      NUMAB_SKIP_PID_INACTIVE,
 +      NUMAB_SKIP_IGNORE_PID,
 +      NUMAB_SKIP_SEQ_COMPLETED,
 +};
 +
  #ifdef CONFIG_NUMA_BALANCING
  extern void task_numa_fault(int last_node, int node, int pages, int flags);
  extern pid_t task_numa_group_id(struct task_struct *p);
  extern void set_numabalancing_state(bool enabled);
  extern void task_numa_free(struct task_struct *p, bool final);
extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
-                                       int src_nid, int dst_cpu);
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
+                               int src_nid, int dst_cpu);
  #else
  static inline void task_numa_fault(int last_node, int node, int pages,
                                   int flags)
@@@ -48,7 -38,7 +48,7 @@@ static inline void task_numa_free(struc
  {
  }
  static inline bool should_numa_migrate_memory(struct task_struct *p,
-                               struct page *page, int src_nid, int dst_cpu)
+                               struct folio *folio, int src_nid, int dst_cpu)
  {
        return true;
  }
diff --combined kernel/cgroup/cgroup.c
index 484adb375b155bc00bfaaf0bc808e4b1d43ea097,f11488b18ceb2c4adc3e472a78c83956bf10d2ab..1d5b9de3b1b9d01791b1222bf2fcbb4e46c852ee
@@@ -207,8 -207,6 +207,8 @@@ static u16 have_exit_callback __read_mo
  static u16 have_release_callback __read_mostly;
  static u16 have_canfork_callback __read_mostly;
  
 +static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 +
  /* cgroup namespace for init task */
  struct cgroup_namespace init_cgroup_ns = {
        .ns.count       = REFCOUNT_INIT(2),
@@@ -1352,9 -1350,7 +1352,9 @@@ static void cgroup_destroy_root(struct 
                cgroup_root_count--;
        }
  
 -      cgroup_favor_dynmods(root, false);
 +      if (!have_favordynmods)
 +              cgroup_favor_dynmods(root, false);
 +
        cgroup_exit_root_id(root);
  
        cgroup_unlock();
@@@ -1723,22 -1719,20 +1723,22 @@@ static int css_populate_dir(struct cgro
  
        if (!css->ss) {
                if (cgroup_on_dfl(cgrp)) {
 -                      ret = cgroup_addrm_files(&cgrp->self, cgrp,
 +                      ret = cgroup_addrm_files(css, cgrp,
                                                 cgroup_base_files, true);
                        if (ret < 0)
                                return ret;
  
                        if (cgroup_psi_enabled()) {
 -                              ret = cgroup_addrm_files(&cgrp->self, cgrp,
 +                              ret = cgroup_addrm_files(css, cgrp,
                                                         cgroup_psi_files, true);
                                if (ret < 0)
                                        return ret;
                        }
                } else {
 -                      cgroup_addrm_files(css, cgrp,
 -                                         cgroup1_base_files, true);
 +                      ret = cgroup_addrm_files(css, cgrp,
 +                                               cgroup1_base_files, true);
 +                      if (ret < 0)
 +                              return ret;
                }
        } else {
                list_for_each_entry(cfts, &css->ss->cfts, node) {
@@@ -1908,6 -1902,7 +1908,7 @@@ enum cgroup2_param 
        Opt_favordynmods,
        Opt_memory_localevents,
        Opt_memory_recursiveprot,
+       Opt_memory_hugetlb_accounting,
        nr__cgroup2_params
  };
  
@@@ -1916,6 -1911,7 +1917,7 @@@ static const struct fs_parameter_spec c
        fsparam_flag("favordynmods",            Opt_favordynmods),
        fsparam_flag("memory_localevents",      Opt_memory_localevents),
        fsparam_flag("memory_recursiveprot",    Opt_memory_recursiveprot),
+       fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
        {}
  };
  
@@@ -1942,6 -1938,9 +1944,9 @@@ static int cgroup2_parse_param(struct f
        case Opt_memory_recursiveprot:
                ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
                return 0;
+       case Opt_memory_hugetlb_accounting:
+               ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+               return 0;
        }
        return -EINVAL;
  }
@@@ -1966,6 -1965,11 +1971,11 @@@ static void apply_cgroup_root_flags(uns
                        cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
                else
                        cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+               if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+                       cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+               else
+                       cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
        }
  }
  
@@@ -1979,6 -1983,8 +1989,8 @@@ static int cgroup_show_options(struct s
                seq_puts(seq, ",memory_localevents");
        if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
                seq_puts(seq, ",memory_recursiveprot");
+       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+               seq_puts(seq, ",memory_hugetlb_accounting");
        return 0;
  }
  
@@@ -2249,9 -2255,9 +2261,9 @@@ static int cgroup_init_fs_context(struc
        fc->user_ns = get_user_ns(ctx->ns->user_ns);
        fc->global = true;
  
 -#ifdef CONFIG_CGROUP_FAVOR_DYNMODS
 -      ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 -#endif
 +      if (have_favordynmods)
 +              ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 +
        return 0;
  }
  
@@@ -4923,11 -4929,9 +4935,11 @@@ repeat
  void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
                         struct css_task_iter *it)
  {
 +      unsigned long irqflags;
 +
        memset(it, 0, sizeof(*it));
  
 -      spin_lock_irq(&css_set_lock);
 +      spin_lock_irqsave(&css_set_lock, irqflags);
  
        it->ss = css->ss;
        it->flags = flags;
  
        css_task_iter_advance(it);
  
 -      spin_unlock_irq(&css_set_lock);
 +      spin_unlock_irqrestore(&css_set_lock, irqflags);
  }
  
  /**
   */
  struct task_struct *css_task_iter_next(struct css_task_iter *it)
  {
 +      unsigned long irqflags;
 +
        if (it->cur_task) {
                put_task_struct(it->cur_task);
                it->cur_task = NULL;
        }
  
 -      spin_lock_irq(&css_set_lock);
 +      spin_lock_irqsave(&css_set_lock, irqflags);
  
        /* @it may be half-advanced by skips, finish advancing */
        if (it->flags & CSS_TASK_ITER_SKIPPED)
                css_task_iter_advance(it);
        }
  
 -      spin_unlock_irq(&css_set_lock);
 +      spin_unlock_irqrestore(&css_set_lock, irqflags);
  
        return it->cur_task;
  }
   */
  void css_task_iter_end(struct css_task_iter *it)
  {
 +      unsigned long irqflags;
 +
        if (it->cur_cset) {
 -              spin_lock_irq(&css_set_lock);
 +              spin_lock_irqsave(&css_set_lock, irqflags);
                list_del(&it->iters_node);
                put_css_set_locked(it->cur_cset);
 -              spin_unlock_irq(&css_set_lock);
 +              spin_unlock_irqrestore(&css_set_lock, irqflags);
        }
  
        if (it->cur_dcset)
@@@ -6133,7 -6133,7 +6145,7 @@@ int __init cgroup_init(void
  
                if (cgroup1_ssid_disabled(ssid))
                        pr_info("Disabling %s control group subsystem in v1 mounts\n",
 -                              ss->name);
 +                              ss->legacy_name);
  
                cgrp_dfl_root.subsys_mask |= 1 << ss->id;
  
@@@ -6776,12 -6776,6 +6788,12 @@@ static int __init enable_cgroup_debug(c
  }
  __setup("cgroup_debug", enable_cgroup_debug);
  
 +static int __init cgroup_favordynmods_setup(char *str)
 +{
 +      return (kstrtobool(str, &have_favordynmods) == 0);
 +}
 +__setup("cgroup_favordynmods=", cgroup_favordynmods_setup);
 +
  /**
   * css_tryget_online_from_dir - get corresponding css from a cgroup dentry
   * @dentry: directory dentry of interest
@@@ -7068,7 -7062,8 +7080,8 @@@ static ssize_t features_show(struct kob
                        "nsdelegate\n"
                        "favordynmods\n"
                        "memory_localevents\n"
-                       "memory_recursiveprot\n");
+                       "memory_recursiveprot\n"
+                       "memory_hugetlb_accounting\n");
  }
  static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
  
diff --combined kernel/exit.c
index 2b4a232f2f68fa622cda5e1cf0a134eb03f9c831,3cdbe797008fa302fe61ed8d4736ea63ca8077ac..61ebba96909b98408d4bb802436385e4b25aad75
@@@ -74,8 -74,6 +74,8 @@@
  #include <asm/unistd.h>
  #include <asm/mmu_context.h>
  
 +#include "exit.h"
 +
  /*
   * The default value should be high enough to not crash a system that randomly
   * crashes its kernel from time to time, but low enough to at least not permit
@@@ -541,7 -539,6 +541,6 @@@ static void exit_mm(void
        exit_mm_release(current, mm);
        if (!mm)
                return;
-       sync_mm_rss(mm);
        mmap_read_lock(mm);
        mmgrab_lazy_tlb(mm);
        BUG_ON(mm != current->active_mm);
@@@ -831,9 -828,6 +830,6 @@@ void __noreturn do_exit(long code
        io_uring_files_cancel();
        exit_signals(tsk);  /* sets PF_EXITING */
  
-       /* sync mm's RSS info before statistics gathering */
-       if (tsk->mm)
-               sync_mm_rss(tsk->mm);
        acct_update_integrals(tsk);
        group_dead = atomic_dec_and_test(&tsk->signal->live);
        if (group_dead) {
@@@ -1039,6 -1033,26 +1035,6 @@@ SYSCALL_DEFINE1(exit_group, int, error_
        return 0;
  }
  
 -struct waitid_info {
 -      pid_t pid;
 -      uid_t uid;
 -      int status;
 -      int cause;
 -};
 -
 -struct wait_opts {
 -      enum pid_type           wo_type;
 -      int                     wo_flags;
 -      struct pid              *wo_pid;
 -
 -      struct waitid_info      *wo_info;
 -      int                     wo_stat;
 -      struct rusage           *wo_rusage;
 -
 -      wait_queue_entry_t              child_wait;
 -      int                     notask_error;
 -};
 -
  static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
  {
        return  wo->wo_type == PIDTYPE_MAX ||
@@@ -1502,17 -1516,6 +1498,17 @@@ static int ptrace_do_wait(struct wait_o
        return 0;
  }
  
 +bool pid_child_should_wake(struct wait_opts *wo, struct task_struct *p)
 +{
 +      if (!eligible_pid(wo, p))
 +              return false;
 +
 +      if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private != p->parent)
 +              return false;
 +
 +      return true;
 +}
 +
  static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode,
                                int sync, void *key)
  {
                                                child_wait);
        struct task_struct *p = key;
  
 -      if (!eligible_pid(wo, p))
 -              return 0;
 +      if (pid_child_should_wake(wo, p))
 +              return default_wake_function(wait, mode, sync, key);
  
 -      if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent)
 -              return 0;
 -
 -      return default_wake_function(wait, mode, sync, key);
 +      return 0;
  }
  
  void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
@@@ -1572,10 -1578,16 +1568,10 @@@ static int do_wait_pid(struct wait_opt
        return 0;
  }
  
 -static long do_wait(struct wait_opts *wo)
 +long __do_wait(struct wait_opts *wo)
  {
 -      int retval;
 -
 -      trace_sched_process_wait(wo->wo_pid);
 +      long retval;
  
 -      init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
 -      wo->child_wait.private = current;
 -      add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
 -repeat:
        /*
         * If there is nothing that can match our criteria, just get out.
         * We will clear ->notask_error to zero if we see any child that
           (!wo->wo_pid || !pid_has_task(wo->wo_pid, wo->wo_type)))
                goto notask;
  
 -      set_current_state(TASK_INTERRUPTIBLE);
        read_lock(&tasklist_lock);
  
        if (wo->wo_type == PIDTYPE_PID) {
                retval = do_wait_pid(wo);
                if (retval)
 -                      goto end;
 +                      return retval;
        } else {
                struct task_struct *tsk = current;
  
                do {
                        retval = do_wait_thread(wo, tsk);
                        if (retval)
 -                              goto end;
 +                              return retval;
  
                        retval = ptrace_do_wait(wo, tsk);
                        if (retval)
 -                              goto end;
 +                              return retval;
  
                        if (wo->wo_flags & __WNOTHREAD)
                                break;
  
  notask:
        retval = wo->notask_error;
 -      if (!retval && !(wo->wo_flags & WNOHANG)) {
 -              retval = -ERESTARTSYS;
 -              if (!signal_pending(current)) {
 -                      schedule();
 -                      goto repeat;
 -              }
 -      }
 -end:
 +      if (!retval && !(wo->wo_flags & WNOHANG))
 +              return -ERESTARTSYS;
 +
 +      return retval;
 +}
 +
 +static long do_wait(struct wait_opts *wo)
 +{
 +      int retval;
 +
 +      trace_sched_process_wait(wo->wo_pid);
 +
 +      init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
 +      wo->child_wait.private = current;
 +      add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
 +
 +      do {
 +              set_current_state(TASK_INTERRUPTIBLE);
 +              retval = __do_wait(wo);
 +              if (retval != -ERESTARTSYS)
 +                      break;
 +              if (signal_pending(current))
 +                      break;
 +              schedule();
 +      } while (1);
 +
        __set_current_state(TASK_RUNNING);
        remove_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
        return retval;
  }
  
 -static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
 -                        int options, struct rusage *ru)
 +int kernel_waitid_prepare(struct wait_opts *wo, int which, pid_t upid,
 +                        struct waitid_info *infop, int options,
 +                        struct rusage *ru)
  {
 -      struct wait_opts wo;
 +      unsigned int f_flags = 0;
        struct pid *pid = NULL;
        enum pid_type type;
 -      long ret;
 -      unsigned int f_flags = 0;
  
        if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED|
                        __WNOTHREAD|__WCLONE|__WALL))
                return -EINVAL;
        }
  
 -      wo.wo_type      = type;
 -      wo.wo_pid       = pid;
 -      wo.wo_flags     = options;
 -      wo.wo_info      = infop;
 -      wo.wo_rusage    = ru;
 +      wo->wo_type     = type;
 +      wo->wo_pid      = pid;
 +      wo->wo_flags    = options;
 +      wo->wo_info     = infop;
 +      wo->wo_rusage   = ru;
        if (f_flags & O_NONBLOCK)
 -              wo.wo_flags |= WNOHANG;
 +              wo->wo_flags |= WNOHANG;
 +
 +      return 0;
 +}
 +
 +static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
 +                        int options, struct rusage *ru)
 +{
 +      struct wait_opts wo;
 +      long ret;
 +
 +      ret = kernel_waitid_prepare(&wo, which, upid, infop, options, ru);
 +      if (ret)
 +              return ret;
  
        ret = do_wait(&wo);
 -      if (!ret && !(options & WNOHANG) && (f_flags & O_NONBLOCK))
 +      if (!ret && !(options & WNOHANG) && (wo.wo_flags & WNOHANG))
                ret = -EAGAIN;
  
 -      put_pid(pid);
 +      put_pid(wo.wo_pid);
        return ret;
  }
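
The do_wait() rewrite above moves the child scan into __do_wait() and drives it from the classic lost-wakeup-safe loop: mark the task TASK_INTERRUPTIBLE, re-check the condition, and only then schedule(). A condensed sketch of that loop shape using the generic prepare_to_wait()/finish_wait() helpers rather than the custom wait-queue callback the real code installs; example_wq and example_done are hypothetical:

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wq);
static bool example_done;	/* set by the waker before wake_up(&example_wq) */

static int example_wait(void)
{
	DEFINE_WAIT(wait);
	int ret = 0;

	for (;;) {
		/* Queue ourselves and mark sleeping before testing the condition. */
		prepare_to_wait(&example_wq, &wait, TASK_INTERRUPTIBLE);
		if (READ_ONCE(example_done))
			break;
		if (signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}
		schedule();	/* actually sleep until the waker's wake_up() */
	}
	finish_wait(&example_wq, &wait);
	return ret;
}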
  
diff --combined kernel/fork.c
index 70e301b63a7bdb58502ca8c997afa2a37af4cac1,1e6c656e08577ded42155e907d177b39cc00dc26..373fa2f739bc41ced8dc9074d84ec1ce5336483a
@@@ -733,7 -733,7 +733,7 @@@ static __latent_entropy int dup_mmap(st
  
                        get_file(file);
                        i_mmap_lock_write(mapping);
-                       if (tmp->vm_flags & VM_SHARED)
+                       if (vma_is_shared_maywrite(tmp))
                                mapping_allow_writable(mapping);
                        flush_dcache_mmap_lock(mapping);
                        /* insert tmp into the share list, just after mpnt */
@@@ -1288,7 -1288,7 +1288,7 @@@ static struct mm_struct *mm_init(struc
        hugetlb_count_init(mm);
  
        if (current->mm) {
-               mm->flags = current->mm->flags & MMF_INIT_MASK;
+               mm->flags = mmf_init_flags(current->mm->flags);
                mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
        } else {
                mm->flags = default_dump_filter;
@@@ -1393,8 -1393,6 +1393,8 @@@ EXPORT_SYMBOL_GPL(mmput_async)
  
  /**
   * set_mm_exe_file - change a reference to the mm's executable file
 + * @mm: The mm to change.
 + * @new_exe_file: The new file to use.
   *
   * This changes mm's executable file (shown as symlink /proc/[pid]/exe).
   *
@@@ -1434,8 -1432,6 +1434,8 @@@ int set_mm_exe_file(struct mm_struct *m
  
  /**
   * replace_mm_exe_file - replace a reference to the mm's executable file
 + * @mm: The mm to change.
 + * @new_exe_file: The new file to use.
   *
   * This changes mm's executable file (shown as symlink /proc/[pid]/exe).
   *
@@@ -1487,7 -1483,6 +1487,7 @@@ int replace_mm_exe_file(struct mm_struc
  
  /**
   * get_mm_exe_file - acquire a reference to the mm's executable file
 + * @mm: The mm of interest.
   *
   * Returns %NULL if mm has no associated executable file.
   * User must release file via fput().
@@@ -1497,14 -1492,15 +1497,14 @@@ struct file *get_mm_exe_file(struct mm_
        struct file *exe_file;
  
        rcu_read_lock();
 -      exe_file = rcu_dereference(mm->exe_file);
 -      if (exe_file && !get_file_rcu(exe_file))
 -              exe_file = NULL;
 +      exe_file = get_file_rcu(&mm->exe_file);
        rcu_read_unlock();
        return exe_file;
  }
  
  /**
   * get_task_exe_file - acquire a reference to the task's executable file
 + * @task: The task.
   *
   * Returns %NULL if task's mm (if any) has no associated executable file or
   * this is a kernel thread with borrowed mm (see the comment above get_task_mm).
@@@ -1527,7 -1523,6 +1527,7 @@@ struct file *get_task_exe_file(struct t
  
  /**
   * get_task_mm - acquire a reference to the task's mm
 + * @task: The task.
   *
   * Returns %NULL if the task has no mm.  Checks PF_KTHREAD (meaning
   * this kernel workthread has transiently adopted a user mm with use_mm,
@@@ -2107,11 -2102,11 +2107,11 @@@ const struct file_operations pidfd_fop
   * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
   * @pid:   the struct pid for which to create a pidfd
   * @flags: flags of the new @pidfd
 - * @pidfd: the pidfd to return
 + * @ret: Where to return the file for the pidfd.
   *
   * Allocate a new file that stashes @pid and reserve a new pidfd number in the
   * caller's file descriptor table. The pidfd is reserved but not installed yet.
 -
 + *
   * The helper doesn't perform checks on @pid which makes it useful for pidfds
   * created via CLONE_PIDFD where @pid has no task attached when the pidfd and
   * pidfd file are prepared.
@@@ -2158,7 -2153,7 +2158,7 @@@ static int __pidfd_prepare(struct pid *
   * pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
   * @pid:   the struct pid for which to create a pidfd
   * @flags: flags of the new @pidfd
 - * @pidfd: the pidfd to return
 + * @ret: Where to return the pidfd.
   *
   * Allocate a new file that stashes @pid and reserve a new pidfd number in the
   * caller's file descriptor table. The pidfd is reserved but not installed yet.
@@@ -2411,10 -2406,6 +2411,6 @@@ __latent_entropy struct task_struct *co
        p->io_uring = NULL;
  #endif
  
- #if defined(SPLIT_RSS_COUNTING)
-       memset(&p->rss_stat, 0, sizeof(p->rss_stat));
- #endif
        p->default_timer_slack_ns = current->timer_slack_ns;
  
  #ifdef CONFIG_PSI
@@@ -3149,7 -3140,7 +3145,7 @@@ static inline bool clone3_stack_valid(s
                if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
                        return false;
  
 -#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
 +#if !defined(CONFIG_STACK_GROWSUP)
                kargs->stack += kargs->stack_size;
  #endif
        }
@@@ -3186,7 -3177,7 +3182,7 @@@ static bool clone3_args_valid(struct ke
  }
  
  /**
 - * clone3 - create a new process with specific properties
 + * sys_clone3 - create a new process with specific properties
   * @uargs: argument structure
   * @size:  size of @uargs
   *
diff --combined kernel/rcu/tree.c
index 7005247260794b18579c0c26356c52e53313b5ac,06e2ed495c026c093e2e84ffda9cb3f53778c51b..d3a97e1290203f8a00c4dc97fa0087676926b08e
@@@ -31,7 -31,6 +31,7 @@@
  #include <linux/bitops.h>
  #include <linux/export.h>
  #include <linux/completion.h>
 +#include <linux/kmemleak.h>
  #include <linux/moduleparam.h>
  #include <linux/panic.h>
  #include <linux/panic_notifier.h>
@@@ -1261,7 -1260,7 +1261,7 @@@ EXPORT_SYMBOL_GPL(rcu_gp_slow_register)
  /* Unregister a counter, with NULL for not caring which. */
  void rcu_gp_slow_unregister(atomic_t *rgssp)
  {
 -      WARN_ON_ONCE(rgssp && rgssp != rcu_gp_slow_suppress);
 +      WARN_ON_ONCE(rgssp && rgssp != rcu_gp_slow_suppress && rcu_gp_slow_suppress != NULL);
  
        WRITE_ONCE(rcu_gp_slow_suppress, NULL);
  }
@@@ -1557,22 -1556,10 +1557,22 @@@ static bool rcu_gp_fqs_check_wake(int *
   */
  static void rcu_gp_fqs(bool first_time)
  {
 +      int nr_fqs = READ_ONCE(rcu_state.nr_fqs_jiffies_stall);
        struct rcu_node *rnp = rcu_get_root();
  
        WRITE_ONCE(rcu_state.gp_activity, jiffies);
        WRITE_ONCE(rcu_state.n_force_qs, rcu_state.n_force_qs + 1);
 +
 +      WARN_ON_ONCE(nr_fqs > 3);
 +      /* Only count down nr_fqs for stall purposes if jiffies moves. */
 +      if (nr_fqs) {
 +              if (nr_fqs == 1) {
 +                      WRITE_ONCE(rcu_state.jiffies_stall,
 +                                 jiffies + rcu_jiffies_till_stall_check());
 +              }
 +              WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, --nr_fqs);
 +      }
 +
        if (first_time) {
                /* Collect dyntick-idle snapshots. */
                force_qs_rnp(dyntick_save_progress_counter);
@@@ -2148,7 -2135,6 +2148,7 @@@ static void rcu_do_batch(struct rcu_dat
                trace_rcu_invoke_callback(rcu_state.name, rhp);
  
                f = rhp->func;
 +              debug_rcu_head_callback(rhp);
                WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
                f(rhp);
  
@@@ -2727,7 -2713,7 +2727,7 @@@ __call_rcu_common(struct rcu_head *head
   */
  void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func)
  {
 -      return __call_rcu_common(head, func, false);
 +      __call_rcu_common(head, func, false);
  }
  EXPORT_SYMBOL_GPL(call_rcu_hurry);
  #endif
   */
  void call_rcu(struct rcu_head *head, rcu_callback_t func)
  {
 -      return __call_rcu_common(head, func, IS_ENABLED(CONFIG_RCU_LAZY));
 +      __call_rcu_common(head, func, IS_ENABLED(CONFIG_RCU_LAZY));
  }
  EXPORT_SYMBOL_GPL(call_rcu);
  
@@@ -3402,14 -3388,6 +3402,14 @@@ void kvfree_call_rcu(struct rcu_head *h
                success = true;
        }
  
 +      /*
 +       * The kvfree_rcu() caller considers the pointer freed at this point
 +       * and likely removes any references to it. Since the actual slab
 +       * freeing (and kmemleak_free()) is deferred, tell kmemleak to ignore
 +       * this object (no scanning or false positives reporting).
 +       */
 +      kmemleak_ignore(ptr);
 +
        // Set timer to drain after KFREE_DRAIN_JIFFIES.
        if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
                schedule_delayed_monitor_work(krcp);
@@@ -3471,13 -3449,6 +3471,6 @@@ kfree_rcu_shrink_scan(struct shrinker *
        return freed == 0 ? SHRINK_STOP : freed;
  }
  
- static struct shrinker kfree_rcu_shrinker = {
-       .count_objects = kfree_rcu_shrink_count,
-       .scan_objects = kfree_rcu_shrink_scan,
-       .batch = 0,
-       .seeks = DEFAULT_SEEKS,
- };
  void __init kfree_rcu_scheduler_running(void)
  {
        int cpu;
@@@ -4105,82 -4076,6 +4098,82 @@@ retry
  }
  EXPORT_SYMBOL_GPL(rcu_barrier);
  
 +static unsigned long rcu_barrier_last_throttle;
 +
 +/**
 + * rcu_barrier_throttled - Do rcu_barrier(), but limit to one per second
 + *
 + * This can be thought of as guard rails around rcu_barrier() that
 + * permits unrestricted userspace use, at least assuming the hardware's
 + * permit unrestricted userspace use, at least assuming the hardware's
 + * rcu_barrier() system-wide from use of this function, which means that
 + * callers might needlessly wait a second or three.
 + *
 + * This is intended for use by test suites to avoid OOM by flushing RCU
 + * callbacks from the previous test before starting the next.  See the
 + * rcutree.do_rcu_barrier module parameter for more information.
 + *
 + * Why not simply make rcu_barrier() more scalable?  That might be
 + * the eventual endpoint, but let's keep it simple for the time being.
 + * Note that the module parameter infrastructure serializes calls to a
 + * given .set() function, but should concurrent .set() invocation ever be
 + * possible, we are ready!
 + */
 +static void rcu_barrier_throttled(void)
 +{
 +      unsigned long j = jiffies;
 +      unsigned long old = READ_ONCE(rcu_barrier_last_throttle);
 +      unsigned long s = rcu_seq_snap(&rcu_state.barrier_sequence);
 +
 +      while (time_in_range(j, old, old + HZ / 16) ||
 +             !try_cmpxchg(&rcu_barrier_last_throttle, &old, j)) {
 +              schedule_timeout_idle(HZ / 16);
 +              if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
 +                      smp_mb(); /* caller's subsequent code after above check. */
 +                      return;
 +              }
 +              j = jiffies;
 +              old = READ_ONCE(rcu_barrier_last_throttle);
 +      }
 +      rcu_barrier();
 +}
 +
 +/*
 + * Invoke rcu_barrier_throttled() when a rcutree.do_rcu_barrier
 + * request arrives.  We insist on a true value to allow for possible
 + * future expansion.
 + */
 +static int param_set_do_rcu_barrier(const char *val, const struct kernel_param *kp)
 +{
 +      bool b;
 +      int ret;
 +
 +      if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
 +              return -EAGAIN;
 +      ret = kstrtobool(val, &b);
 +      if (!ret && b) {
 +              atomic_inc((atomic_t *)kp->arg);
 +              rcu_barrier_throttled();
 +              atomic_dec((atomic_t *)kp->arg);
 +      }
 +      return ret;
 +}
 +
 +/*
 + * Output the number of outstanding rcutree.do_rcu_barrier requests.
 + */
 +static int param_get_do_rcu_barrier(char *buffer, const struct kernel_param *kp)
 +{
 +      return sprintf(buffer, "%d\n", atomic_read((atomic_t *)kp->arg));
 +}
 +
 +static const struct kernel_param_ops do_rcu_barrier_ops = {
 +      .set = param_set_do_rcu_barrier,
 +      .get = param_get_do_rcu_barrier,
 +};
 +static atomic_t do_rcu_barrier;
 +module_param_cb(do_rcu_barrier, &do_rcu_barrier_ops, &do_rcu_barrier, 0644);
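 +
 +rcu_barrier_throttled() above gates a heavyweight operation behind one shared timestamp: a caller proceeds only if the previous run is old enough and it wins the try_cmpxchg() race to claim the new window; losers sleep, and may return early once they see that someone else's rcu_barrier() already covered them. A small user-space analogue of just the claim-the-window step, assuming C11 <stdatomic.h> and a hypothetical expensive_flush():

#include <stdatomic.h>
#include <time.h>
#include <unistd.h>

static _Atomic time_t last_run;		/* timestamp of the last accepted run */

static void expensive_flush(void)	/* hypothetical heavyweight operation */
{
	sleep(1);
}

/* Run expensive_flush() at most once per 'interval' seconds, system-wide. */
static void throttled_flush(time_t interval)
{
	for (;;) {
		time_t old = atomic_load(&last_run);
		time_t now = time(NULL);

		if (now - old < interval) {
			sleep(1);	/* window still claimed: wait and retry */
			continue;
		}
		/* Only the caller that wins the compare-exchange does the work. */
		if (atomic_compare_exchange_strong(&last_run, &old, now)) {
			expensive_flush();
			return;
		}
		/* Lost the race: another caller claimed the window; go around. */
	}
}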
 +
  /*
   * Compute the mask of online CPUs for the specified rcu_node structure.
   * This will not be stable unless the rcu_node structure's ->lock is
@@@ -4228,7 -4123,7 +4221,7 @@@ bool rcu_lockdep_current_cpu_online(voi
        rdp = this_cpu_ptr(&rcu_data);
        /*
         * Strictly, we care here about the case where the current CPU is
 -       * in rcu_cpu_starting() and thus has an excuse for rdp->grpmask
 +       * in rcutree_report_cpu_starting() and thus has an excuse for rdp->grpmask
         * not being up to date. So arch_spin_is_locked() might have a
         * false positive if it's held by some *other* CPU, but that's
         * OK because that just means a false *negative* on the warning.
@@@ -4249,6 -4144,25 +4242,6 @@@ static bool rcu_init_invoked(void
        return !!rcu_state.n_online_cpus;
  }
  
 -/*
 - * Near the end of the offline process.  Trace the fact that this CPU
 - * is going offline.
 - */
 -int rcutree_dying_cpu(unsigned int cpu)
 -{
 -      bool blkd;
 -      struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 -      struct rcu_node *rnp = rdp->mynode;
 -
 -      if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
 -              return 0;
 -
 -      blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
 -      trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
 -                             blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
 -      return 0;
 -}
 -
  /*
   * All CPUs for the specified rcu_node structure have gone offline,
   * and all tasks that were preempted within an RCU read-side critical
@@@ -4294,6 -4208,23 +4287,6 @@@ static void rcu_cleanup_dead_rnp(struc
        }
  }
  
 -/*
 - * The CPU has been completely removed, and some other CPU is reporting
 - * this fact from process context.  Do the remainder of the cleanup.
 - * There can only be one CPU hotplug operation at a time, so no need for
 - * explicit locking.
 - */
 -int rcutree_dead_cpu(unsigned int cpu)
 -{
 -      if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
 -              return 0;
 -
 -      WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
 -      // Stop-machine done, so allow nohz_full to disable tick.
 -      tick_dep_clear(TICK_DEP_BIT_RCU);
 -      return 0;
 -}
 -
  /*
   * Propagate ->qsinitmask bits up the rcu_node tree to account for the
   * first CPU in a given leaf rcu_node structure coming online.  The caller
@@@ -4446,6 -4377,29 +4439,6 @@@ int rcutree_online_cpu(unsigned int cpu
        return 0;
  }
  
 -/*
 - * Near the beginning of the process.  The CPU is still very much alive
 - * with pretty much all services enabled.
 - */
 -int rcutree_offline_cpu(unsigned int cpu)
 -{
 -      unsigned long flags;
 -      struct rcu_data *rdp;
 -      struct rcu_node *rnp;
 -
 -      rdp = per_cpu_ptr(&rcu_data, cpu);
 -      rnp = rdp->mynode;
 -      raw_spin_lock_irqsave_rcu_node(rnp, flags);
 -      rnp->ffmask &= ~rdp->grpmask;
 -      raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 -
 -      rcutree_affinity_setting(cpu, cpu);
 -
 -      // nohz_full CPUs need the tick for stop-machine to work quickly
 -      tick_dep_set(TICK_DEP_BIT_RCU);
 -      return 0;
 -}
 -
  /*
   * Mark the specified CPU as being online so that subsequent grace periods
   * (both expedited and normal) will wait on it.  Note that this means that
   * from the incoming CPU rather than from the cpuhp_step mechanism.
   * This is because this function must be invoked at a precise location.
   * This incoming CPU must not have enabled interrupts yet.
 + *
 + * This mirrors the effects of rcutree_report_cpu_dead().
   */
 -void rcu_cpu_starting(unsigned int cpu)
 +void rcutree_report_cpu_starting(unsigned int cpu)
  {
        unsigned long mask;
        struct rcu_data *rdp;
   * Note that this function is special in that it is invoked directly
   * from the outgoing CPU rather than from the cpuhp_step mechanism.
   * This is because this function must be invoked at a precise location.
 + *
 + * This mirrors the effect of rcutree_report_cpu_starting().
   */
 -void rcu_report_dead(unsigned int cpu)
 +void rcutree_report_cpu_dead(void)
  {
 -      unsigned long flags, seq_flags;
 +      unsigned long flags;
        unsigned long mask;
 -      struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 +      struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        struct rcu_node *rnp = rdp->mynode;  /* Outgoing CPU's rdp & rnp. */
  
 +      /*
 +       * IRQS must be disabled from now on and until the CPU dies, or an interrupt
 +       * may introduce a new READ-side critical section while it is actually off the QS masks.
 +       */
 +      lockdep_assert_irqs_disabled();
        // Do any dangling deferred wakeups.
        do_nocb_deferred_wakeup(rdp);
  
  
        /* Remove outgoing CPU from mask in the leaf rcu_node structure. */
        mask = rdp->grpmask;
 -      local_irq_save(seq_flags);
        arch_spin_lock(&rcu_state.ofl_lock);
        raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
        rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq);
        WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
        arch_spin_unlock(&rcu_state.ofl_lock);
 -      local_irq_restore(seq_flags);
 -
        rdp->cpu_started = false;
  }
  
@@@ -4603,60 -4551,7 +4596,60 @@@ void rcutree_migrate_callbacks(int cpu
                  cpu, rcu_segcblist_n_cbs(&rdp->cblist),
                  rcu_segcblist_first_cb(&rdp->cblist));
  }
 -#endif
 +
 +/*
 + * The CPU has been completely removed, and some other CPU is reporting
 + * this fact from process context.  Do the remainder of the cleanup.
 + * There can only be one CPU hotplug operation at a time, so no need for
 + * explicit locking.
 + */
 +int rcutree_dead_cpu(unsigned int cpu)
 +{
 +      WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
 +      // Stop-machine done, so allow nohz_full to disable tick.
 +      tick_dep_clear(TICK_DEP_BIT_RCU);
 +      return 0;
 +}
 +
 +/*
 + * Near the end of the offline process.  Trace the fact that this CPU
 + * is going offline.
 + */
 +int rcutree_dying_cpu(unsigned int cpu)
 +{
 +      bool blkd;
 +      struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 +      struct rcu_node *rnp = rdp->mynode;
 +
 +      blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
 +      trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
 +                             blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
 +      return 0;
 +}
 +
 +/*
 + * Near the beginning of the process.  The CPU is still very much alive
 + * with pretty much all services enabled.
 + */
 +int rcutree_offline_cpu(unsigned int cpu)
 +{
 +      unsigned long flags;
 +      struct rcu_data *rdp;
 +      struct rcu_node *rnp;
 +
 +      rdp = per_cpu_ptr(&rcu_data, cpu);
 +      rnp = rdp->mynode;
 +      raw_spin_lock_irqsave_rcu_node(rnp, flags);
 +      rnp->ffmask &= ~rdp->grpmask;
 +      raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 +
 +      rcutree_affinity_setting(cpu, cpu);
 +
 +      // nohz_full CPUs need the tick for stop-machine to work quickly
 +      tick_dep_set(TICK_DEP_BIT_RCU);
 +      return 0;
 +}
 +#endif /* #ifdef CONFIG_HOTPLUG_CPU */
  
  /*
   * On non-huge systems, use expedited RCU grace periods to make suspend
@@@ -5029,6 -4924,7 +5022,7 @@@ static void __init kfree_rcu_batch_init
  {
        int cpu;
        int i, j;
+       struct shrinker *kfree_rcu_shrinker;
  
        /* Clamp it to [0:100] seconds interval. */
        if (rcu_delay_page_cache_fill_msec < 0 ||
                INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
                krcp->initialized = true;
        }
-       if (register_shrinker(&kfree_rcu_shrinker, "rcu-kfree"))
-               pr_err("Failed to register kfree_rcu() shrinker!\n");
+       kfree_rcu_shrinker = shrinker_alloc(0, "rcu-kfree");
+       if (!kfree_rcu_shrinker) {
+               pr_err("Failed to allocate kfree_rcu() shrinker!\n");
+               return;
+       }
+       kfree_rcu_shrinker->count_objects = kfree_rcu_shrink_count;
+       kfree_rcu_shrinker->scan_objects = kfree_rcu_shrink_scan;
+       shrinker_register(kfree_rcu_shrinker);
  }
  
  void __init rcu_init(void)
        pm_notifier(rcu_pm_notify, 0);
        WARN_ON(num_online_cpus() > 1); // Only one CPU this early in boot.
        rcutree_prepare_cpu(cpu);
 -      rcu_cpu_starting(cpu);
 +      rcutree_report_cpu_starting(cpu);
        rcutree_online_cpu(cpu);
  
        /* Create workqueue for Tree SRCU and for expedited GPs. */
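
kfree_rcu_batch_init() above is switched from a statically defined struct shrinker plus register_shrinker() to the dynamically allocated form: shrinker_alloc() returns the object, the callbacks are filled in, and shrinker_register() publishes it to reclaim. A minimal sketch of that registration sequence with hypothetical count/scan callbacks for an imaginary cache:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/shrinker.h>

/* Hypothetical callbacks; real ones would walk a private object cache. */
static unsigned long example_count(struct shrinker *sh, struct shrink_control *sc)
{
	return 0;		/* nothing reclaimable in this sketch */
}

static unsigned long example_scan(struct shrinker *sh, struct shrink_control *sc)
{
	return SHRINK_STOP;
}

static struct shrinker *example_shrinker;

static int __init example_shrinker_init(void)
{
	example_shrinker = shrinker_alloc(0, "example-cache");
	if (!example_shrinker)
		return -ENOMEM;

	example_shrinker->count_objects = example_count;
	example_shrinker->scan_objects = example_scan;
	/* Only here does the shrinker become visible to memory reclaim. */
	shrinker_register(example_shrinker);
	return 0;
}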
diff --combined kernel/sched/fair.c
index 8767988242ee3369d5e8bbb4ef594a32c35eab0c,d1a765cdf6e41933a54d51a368840232e53f033d..2048138ce54b574a3ba56b9f6bf7b1cefac1fd32
@@@ -51,6 -51,8 +51,6 @@@
  
  #include <asm/switch_to.h>
  
 -#include <linux/sched/cond_resched.h>
 -
  #include "sched.h"
  #include "stats.h"
  #include "autogroup.h"
@@@ -76,6 -78,12 +76,6 @@@ unsigned int sysctl_sched_tunable_scali
  unsigned int sysctl_sched_base_slice                  = 750000ULL;
  static unsigned int normalized_sysctl_sched_base_slice        = 750000ULL;
  
 -/*
 - * After fork, child runs first. If set to 0 (default) then
 - * parent will (try to) run first.
 - */
 -unsigned int sysctl_sched_child_runs_first __read_mostly;
 -
  const_debug unsigned int sysctl_sched_migration_cost  = 500000UL;
  
  int sched_thermal_decay_shift;
@@@ -137,6 -145,13 +137,6 @@@ static unsigned int sysctl_numa_balanci
  
  #ifdef CONFIG_SYSCTL
  static struct ctl_table sched_fair_sysctls[] = {
 -      {
 -              .procname       = "sched_child_runs_first",
 -              .data           = &sysctl_sched_child_runs_first,
 -              .maxlen         = sizeof(unsigned int),
 -              .mode           = 0644,
 -              .proc_handler   = proc_dointvec,
 -      },
  #ifdef CONFIG_CFS_BANDWIDTH
        {
                .procname       = "sched_cfs_bandwidth_slice_us",
@@@ -649,10 -664,6 +649,10 @@@ void avg_vruntime_update(struct cfs_rq 
        cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
  }
  
 +/*
 + * Specifically: avg_runtime() + 0 must result in entity_eligible() := true
 + * For this to be so, the result of this function must have a left bias.
 + */
  u64 avg_vruntime(struct cfs_rq *cfs_rq)
  {
        struct sched_entity *curr = cfs_rq->curr;
                load += weight;
        }
  
 -      if (load)
 +      if (load) {
 +              /* sign flips effective floor / ceil */
 +              if (avg < 0)
 +                      avg -= (load - 1);
                avg = div_s64(avg, load);
 +      }
  
        return cfs_rq->min_vruntime + avg;
  }
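
The added lines in avg_vruntime() above compensate for the fact that div_s64(), like C division, truncates toward zero: a negative weighted sum would otherwise round toward zero instead of down, losing the left bias the comment above the function asks for. Subtracting (load - 1) from a negative numerator turns truncation into a floor. A tiny stand-alone check of that identity, assuming a positive divisor and nothing beyond standard C:

#include <assert.h>
#include <stdio.h>

/* Signed division that rounds toward negative infinity, mirroring the
 * "avg -= (load - 1)" adjustment before div_s64() in avg_vruntime(). */
static long long div_floor(long long num, long long den)
{
	if (num < 0)
		num -= den - 1;	/* bias so that truncation lands on the floor */
	return num / den;	/* C '/' truncates toward zero */
}

int main(void)
{
	assert(-7 / 3 == -2);			/* truncation rounds up for negatives */
	assert(div_floor(-7, 3) == -3);		/* floor division rounds down */
	assert(div_floor(7, 3) == 2);		/* non-negative values are unchanged */
	printf("floor(-7/3) = %lld\n", div_floor(-7, 3));
	return 0;
}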
@@@ -857,16 -864,14 +857,16 @@@ struct sched_entity *__pick_first_entit
   *
   * Which allows an EDF like search on (sub)trees.
   */
 -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 +static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
  {
        struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
        struct sched_entity *curr = cfs_rq->curr;
        struct sched_entity *best = NULL;
 +      struct sched_entity *best_left = NULL;
  
        if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
                curr = NULL;
 +      best = curr;
  
        /*
         * Once selected, run a task until it either becomes non-eligible or
                }
  
                /*
 -               * If this entity has an earlier deadline than the previous
 -               * best, take this one. If it also has the earliest deadline
 -               * of its subtree, we're done.
 +               * Now we heap search eligible trees for the best (min_)deadline
                 */
 -              if (!best || deadline_gt(deadline, best, se)) {
 +              if (!best || deadline_gt(deadline, best, se))
                        best = se;
 -                      if (best->deadline == best->min_deadline)
 -                              break;
 -              }
  
                /*
 -               * If the earlest deadline in this subtree is in the fully
 -               * eligible left half of our space, go there.
 +               * Every se in a left branch is eligible, keep track of the
 +               * branch with the best min_deadline
                 */
 +              if (node->rb_left) {
 +                      struct sched_entity *left = __node_2_se(node->rb_left);
 +
 +                      if (!best_left || deadline_gt(min_deadline, best_left, left))
 +                              best_left = left;
 +
 +                      /*
 +                       * min_deadline is in the left branch. rb_left and all
 +                       * descendants are eligible, so immediately switch to the second
 +                       * loop.
 +                       */
 +                      if (left->min_deadline == se->min_deadline)
 +                              break;
 +              }
 +
 +              /* min_deadline is at this node, no need to look right */
 +              if (se->deadline == se->min_deadline)
 +                      break;
 +
 +              /* else min_deadline is in the right branch. */
 +              node = node->rb_right;
 +      }
 +
 +      /*
 +       * We ran into an eligible node which is itself the best.
 +       * (Or nr_running == 0 and both are NULL)
 +       */
 +      if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
 +              return best;
 +
 +      /*
 +       * Now best_left and all of its children are eligible, and we are just
 +       * looking for deadline == min_deadline
 +       */
 +      node = &best_left->run_node;
 +      while (node) {
 +              struct sched_entity *se = __node_2_se(node);
 +
 +              /* min_deadline is the current node */
 +              if (se->deadline == se->min_deadline)
 +                      return se;
 +
 +              /* min_deadline is in the left branch */
                if (node->rb_left &&
                    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
                        node = node->rb_left;
                        continue;
                }
  
 +              /* else min_deadline is in the right branch */
                node = node->rb_right;
        }
 +      return NULL;
 +}
  
 -      if (!best || (curr && deadline_gt(deadline, best, curr)))
 -              best = curr;
 +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 +{
 +      struct sched_entity *se = __pick_eevdf(cfs_rq);
  
 -      if (unlikely(!best)) {
 +      if (!se) {
                struct sched_entity *left = __pick_first_entity(cfs_rq);
                if (left) {
                        pr_err("EEVDF scheduling fail, picking leftmost\n");
                }
        }
  
 -      return best;
 +      return se;
  }
  
  #ifdef CONFIG_SCHED_DEBUG
@@@ -1759,12 -1722,12 +1759,12 @@@ static bool pgdat_free_space_enough(str
   * The smaller the hint page fault latency, the higher the possibility
   * for the page to be hot.
   */
- static int numa_hint_fault_latency(struct page *page)
+ static int numa_hint_fault_latency(struct folio *folio)
  {
        int last_time, time;
  
        time = jiffies_to_msecs(jiffies);
-       last_time = xchg_page_access_time(page, time);
+       last_time = folio_xchg_access_time(folio, time);
  
        return (time - last_time) & PAGE_ACCESS_TIME_MASK;
  }
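
numa_hint_fault_latency() above keeps the previous access time in a narrow field and computes the latency as (time - last_time) & PAGE_ACCESS_TIME_MASK, which stays correct across wraparound of that field as long as the real elapsed time fits in it. A small stand-alone illustration of the masked-delta idea, using a hypothetical 16-bit width rather than the kernel's PAGE_ACCESS_TIME_MASK:

#include <assert.h>
#include <stdio.h>

#define TIME_MASK 0xffffu	/* hypothetical 16-bit access-time field */

/* Elapsed time between two truncated timestamps; valid while the real
 * elapsed time is below the field's range (65536 ms here). */
static unsigned int masked_delta(unsigned int now, unsigned int last)
{
	return (now - last) & TIME_MASK;
}

int main(void)
{
	assert(masked_delta(5000, 1000) == 4000);	/* no wrap */
	assert(masked_delta(0x0010, 0xfff0) == 0x20);	/* field wrapped in between */
	printf("delta across wrap: %u\n", masked_delta(0x0010, 0xfff0));
	return 0;
}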
@@@ -1821,7 -1784,7 +1821,7 @@@ static void numa_promotion_adjust_thres
        }
  }
  
- bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
+ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
                                int src_nid, int dst_cpu)
  {
        struct numa_group *ng = deref_curr_numa_group(p);
                numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
  
                th = pgdat->nbp_threshold ? : def_th;
-               latency = numa_hint_fault_latency(page);
+               latency = numa_hint_fault_latency(folio);
                if (latency >= th)
                        return false;
  
                return !numa_promotion_rate_limit(pgdat, rate_limit,
-                                                 thp_nr_pages(page));
+                                                 folio_nr_pages(folio));
        }
  
        this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
-       last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+       last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
  
        if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
            !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
@@@ -2884,7 -2847,19 +2884,7 @@@ static void task_numa_placement(struct 
        }
  
        /* Cannot migrate task to CPU-less node */
 -      if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
 -              int near_nid = max_nid;
 -              int distance, near_distance = INT_MAX;
 -
 -              for_each_node_state(nid, N_CPU) {
 -                      distance = node_distance(max_nid, nid);
 -                      if (distance < near_distance) {
 -                              near_nid = nid;
 -                              near_distance = distance;
 -                      }
 -              }
 -              max_nid = near_nid;
 -      }
 +      max_nid = numa_nearest_node(max_nid, N_CPU);
  
        if (ng) {
                numa_group_count_active_nodes(ng);
@@@ -3155,7 -3130,7 +3155,7 @@@ static void reset_ptenuma_scan(struct t
        p->mm->numa_scan_offset = 0;
  }
  
 -static bool vma_is_accessed(struct vm_area_struct *vma)
 +static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
  {
        unsigned long pids;
        /*
        if (READ_ONCE(current->mm->numa_scan_seq) < 2)
                return true;
  
 -      pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
 -      return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
 +      pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
 +      if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
 +              return true;
 +
 +      /*
 +       * Complete a scan that has already started regardless of PID access, or
 +       * some VMAs may never be scanned in multi-threaded applications:
 +       */
 +      if (mm->numa_scan_offset > vma->vm_start) {
 +              trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_IGNORE_PID);
 +              return true;
 +      }
 +
 +      return false;
  }
  
  #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
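
vma_is_accessed() above keeps an approximate record of which threads recently faulted in a VMA: each PID is hashed to one bit of a word-sized bitmap (pids_active[]), so membership tests are cheap and may give false positives but never miss a PID that was recorded since the last reset. A compact user-space sketch of that hash-into-a-word filter, with a simple multiplicative hash standing in for the kernel's hash_32():

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Fold a PID down to log2(64) = 6 bits, one bit per slot in a 64-bit word. */
static unsigned int pid_hash(uint32_t pid)
{
	return (pid * 2654435761u) >> (32 - 6);
}

static void record_access(uint64_t *pids, uint32_t pid)
{
	*pids |= 1ull << pid_hash(pid);
}

/* False positives are possible (two PIDs may share a bit); false negatives
 * are not, for any PID recorded since the bitmap was last cleared. */
static bool recently_accessed(uint64_t pids, uint32_t pid)
{
	return pids & (1ull << pid_hash(pid));
}

int main(void)
{
	uint64_t pids = 0;

	record_access(&pids, 1234);
	assert(recently_accessed(pids, 1234));
	printf("pid 1234 maps to bit %u\n", pid_hash(1234));
	return 0;
}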
@@@ -3200,8 -3163,6 +3200,8 @@@ static void task_numa_work(struct callb
        unsigned long nr_pte_updates = 0;
        long pages, virtpages;
        struct vma_iterator vmi;
 +      bool vma_pids_skipped;
 +      bool vma_pids_forced = false;
  
        SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
  
         */
        p->node_stamp += 2 * TICK_NSEC;
  
 -      start = mm->numa_scan_offset;
        pages = sysctl_numa_balancing_scan_size;
        pages <<= 20 - PAGE_SHIFT; /* MB in pages */
        virtpages = pages * 8;     /* Scan up to this much virtual space */
  
        if (!mmap_read_trylock(mm))
                return;
 +
 +      /*
 +       * VMAs are skipped if the current PID has not trapped a fault within
 +       * the VMA recently. Allow scanning to be forced if there is no
 +       * suitable VMA remaining.
 +       */
 +      vma_pids_skipped = false;
 +
 +retry_pids:
 +      start = mm->numa_scan_offset;
        vma_iter_init(&vmi, mm, start);
        vma = vma_next(&vmi);
        if (!vma) {
        do {
                if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
                        is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
 +                      trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
                        continue;
                }
  
                 * as migrating the pages will be of marginal benefit.
                 */
                if (!vma->vm_mm ||
 -                  (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
 +                  (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) {
 +                      trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SHARED_RO);
                        continue;
 +              }
  
                /*
                 * Skip inaccessible VMAs to avoid any confusion between
                 * PROT_NONE and NUMA hinting ptes
                 */
 -              if (!vma_is_accessible(vma))
 +              if (!vma_is_accessible(vma)) {
 +                      trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_INACCESSIBLE);
                        continue;
 +              }
  
                /* Initialise new per-VMA NUMAB state. */
                if (!vma->numab_state) {
                                msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
  
                        /* Reset happens after 4 times scan delay of scan start */
 -                      vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 +                      vma->numab_state->pids_active_reset =  vma->numab_state->next_scan +
                                msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 +
 +                      /*
 +                       * Ensure prev_scan_seq does not match numa_scan_seq,
 +                       * to prevent VMAs being skipped prematurely on the
 +                       * first scan:
 +                       */
 +                       vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
                }
  
                /*
                 * delay the scan for new VMAs.
                 */
                if (mm->numa_scan_seq && time_before(jiffies,
 -                                              vma->numab_state->next_scan))
 +                                              vma->numab_state->next_scan)) {
 +                      trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SCAN_DELAY);
                        continue;
 +              }
 +
 +              /* RESET access PIDs regularly for old VMAs. */
 +              if (mm->numa_scan_seq &&
 +                              time_after(jiffies, vma->numab_state->pids_active_reset)) {
 +                      vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
 +                              msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 +                      vma->numab_state->pids_active[0] = READ_ONCE(vma->numab_state->pids_active[1]);
 +                      vma->numab_state->pids_active[1] = 0;
 +              }
  
 -              /* Do not scan the VMA if task has not accessed */
 -              if (!vma_is_accessed(vma))
 +              /* Do not rescan VMAs twice within the same sequence. */
 +              if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
 +                      mm->numa_scan_offset = vma->vm_end;
 +                      trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
                        continue;
 +              }
  
                /*
 -               * RESET access PIDs regularly for old VMAs. Resetting after checking
 -               * vma for recent access to avoid clearing PID info before access..
 +               * Do not scan the VMA if task has not accessed it, unless no other
 +               * VMA candidate exists.
                 */
 -              if (mm->numa_scan_seq &&
 -                              time_after(jiffies, vma->numab_state->next_pid_reset)) {
 -                      vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
 -                              msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 -                      vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
 -                      vma->numab_state->access_pids[1] = 0;
 +              if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
 +                      vma_pids_skipped = true;
 +                      trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
 +                      continue;
                }
  
                do {
  
                        cond_resched();
                } while (end != vma->vm_end);
 +
 +              /* VMA scan is complete, do not scan until next sequence. */
 +              vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
 +
 +              /*
 +               * Only force scan within one VMA at a time, to limit the
 +               * cost of scanning a potentially uninteresting VMA.
 +               */
 +              if (vma_pids_forced)
 +                      break;
        } for_each_vma(vmi, vma);
  
 +      /*
 +       * If no VMAs are remaining and VMAs were skipped due to the PID
 +       * not accessing the VMA previously, then force a scan to ensure
 +       * forward progress:
 +       */
 +      if (!vma && !vma_pids_forced && vma_pids_skipped) {
 +              vma_pids_forced = true;
 +              goto retry_pids;
 +      }
 +
  out:
        /*
         * It is possible to reach the end of the VMA list but the last few
@@@ -3697,8 -3605,6 +3697,8 @@@ static void reweight_entity(struct cfs_
                 */
                deadline = div_s64(deadline * old_weight, weight);
                se->deadline = se->vruntime + deadline;
 +              if (se != cfs_rq->curr)
 +                      min_deadline_cb_propagate(&se->run_node, NULL);
        }
  
  #ifdef CONFIG_SMP
@@@ -3982,8 -3888,7 +3982,8 @@@ static inline bool cfs_rq_is_decayed(st
   */
  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
  {
 -      long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 +      long delta;
 +      u64 now;
  
        /*
         * No need to update load_avg for root_task_group as it is not used.
        if (cfs_rq->tg == &root_task_group)
                return;
  
 +      /*
 +       * For migration heavy workloads, access to tg->load_avg can be
 +       * unbound. Limit the update rate to at most once per ms.
 +       */
 +      now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
 +      if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 +              return;
 +
 +      delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
        if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                atomic_long_add(delta, &cfs_rq->tg->load_avg);
                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 +              cfs_rq->last_update_tg_load_avg = now;
        }
  }
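
update_tg_load_avg() above bounds a cross-CPU atomic update by remembering when it last ran and returning early inside the same millisecond, which is cheap because the timestamp is owned by a single cfs_rq. A minimal sketch of that per-owner rate limit, assuming a hypothetical publish() and CLOCK_MONOTONIC in place of sched_clock_cpu():

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NSEC_PER_MSEC 1000000ull

static uint64_t last_update_ns;	/* owned by a single updater, like the cfs_rq field */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void publish(long delta)	/* hypothetical expensive shared update */
{
	printf("publishing delta %ld\n", delta);
}

/* Propagate 'delta' to the shared counter at most once per millisecond. */
static void rate_limited_publish(long delta)
{
	uint64_t now = now_ns();

	if (now - last_update_ns < NSEC_PER_MSEC)
		return;		/* too soon since the last publish; skip */

	publish(delta);
	last_update_ns = now;
}

int main(void)
{
	rate_limited_publish(10);
	rate_limited_publish(20);	/* almost certainly skipped: same millisecond */
	return 0;
}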
  
@@@ -4677,6 -4572,22 +4677,6 @@@ static inline unsigned long task_util_e
        return max(task_util(p), _task_util_est(p));
  }
  
 -#ifdef CONFIG_UCLAMP_TASK
 -static inline unsigned long uclamp_task_util(struct task_struct *p,
 -                                           unsigned long uclamp_min,
 -                                           unsigned long uclamp_max)
 -{
 -      return clamp(task_util_est(p), uclamp_min, uclamp_max);
 -}
 -#else
 -static inline unsigned long uclamp_task_util(struct task_struct *p,
 -                                           unsigned long uclamp_min,
 -                                           unsigned long uclamp_max)
 -{
 -      return task_util_est(p);
 -}
 -#endif
 -
  static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
  {
@@@ -4780,7 -4691,7 +4780,7 @@@ static inline void util_est_update(stru
         * To avoid overestimation of actual task utilization, skip updates if
         * we cannot grant there is idle time in this CPU.
         */
 -      if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
 +      if (task_util(p) > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
                return;
  
        /*
@@@ -4828,14 -4739,14 +4828,14 @@@ static inline int util_fits_cpu(unsigne
                return fits;
  
        /*
 -       * We must use capacity_orig_of() for comparing against uclamp_min and
 +       * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
         * uclamp_max. We only care about capacity pressure (by using
         * capacity_of()) for comparing against the real util.
         *
         * If a task is boosted to 1024 for example, we don't want a tiny
         * pressure to skew the check whether it fits a CPU or not.
         *
 -       * Similarly if a task is capped to capacity_orig_of(little_cpu), it
 +       * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
         * should fit a little cpu even if there's some pressure.
         *
         * Only exception is for thermal pressure since it has a direct impact
         * For uclamp_max, we can tolerate a drop in performance level as the
         * goal is to cap the task. So it's okay if it's getting less.
         */
 -      capacity_orig = capacity_orig_of(cpu);
 +      capacity_orig = arch_scale_cpu_capacity(cpu);
        capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
  
        /*
@@@ -4967,7 -4878,7 +4967,7 @@@ static inline void update_misfit_status
  
  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  {
 -      return true;
 +      return !cfs_rq->nr_running;
  }
  
  #define UPDATE_TG     0x0
@@@ -5008,12 -4919,10 +5008,12 @@@ static inline void update_misfit_status
  static void
  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  {
 -      u64 vslice = calc_delta_fair(se->slice, se);
 -      u64 vruntime = avg_vruntime(cfs_rq);
 +      u64 vslice, vruntime = avg_vruntime(cfs_rq);
        s64 lag = 0;
  
 +      se->slice = sysctl_sched_base_slice;
 +      vslice = calc_delta_fair(se->slice, se);
 +
        /*
         * Due to how V is constructed as the weighted average of entities,
         * adding tasks with positive lag, or removing tasks with negative lag
@@@ -5302,7 -5211,7 +5302,7 @@@ set_next_entity(struct cfs_rq *cfs_rq, 
   * 4) do not run the "skip" process, if something else is available
   */
  static struct sched_entity *
 -pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 +pick_next_entity(struct cfs_rq *cfs_rq)
  {
        /*
         * Enabling NEXT_BUDDY will affect latency but not fairness.
@@@ -5846,13 -5755,13 +5846,13 @@@ static void unthrottle_cfs_rq_async(str
  
  static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
  {
 -      struct cfs_rq *local_unthrottle = NULL;
        int this_cpu = smp_processor_id();
        u64 runtime, remaining = 1;
        bool throttled = false;
 -      struct cfs_rq *cfs_rq;
 +      struct cfs_rq *cfs_rq, *tmp;
        struct rq_flags rf;
        struct rq *rq;
 +      LIST_HEAD(local_unthrottle);
  
        rcu_read_lock();
        list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
                if (!cfs_rq_throttled(cfs_rq))
                        goto next;
  
 -#ifdef CONFIG_SMP
                /* Already queued for async unthrottle */
                if (!list_empty(&cfs_rq->throttled_csd_list))
                        goto next;
 -#endif
  
                /* By the above checks, this should never be true */
                SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
  
                /* we check whether we're throttled above */
                if (cfs_rq->runtime_remaining > 0) {
 -                      if (cpu_of(rq) != this_cpu ||
 -                          SCHED_WARN_ON(local_unthrottle))
 +                      if (cpu_of(rq) != this_cpu) {
                                unthrottle_cfs_rq_async(cfs_rq);
 -                      else
 -                              local_unthrottle = cfs_rq;
 +                      } else {
 +                              /*
 +                               * We currently only expect to be unthrottling
 +                               * a single cfs_rq locally.
 +                               */
 +                              SCHED_WARN_ON(!list_empty(&local_unthrottle));
 +                              list_add_tail(&cfs_rq->throttled_csd_list,
 +                                            &local_unthrottle);
 +                      }
                } else {
                        throttled = true;
                }
  next:
                rq_unlock_irqrestore(rq, &rf);
        }
 -      rcu_read_unlock();
  
 -      if (local_unthrottle) {
 -              rq = cpu_rq(this_cpu);
 +      list_for_each_entry_safe(cfs_rq, tmp, &local_unthrottle,
 +                               throttled_csd_list) {
 +              struct rq *rq = rq_of(cfs_rq);
 +
                rq_lock_irqsave(rq, &rf);
 -              if (cfs_rq_throttled(local_unthrottle))
 -                      unthrottle_cfs_rq(local_unthrottle);
 +
 +              list_del_init(&cfs_rq->throttled_csd_list);
 +
 +              if (cfs_rq_throttled(cfs_rq))
 +                      unthrottle_cfs_rq(cfs_rq);
 +
                rq_unlock_irqrestore(rq, &rf);
        }
 +      SCHED_WARN_ON(!list_empty(&local_unthrottle));
 +
 +      rcu_read_unlock();
  
        return throttled;
  }
@@@ -6251,7 -6148,9 +6251,7 @@@ static void init_cfs_rq_runtime(struct 
  {
        cfs_rq->runtime_enabled = 0;
        INIT_LIST_HEAD(&cfs_rq->throttled_list);
 -#ifdef CONFIG_SMP
        INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
 -#endif
  }
  
  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@@ -7209,9 -7108,45 +7209,9 @@@ static int select_idle_cpu(struct task_
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
        int i, cpu, idle_cpu = -1, nr = INT_MAX;
        struct sched_domain_shared *sd_share;
 -      struct rq *this_rq = this_rq();
 -      int this = smp_processor_id();
 -      struct sched_domain *this_sd = NULL;
 -      u64 time = 0;
  
        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
  
 -      if (sched_feat(SIS_PROP) && !has_idle_core) {
 -              u64 avg_cost, avg_idle, span_avg;
 -              unsigned long now = jiffies;
 -
 -              this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 -              if (!this_sd)
 -                      return -1;
 -
 -              /*
 -               * If we're busy, the assumption that the last idle period
 -               * predicts the future is flawed; age away the remaining
 -               * predicted idle time.
 -               */
 -              if (unlikely(this_rq->wake_stamp < now)) {
 -                      while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
 -                              this_rq->wake_stamp++;
 -                              this_rq->wake_avg_idle >>= 1;
 -                      }
 -              }
 -
 -              avg_idle = this_rq->wake_avg_idle;
 -              avg_cost = this_sd->avg_scan_cost + 1;
 -
 -              span_avg = sd->span_weight * avg_idle;
 -              if (span_avg > 4*avg_cost)
 -                      nr = div_u64(span_avg, avg_cost);
 -              else
 -                      nr = 4;
 -
 -              time = cpu_clock(this);
 -      }
 -
        if (sched_feat(SIS_UTIL)) {
                sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
                if (sd_share) {
                }
        }
  
 +      if (static_branch_unlikely(&sched_cluster_active)) {
 +              struct sched_group *sg = sd->groups;
 +
 +              if (sg->flags & SD_CLUSTER) {
 +                      for_each_cpu_wrap(cpu, sched_group_span(sg), target + 1) {
 +                              if (!cpumask_test_cpu(cpu, cpus))
 +                                      continue;
 +
 +                              if (has_idle_core) {
 +                                      i = select_idle_core(p, cpu, cpus, &idle_cpu);
 +                                      if ((unsigned int)i < nr_cpumask_bits)
 +                                              return i;
 +                              } else {
 +                                      if (--nr <= 0)
 +                                              return -1;
 +                                      idle_cpu = __select_idle_cpu(cpu, p);
 +                                      if ((unsigned int)idle_cpu < nr_cpumask_bits)
 +                                              return idle_cpu;
 +                              }
 +                      }
 +                      cpumask_andnot(cpus, cpus, sched_group_span(sg));
 +              }
 +      }
 +
        for_each_cpu_wrap(cpu, cpus, target + 1) {
                if (has_idle_core) {
                        i = select_idle_core(p, cpu, cpus, &idle_cpu);
                                return i;
  
                } else {
 -                      if (!--nr)
 +                      if (--nr <= 0)
                                return -1;
                        idle_cpu = __select_idle_cpu(cpu, p);
                        if ((unsigned int)idle_cpu < nr_cpumask_bits)
        if (has_idle_core)
                set_idle_cores(target, false);
  
 -      if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
 -              time = cpu_clock(this) - time;
 -
 -              /*
 -               * Account for the scan cost of wakeups against the average
 -               * idle time.
 -               */
 -              this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
 -
 -              update_avg(&this_sd->avg_scan_cost, time);
 -      }
 -
        return idle_cpu;
  }
  
@@@ -7304,7 -7227,7 +7304,7 @@@ select_idle_capacity(struct task_struc
                 * Look for the CPU with best capacity.
                 */
                else if (fits < 0)
 -                      cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
 +                      cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
  
                /*
                 * First, select CPU which fits better (-1 being better than 0).
@@@ -7344,7 -7267,7 +7344,7 @@@ static int select_idle_sibling(struct t
        bool has_idle_core = false;
        struct sched_domain *sd;
        unsigned long task_util, util_min, util_max;
 -      int i, recent_used_cpu;
 +      int i, recent_used_cpu, prev_aff = -1;
  
        /*
         * On asymmetric system, update task utilization because we will check
         */
        if (prev != target && cpus_share_cache(prev, target) &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 -          asym_fits_cpu(task_util, util_min, util_max, prev))
 -              return prev;
 +          asym_fits_cpu(task_util, util_min, util_max, prev)) {
 +
 +              if (!static_branch_unlikely(&sched_cluster_active) ||
 +                  cpus_share_resources(prev, target))
 +                      return prev;
 +
 +              prev_aff = prev;
 +      }
  
        /*
         * Allow a per-cpu kthread to stack with the wakee if the
            (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
            cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
            asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
 -              return recent_used_cpu;
 +
 +              if (!static_branch_unlikely(&sched_cluster_active) ||
 +                  cpus_share_resources(recent_used_cpu, target))
 +                      return recent_used_cpu;
 +
 +      } else {
 +              recent_used_cpu = -1;
        }
  
        /*
        if ((unsigned)i < nr_cpumask_bits)
                return i;
  
 +      /*
 +       * For cluster machines that share a lower-level cache, such as L2 or
 +       * the LLC tag, we tend to find an idle CPU in the target's cluster
 +       * first. But prev_cpu or recent_used_cpu may also be a good candidate;
 +       * use them if possible when no idle CPU is found in select_idle_cpu().
 +       */
 +      if ((unsigned int)prev_aff < nr_cpumask_bits)
 +              return prev_aff;
 +      if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
 +              return recent_used_cpu;
 +
        return target;
  }
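
The cluster-aware comment and fallbacks added above change select_idle_sibling()'s preference order: an idle CPU sharing the target's cluster wins, and a cache-sharing prev/recent CPU is only used once the cluster scan comes up empty. Below is a minimal userspace sketch of that priority order; the 8-CPU topology, the idle[] table and the helper names are invented for illustration and are not kernel APIs.

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model: 8 CPUs, clusters of 4; idle[] marks which CPUs are idle. */
    static bool idle[8] = { false, false, true, false, false, true, false, false };
    static int cluster_of(int cpu) { return cpu / 4; }

    /* Sketch of the new fallback order: cluster-local idle CPU first,
     * then the remembered prev/recent CPU, then the target itself. */
    static int pick_wakeup_cpu(int target, int prev, int recent)
    {
            int prev_aff = -1, recent_aff = -1, cpu;

            if (prev >= 0 && idle[prev]) {
                    if (cluster_of(prev) == cluster_of(target))
                            return prev;            /* best case: same cluster */
                    prev_aff = prev;                /* keep as a second choice */
            }
            if (recent >= 0 && idle[recent]) {
                    if (cluster_of(recent) == cluster_of(target))
                            return recent;
                    recent_aff = recent;
            }
            for (cpu = cluster_of(target) * 4; cpu < cluster_of(target) * 4 + 4; cpu++)
                    if (idle[cpu])
                            return cpu;             /* idle CPU in the target's cluster */
            if (prev_aff >= 0)
                    return prev_aff;
            if (recent_aff >= 0)
                    return recent_aff;
            return target;
    }

    int main(void)
    {
            /* target 0, prev 5 (idle, other cluster), recent 6 (busy): picks CPU 2. */
            printf("%d\n", pick_wakeup_cpu(0, 5, 6));
            return 0;
    }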
  
@@@ -7569,7 -7469,7 +7569,7 @@@ cpu_util(int cpu, struct task_struct *p
                util = max(util, util_est);
        }
  
 -      return min(util, capacity_orig_of(cpu));
 +      return min(util, arch_scale_cpu_capacity(cpu));
  }
  
  unsigned long cpu_util_cfs(int cpu)
@@@ -7721,16 -7621,11 +7721,16 @@@ compute_energy(struct energy_env *eenv
  {
        unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
        unsigned long busy_time = eenv->pd_busy_time;
 +      unsigned long energy;
  
        if (dst_cpu >= 0)
                busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
  
 -      return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
 +      energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
 +
 +      trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
 +
 +      return energy;
  }
  
  /*
@@@ -7805,7 -7700,7 +7805,7 @@@ static int find_energy_efficient_cpu(st
        target = prev_cpu;
  
        sync_entity_load_avg(&p->se);
 -      if (!uclamp_task_util(p, p_util_min, p_util_max))
 +      if (!task_util_est(p) && p_util_min == 0)
                goto unlock;
  
        eenv_task_busy_time(&eenv, p, prev_cpu);
        for (; pd; pd = pd->next) {
                unsigned long util_min = p_util_min, util_max = p_util_max;
                unsigned long cpu_cap, cpu_thermal_cap, util;
 -              unsigned long cur_delta, max_spare_cap = 0;
 +              long prev_spare_cap = -1, max_spare_cap = -1;
                unsigned long rq_util_min, rq_util_max;
 -              unsigned long prev_spare_cap = 0;
 +              unsigned long cur_delta, base_energy;
                int max_spare_cap_cpu = -1;
 -              unsigned long base_energy;
                int fits, max_fits = -1;
  
                cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
                                prev_spare_cap = cpu_cap;
                                prev_fits = fits;
                        } else if ((fits > max_fits) ||
 -                                 ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
 +                                 ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
                                /*
                                 * Find the CPU with the maximum spare capacity
                                 * among the remaining CPUs in the performance
                        }
                }
  
 -              if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
 +              if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
                        continue;
  
                eenv_pd_busy_time(&eenv, cpus, p);
                base_energy = compute_energy(&eenv, pd, cpus, p, -1);
  
                /* Evaluate the energy impact of using prev_cpu. */
 -              if (prev_spare_cap > 0) {
 +              if (prev_spare_cap > -1) {
                        prev_delta = compute_energy(&eenv, pd, cpus, p,
                                                    prev_cpu);
                        /* CPU utilization has changed */
@@@ -8100,7 -7996,7 +8100,7 @@@ static void set_next_buddy(struct sched
  /*
   * Preempt the current task with a newly woken task if needed:
   */
 -static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 +static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
  {
        struct task_struct *curr = rq->curr;
        struct sched_entity *se = &curr->se, *pse = &p->se;
  
        /*
         * This is possible from callers such as attach_tasks(), in which we
 -       * unconditionally check_preempt_curr() after an enqueue (which may have
 +       * unconditionally wakeup_preempt() after an enqueue (which may have
         * lead to a throttle).  This both saves work and prevents false
         * next-buddy nomination below.
         */
@@@ -8205,7 -8101,7 +8205,7 @@@ again
                                goto again;
                }
  
 -              se = pick_next_entity(cfs_rq, curr);
 +              se = pick_next_entity(cfs_rq);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);
  
@@@ -8268,7 -8164,7 +8268,7 @@@ again
                        }
                }
  
 -              se = pick_next_entity(cfs_rq, curr);
 +              se = pick_next_entity(cfs_rq);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);
  
@@@ -8307,7 -8203,7 +8307,7 @@@ simple
                put_prev_task(rq, prev);
  
        do {
 -              se = pick_next_entity(cfs_rq, NULL);
 +              se = pick_next_entity(cfs_rq);
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);
@@@ -9020,7 -8916,7 +9020,7 @@@ static void attach_task(struct rq *rq, 
  
        WARN_ON_ONCE(task_rq(p) != rq);
        activate_task(rq, p, ENQUEUE_NOCLOCK);
 -      check_preempt_curr(rq, p, 0);
 +      wakeup_preempt(rq, p, 0);
  }
  
  /*
@@@ -9360,6 -9256,8 +9360,6 @@@ static void update_cpu_capacity(struct 
        unsigned long capacity = scale_rt_capacity(cpu);
        struct sched_group *sdg = sd->groups;
  
 -      cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
 -
        if (!capacity)
                capacity = 1;
  
@@@ -9435,7 -9333,7 +9435,7 @@@ static inline in
  check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
  {
        return ((rq->cpu_capacity * sd->imbalance_pct) <
 -                              (rq->cpu_capacity_orig * 100));
 +                              (arch_scale_cpu_capacity(cpu_of(rq)) * 100));
  }
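
With cpu_capacity_orig gone, check_cpu_capacity() reads arch_scale_cpu_capacity() directly, but the inequality itself is unchanged: a CPU counts as capacity-pressured when its remaining capacity, scaled by the domain's imbalance_pct, drops below its full capacity. A tiny worked example with made-up numbers (1024 as full capacity, 117 as a typical imbalance_pct):

    #include <stdio.h>
    #include <stdbool.h>

    /* Mirrors: rq->cpu_capacity * sd->imbalance_pct < full_capacity * 100 */
    static bool capacity_reduced(unsigned long cpu_capacity,
                                 unsigned long full_capacity,
                                 unsigned int imbalance_pct)
    {
            return cpu_capacity * imbalance_pct < full_capacity * 100;
    }

    int main(void)
    {
            printf("%d\n", capacity_reduced(800, 1024, 117)); /*  93600 < 102400 -> 1 */
            printf("%d\n", capacity_reduced(900, 1024, 117)); /* 105300 > 102400 -> 0 */
            return 0;
    }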
  
  /*
  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
  {
        return rq->misfit_task_load &&
 -              (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
 +              (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
                 check_cpu_capacity(rq, sd));
  }
  
@@@ -9598,7 -9496,7 +9598,7 @@@ static bool sched_use_asym_prio(struct 
   * can only do it if @group is an SMT group and has exactly one busy CPU. Larger
   * imbalances in the number of CPUs are dealt with in find_busiest_group().
   *
 - * If we are balancing load within an SMT core, or at DIE domain level, always
 + * If we are balancing load within an SMT core, or at PKG domain level, always
   * proceed.
   *
   * Return: true if @env::dst_cpu can do with asym_packing load balance. False
@@@ -11297,15 -11195,13 +11297,15 @@@ more_balance
                                busiest->push_cpu = this_cpu;
                                active_balance = 1;
                        }
 -                      raw_spin_rq_unlock_irqrestore(busiest, flags);
  
 +                      preempt_disable();
 +                      raw_spin_rq_unlock_irqrestore(busiest, flags);
                        if (active_balance) {
                                stop_one_cpu_nowait(cpu_of(busiest),
                                        active_load_balance_cpu_stop, busiest,
                                        &busiest->active_balance_work);
                        }
 +                      preempt_enable();
                }
        } else {
                sd->nr_balance_failed = 0;
@@@ -11613,39 -11509,36 +11613,39 @@@ static inline int on_null_domain(struc
  
  #ifdef CONFIG_NO_HZ_COMMON
  /*
 - * idle load balancing details
 - * - When one of the busy CPUs notice that there may be an idle rebalancing
 + * NOHZ idle load balancing (ILB) details:
 + *
 + * - When one of the busy CPUs notices that there may be an idle rebalancing
   *   needed, they will kick the idle load balancer, which then does idle
   *   load balancing for all the idle CPUs.
 - * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED not set
 + *
 + * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED is not set
   *   anywhere yet.
   */
 -
  static inline int find_new_ilb(void)
  {
 -      int ilb;
        const struct cpumask *hk_mask;
 +      int ilb_cpu;
  
        hk_mask = housekeeping_cpumask(HK_TYPE_MISC);
  
 -      for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 +      for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
  
 -              if (ilb == smp_processor_id())
 +              if (ilb_cpu == smp_processor_id())
                        continue;
  
 -              if (idle_cpu(ilb))
 -                      return ilb;
 +              if (idle_cpu(ilb_cpu))
 +                      return ilb_cpu;
        }
  
 -      return nr_cpu_ids;
 +      return -1;
  }
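
find_new_ilb() now returns -1 rather than nr_cpu_ids when no idle housekeeping CPU exists, which is what the simplified check in kick_ilb() below relies on. A toy version of the same scan, with cpumasks modeled as plain bitmaps (an assumption for the sketch, not the kernel's cpumask API):

    #include <stdio.h>

    /* First CPU that is idle and in the housekeeping set, excluding ourselves. */
    static int pick_ilb_cpu(unsigned long idle_mask, unsigned long hk_mask, int self)
    {
            unsigned long cand = idle_mask & hk_mask & ~(1UL << self);

            return cand ? __builtin_ctzl(cand) : -1;
    }

    int main(void)
    {
            printf("%d\n", pick_ilb_cpu(0xc, 0xf, 3)); /* CPUs 2,3 idle; skip self 3 -> 2 */
            printf("%d\n", pick_ilb_cpu(0x8, 0xf, 3)); /* only self idle -> -1 */
            return 0;
    }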
  
  /*
 - * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
 - * idle CPU in the HK_TYPE_MISC housekeeping set (if there is one).
 + * Kick a CPU to do the NOHZ balancing, if it is time for it, via a cross-CPU
 + * SMP function call (IPI).
 + *
 + * We pick the first idle CPU in the HK_TYPE_MISC housekeeping set (if there is one).
   */
  static void kick_ilb(unsigned int flags)
  {
                nohz.next_balance = jiffies+1;
  
        ilb_cpu = find_new_ilb();
 -
 -      if (ilb_cpu >= nr_cpu_ids)
 +      if (ilb_cpu < 0)
                return;
  
        /*
  
        /*
         * This way we generate an IPI on the target CPU which
 -       * is idle. And the softirq performing nohz idle load balance
 +       * is idle, and the softirq performing NOHZ idle load balancing
         * will be run before returning from the IPI.
         */
        smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
@@@ -11701,7 -11595,7 +11701,7 @@@ static void nohz_balancer_kick(struct r
  
        /*
         * None are in tickless mode and hence no need for NOHZ idle load
 -       * balancing.
 +       * balancing:
         */
        if (likely(!atomic_read(&nohz.nr_cpus)))
                return;
        sd = rcu_dereference(rq->sd);
        if (sd) {
                /*
 -               * If there's a CFS task and the current CPU has reduced
 -               * capacity; kick the ILB to see if there's a better CPU to run
 -               * on.
 +               * If there's a runnable CFS task and the current CPU has reduced
 +               * capacity, kick the ILB to see if there's a better CPU to run on:
                 */
                if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) {
                        flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
        if (sds) {
                /*
                 * If there is an imbalance between LLC domains (IOW we could
 -               * increase the overall cache use), we need some less-loaded LLC
 -               * domain to pull some load. Likewise, we may need to spread
 +               * increase the overall cache utilization), we need a less-loaded LLC
 +               * domain to pull some load from. Likewise, we may need to spread
                 * load within the current LLC domain (e.g. packed SMT cores but
                 * other CPUs are idle). We can't really know from here how busy
 -               * the others are - so just get a nohz balance going if it looks
 +               * the others are - so just get a NOHZ balance going if it looks
                 * like this LLC domain has tasks we could move.
                 */
                nr_busy = atomic_read(&sds->nr_busy_cpus);
@@@ -12050,19 -11945,8 +12050,19 @@@ static bool nohz_idle_balance(struct r
  }
  
  /*
 - * Check if we need to run the ILB for updating blocked load before entering
 - * idle state.
 + * Check if we need to directly run the ILB for updating blocked load before
 + * entering idle state. Here we run ILB directly without issuing IPIs.
 + *
 + * Note that when this function is called, the tick may not yet be stopped on
 + * this CPU. nohz.idle_cpus_mask is updated only when the tick is stopped and
 + * cleared on the next busy tick. In other words, nohz.idle_cpus_mask updates
 + * don't align with CPUs entering/exiting idle to avoid bottlenecks due to high idle
 + * entry/exit rate (usec). So it is possible that _nohz_idle_balance() is
 + * called from this function on (this) CPU that's not yet in the mask. That's
 + * OK because the goal of nohz_run_idle_balance() is to run ILB only for
 + * updating the blocked load of already idle CPUs without waking up one of
 + * those idle CPUs and outside the preempt disable / irq off phase of the local
 + * cpu about to enter idle, because it can take a long time.
   */
  void nohz_run_idle_balance(int cpu)
  {
@@@ -12507,7 -12391,7 +12507,7 @@@ prio_changed_fair(struct rq *rq, struc
                if (p->prio > oldprio)
                        resched_curr(rq);
        } else
 -              check_preempt_curr(rq, p, 0);
 +              wakeup_preempt(rq, p, 0);
  }
  
  #ifdef CONFIG_FAIR_GROUP_SCHED
@@@ -12609,7 -12493,7 +12609,7 @@@ static void switched_to_fair(struct rq 
                if (task_current(rq, p))
                        resched_curr(rq);
                else
 -                      check_preempt_curr(rq, p, 0);
 +                      wakeup_preempt(rq, p, 0);
        }
  }
  
@@@ -12968,7 -12852,7 +12968,7 @@@ DEFINE_SCHED_CLASS(fair) = 
        .yield_task             = yield_task_fair,
        .yield_to_task          = yield_to_task_fair,
  
 -      .check_preempt_curr     = check_preempt_wakeup,
 +      .wakeup_preempt         = check_preempt_wakeup_fair,
  
        .pick_next_task         = __pick_next_task_fair,
        .put_prev_task          = put_prev_task_fair,
diff --combined mm/mempolicy.c
index e52e3a0b8f2e6c22b807851a6247cee4510ef559,5e472e6e0507c7da9de7c5eaa4f6b70d2f234a52..10a590ee1c89974c353a28aa0c4eb393a375e31b
@@@ -25,7 -25,7 +25,7 @@@
   *                to the last. It would be better if bind would truly restrict
   *                the allocation to memory nodes instead
   *
-  * preferred       Try a specific node first before normal fallback.
+  * preferred      Try a specific node first before normal fallback.
   *                As a special case NUMA_NO_NODE here means do the allocation
   *                on the local CPU. This is normally identical to default,
   *                but useful to set in a VMA when you have a non default
@@@ -52,7 -52,7 +52,7 @@@
   * on systems with highmem kernel lowmem allocation don't get policied.
   * Same with GFP_DMA allocations.
   *
-  * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
+  * For shmem/tmpfs shared memory the policy is shared between
   * all users and remembered even when nobody has memory mapped.
   */
  
  
  /* Internal flags */
  #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)  /* Skip checks for continuous vmas */
- #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)                /* Invert check for nodemask */
+ #define MPOL_MF_INVERT       (MPOL_MF_INTERNAL << 1)  /* Invert check for nodemask */
+ #define MPOL_MF_WRLOCK       (MPOL_MF_INTERNAL << 2)  /* Write-lock walked vmas */
  
  static struct kmem_cache *policy_cache;
  static struct kmem_cache *sn_cache;
@@@ -131,26 -132,22 +132,26 @@@ static struct mempolicy default_policy 
  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
  
  /**
 - * numa_map_to_online_node - Find closest online node
 + * numa_nearest_node - Find nearest node by state
   * @node: Node id to start the search
 + * @state: State to filter the search
   *
 - * Lookup the next closest node by distance if @nid is not online.
 + * Lookup the closest node by distance if @nid is not in state.
   *
 - * Return: this @node if it is online, otherwise the closest node by distance
 + * Return: this @node if it is in state, otherwise the closest node by distance
   */
 -int numa_map_to_online_node(int node)
 +int numa_nearest_node(int node, unsigned int state)
  {
        int min_dist = INT_MAX, dist, n, min_node;
  
 -      if (node == NUMA_NO_NODE || node_online(node))
 +      if (state >= NR_NODE_STATES)
 +              return -EINVAL;
 +
 +      if (node == NUMA_NO_NODE || node_state(node, state))
                return node;
  
        min_node = node;
 -      for_each_online_node(n) {
 +      for_each_node_state(n, state) {
                dist = node_distance(node, n);
                if (dist < min_dist) {
                        min_dist = dist;
  
        return min_node;
  }
 -EXPORT_SYMBOL_GPL(numa_map_to_online_node);
 +EXPORT_SYMBOL_GPL(numa_nearest_node);
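
numa_map_to_online_node() is generalized into numa_nearest_node(), which accepts any node state and returns the candidate with the smallest node_distance(). The standalone sketch below performs the same minimum-distance search over an invented 4-node distance table; the table, the bitmask encoding of the state and all numbers are assumptions made for illustration.

    #include <stdio.h>
    #include <limits.h>

    #define NR_NODES 4

    /* Toy NUMA distance matrix; the diagonal (local access) is 10, as in ACPI SLIT. */
    static const int dist[NR_NODES][NR_NODES] = {
            { 10, 20, 30, 40 },
            { 20, 10, 40, 30 },
            { 30, 40, 10, 20 },
            { 40, 30, 20, 10 },
    };

    /* Return the candidate in 'mask' closest to 'node' (or 'node' itself if set). */
    static int nearest_node(int node, unsigned int mask)
    {
            int n, best = node, best_dist = INT_MAX;

            if (mask & (1u << node))
                    return node;
            for (n = 0; n < NR_NODES; n++) {
                    if (!(mask & (1u << n)))
                            continue;
                    if (dist[node][n] < best_dist) {
                            best_dist = dist[node][n];
                            best = n;
                    }
            }
            return best;
    }

    int main(void)
    {
            /* Node 0 lacks the wanted state; nodes 1 and 3 have it: node 1 (distance 20) wins. */
            printf("%d\n", nearest_node(0, (1u << 1) | (1u << 3)));
            return 0;
    }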
  
  struct mempolicy *get_task_policy(struct task_struct *p)
  {
@@@ -267,9 -264,6 +268,6 @@@ static struct mempolicy *mpol_new(unsig
  {
        struct mempolicy *policy;
  
-       pr_debug("setting mode %d flags %d nodes[0] %lx\n",
-                mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE);
        if (mode == MPOL_DEFAULT) {
                if (nodes && !nodes_empty(*nodes))
                        return ERR_PTR(-EINVAL);
                        return ERR_PTR(-EINVAL);
        } else if (nodes_empty(*nodes))
                return ERR_PTR(-EINVAL);
        policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
        if (!policy)
                return ERR_PTR(-ENOMEM);
  }
  
  /* Slow path of a mpol destructor. */
- void __mpol_put(struct mempolicy *p)
+ void __mpol_put(struct mempolicy *pol)
  {
-       if (!atomic_dec_and_test(&p->refcnt))
+       if (!atomic_dec_and_test(&pol->refcnt))
                return;
-       kmem_cache_free(policy_cache, p);
+       kmem_cache_free(policy_cache, pol);
  }
  
  static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
@@@ -370,7 -365,6 +369,6 @@@ static void mpol_rebind_policy(struct m
   *
   * Called with task's alloc_lock held.
   */
  void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
  {
        mpol_rebind_policy(tsk->mempolicy, new);
   *
   * Call holding a reference to mm.  Takes mm->mmap_lock during call.
   */
  void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
  {
        struct vm_area_struct *vma;
@@@ -420,8 -413,25 +417,25 @@@ static const struct mempolicy_operation
        },
  };
  
- static int migrate_folio_add(struct folio *folio, struct list_head *foliolist,
+ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
                                unsigned long flags);
+ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+                               pgoff_t ilx, int *nid);
+ static bool strictly_unmovable(unsigned long flags)
+ {
+       /*
+        * STRICT without MOVE flags lets do_mbind() fail immediately with -EIO
+        * if any misplaced page is found.
+        */
+       return (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ==
+                        MPOL_MF_STRICT;
+ }
+ struct migration_mpol {               /* for alloc_migration_target_by_mpol() */
+       struct mempolicy *pol;
+       pgoff_t ilx;
+ };
  
  struct queue_pages {
        struct list_head *pagelist;
        unsigned long start;
        unsigned long end;
        struct vm_area_struct *first;
-       bool has_unmovable;
+       struct folio *large;            /* note last large folio encountered */
+       long nr_failed;                 /* could not be isolated at this time */
  };
  
  /*
@@@ -448,61 -459,37 +463,37 @@@ static inline bool queue_folio_required
        return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
  }
  
- /*
-  * queue_folios_pmd() has three possible return values:
-  * 0 - folios are placed on the right node or queued successfully, or
-  *     special page is met, i.e. zero page, or unmovable page is found
-  *     but continue walking (indicated by queue_pages.has_unmovable).
-  * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
-  *        existing folio was already on a node that does not follow the
-  *        policy.
-  */
- static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
-                               unsigned long end, struct mm_walk *walk)
-       __releases(ptl)
+ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
  {
-       int ret = 0;
        struct folio *folio;
        struct queue_pages *qp = walk->private;
-       unsigned long flags;
  
        if (unlikely(is_pmd_migration_entry(*pmd))) {
-               ret = -EIO;
-               goto unlock;
+               qp->nr_failed++;
+               return;
        }
        folio = pfn_folio(pmd_pfn(*pmd));
        if (is_huge_zero_page(&folio->page)) {
                walk->action = ACTION_CONTINUE;
-               goto unlock;
+               return;
        }
        if (!queue_folio_required(folio, qp))
-               goto unlock;
-       flags = qp->flags;
-       /* go to folio migration */
-       if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
-               if (!vma_migratable(walk->vma) ||
-                   migrate_folio_add(folio, qp->pagelist, flags)) {
-                       qp->has_unmovable = true;
-                       goto unlock;
-               }
-       } else
-               ret = -EIO;
- unlock:
-       spin_unlock(ptl);
-       return ret;
+               return;
+       if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+           !vma_migratable(walk->vma) ||
+           !migrate_folio_add(folio, qp->pagelist, qp->flags))
+               qp->nr_failed++;
  }
  
  /*
-  * Scan through pages checking if pages follow certain conditions,
-  * and move them to the pagelist if they do.
+  * Scan through folios, checking if they satisfy the required conditions,
+  * moving them from LRU to local pagelist for migration if they do (or not).
   *
-  * queue_folios_pte_range() has three possible return values:
-  * 0 - folios are placed on the right node or queued successfully, or
-  *     special page is met, i.e. zero page, or unmovable page is found
-  *     but continue walking (indicated by queue_pages.has_unmovable).
-  * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
-  *        on a node that does not follow the policy.
+  * queue_folios_pte_range() has two possible return values:
+  * 0 - continue walking to scan for more, even if an existing folio on the
+  *     wrong node could not be isolated and queued for migration.
+  * -EIO - only MPOL_MF_STRICT was specified, without MPOL_MF_MOVE or ..._ALL,
+  *        and an existing folio was on a node that does not follow the policy.
   */
  static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
                        unsigned long end, struct mm_walk *walk)
        spinlock_t *ptl;
  
        ptl = pmd_trans_huge_lock(pmd, vma);
-       if (ptl)
-               return queue_folios_pmd(pmd, ptl, addr, end, walk);
+       if (ptl) {
+               queue_folios_pmd(pmd, walk);
+               spin_unlock(ptl);
+               goto out;
+       }
  
        mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        if (!pte) {
        }
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                ptent = ptep_get(pte);
-               if (!pte_present(ptent))
+               if (pte_none(ptent))
+                       continue;
+               if (!pte_present(ptent)) {
+                       if (is_migration_entry(pte_to_swp_entry(ptent)))
+                               qp->nr_failed++;
                        continue;
+               }
                folio = vm_normal_folio(vma, addr, ptent);
                if (!folio || folio_is_zone_device(folio))
                        continue;
                        continue;
                if (!queue_folio_required(folio, qp))
                        continue;
-               if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
-                       /*
-                        * MPOL_MF_STRICT must be specified if we get here.
-                        * Continue walking vmas due to MPOL_MF_MOVE* flags.
-                        */
-                       if (!vma_migratable(vma))
-                               qp->has_unmovable = true;
+               if (folio_test_large(folio)) {
                        /*
-                        * Do not abort immediately since there may be
-                        * temporary off LRU pages in the range.  Still
-                        * need migrate other LRU pages.
+                        * A large folio can only be isolated from LRU once,
+                        * but may be mapped by many PTEs (and Copy-On-Write may
+                        * intersperse PTEs of other, order 0, folios).  This is
+                        * a common case, so don't mistake it for failure (but
+                        * there can be other cases of multi-mapped pages which
+                        * this quick check does not help to filter out - and a
+                        * search of the pagelist might grow to be prohibitive).
+                        *
+                        * migrate_pages(&pagelist) returns nr_failed folios, so
+                        * check "large" now so that queue_pages_range() returns
+                        * a comparable nr_failed folios.  This does imply that
+                        * if folio could not be isolated for some racy reason
+                        * at its first PTE, later PTEs will not give it another
+                        * chance of isolation; but keeps the accounting simple.
                         */
-                       if (migrate_folio_add(folio, qp->pagelist, flags))
-                               qp->has_unmovable = true;
-               } else
-                       break;
+                       if (folio == qp->large)
+                               continue;
+                       qp->large = folio;
+               }
+               if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+                   !vma_migratable(vma) ||
+                   !migrate_folio_add(folio, qp->pagelist, flags)) {
+                       qp->nr_failed++;
+                       if (strictly_unmovable(flags))
+                               break;
+               }
        }
        pte_unmap_unlock(mapped_pte, ptl);
        cond_resched();
-       return addr != end ? -EIO : 0;
+ out:
+       if (qp->nr_failed && strictly_unmovable(flags))
+               return -EIO;
+       return 0;
  }
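
The new qp->large field exists so that a large folio mapped by many PTEs contributes at most one isolation failure, keeping queue_pages_range()'s count comparable with what migrate_pages() reports later. The sketch below models only that accounting: every isolation attempt is assumed to fail, and the folio ids and tables are invented for illustration.

    #include <stdio.h>

    /* Each PTE maps some folio id; a large folio repeats across many PTEs. */
    static int folio_of_pte[] = { 7, 7, 7, 7, 3, 5, 5, 9 };
    static int is_large[10]   = { [5] = 1, [7] = 1 };

    int main(void)
    {
            int last_large = -1, nr_failed = 0;
            unsigned int i;

            for (i = 0; i < sizeof(folio_of_pte) / sizeof(folio_of_pte[0]); i++) {
                    int folio = folio_of_pte[i];

                    if (is_large[folio]) {
                            if (folio == last_large)
                                    continue;       /* this folio was already counted */
                            last_large = folio;
                    }
                    nr_failed++;                    /* pretend isolation failed */
            }
            printf("%d\n", nr_failed);              /* 4 distinct folios, not 8 PTEs */
            return 0;
    }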
  
  static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
                               unsigned long addr, unsigned long end,
                               struct mm_walk *walk)
  {
-       int ret = 0;
  #ifdef CONFIG_HUGETLB_PAGE
        struct queue_pages *qp = walk->private;
-       unsigned long flags = (qp->flags & MPOL_MF_VALID);
+       unsigned long flags = qp->flags;
        struct folio *folio;
        spinlock_t *ptl;
        pte_t entry;
  
        ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
        entry = huge_ptep_get(pte);
-       if (!pte_present(entry))
+       if (!pte_present(entry)) {
+               if (unlikely(is_hugetlb_entry_migration(entry)))
+                       qp->nr_failed++;
                goto unlock;
+       }
        folio = pfn_folio(pte_pfn(entry));
        if (!queue_folio_required(folio, qp))
                goto unlock;
-       if (flags == MPOL_MF_STRICT) {
-               /*
-                * STRICT alone means only detecting misplaced folio and no
-                * need to further check other vma.
-                */
-               ret = -EIO;
+       if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+           !vma_migratable(walk->vma)) {
+               qp->nr_failed++;
                goto unlock;
        }
-       if (!vma_migratable(walk->vma)) {
-               /*
-                * Must be STRICT with MOVE*, otherwise .test_walk() have
-                * stopped walking current vma.
-                * Detecting misplaced folio but allow migrating folios which
-                * have been queued.
-                */
-               qp->has_unmovable = true;
-               goto unlock;
-       }
        /*
-        * With MPOL_MF_MOVE, we try to migrate only unshared folios. If it
-        * is shared it is likely not worth migrating.
+        * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
+        * Choosing not to migrate a shared folio is not counted as a failure.
         *
         * To check if the folio is shared, ideally we want to make sure
         * every page is mapped to the same process. Doing that is very
-        * expensive, so check the estimated mapcount of the folio instead.
+        * expensive, so check the estimated sharers of the folio instead.
         */
-       if (flags & (MPOL_MF_MOVE_ALL) ||
-           (flags & MPOL_MF_MOVE && folio_estimated_sharers(folio) == 1 &&
-            !hugetlb_pmd_shared(pte))) {
-               if (!isolate_hugetlb(folio, qp->pagelist) &&
-                       (flags & MPOL_MF_STRICT))
-                       /*
-                        * Failed to isolate folio but allow migrating pages
-                        * which have been queued.
-                        */
-                       qp->has_unmovable = true;
-       }
+       if ((flags & MPOL_MF_MOVE_ALL) ||
+           (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte)))
+               if (!isolate_hugetlb(folio, qp->pagelist))
+                       qp->nr_failed++;
  unlock:
        spin_unlock(ptl);
- #else
-       BUG();
+       if (qp->nr_failed && strictly_unmovable(flags))
+               return -EIO;
  #endif
-       return ret;
+       return 0;
  }
  
  #ifdef CONFIG_NUMA_BALANCING
@@@ -656,12 -643,6 +647,6 @@@ unsigned long change_prot_numa(struct v
  
        return nr_updated;
  }
- #else
- static unsigned long change_prot_numa(struct vm_area_struct *vma,
-                       unsigned long addr, unsigned long end)
- {
-       return 0;
- }
  #endif /* CONFIG_NUMA_BALANCING */
  
  static int queue_pages_test_walk(unsigned long start, unsigned long end,
        if (endvma > end)
                endvma = end;
  
-       if (flags & MPOL_MF_LAZY) {
-               /* Similar to task_numa_work, skip inaccessible VMAs */
-               if (!is_vm_hugetlb_page(vma) && vma_is_accessible(vma) &&
-                       !(vma->vm_flags & VM_MIXEDMAP))
-                       change_prot_numa(vma, start, endvma);
-               return 1;
-       }
-       /* queue pages from current vma */
-       if (flags & MPOL_MF_VALID)
+       /*
+        * Check page nodes, and queue pages to move, in the current vma.
+        * But if no moving, and no strict checking, the scan can be skipped.
+        */
+       if (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
                return 0;
        return 1;
  }
@@@ -731,22 -707,21 +711,21 @@@ static const struct mm_walk_ops queue_p
  /*
   * Walk through page tables and collect pages to be migrated.
   *
-  * If pages found in a given range are on a set of nodes (determined by
-  * @nodes and @flags,) it's isolated and queued to the pagelist which is
-  * passed via @private.
+  * If pages found in a given range are not on the required set of @nodes,
+  * and migration is allowed, they are isolated and queued to @pagelist.
   *
-  * queue_pages_range() has three possible return values:
-  * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
-  *     specified.
-  * 0 - queue pages successfully or no misplaced page.
-  * errno - i.e. misplaced pages with MPOL_MF_STRICT specified (-EIO) or
-  *         memory range specified by nodemask and maxnode points outside
-  *         your accessible address space (-EFAULT)
+  * queue_pages_range() may return:
+  * 0 - all pages already on the right node, or successfully queued for moving
+  *     (or neither strict checking nor moving requested: only range checking).
+  * >0 - this number of misplaced folios could not be queued for moving
+  *      (a hugetlbfs page or a transparent huge page being counted as 1).
+  * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
+  * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
   */
- static int
+ static long
  queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
                nodemask_t *nodes, unsigned long flags,
-               struct list_head *pagelist, bool lock_vma)
+               struct list_head *pagelist)
  {
        int err;
        struct queue_pages qp = {
                .start = start,
                .end = end,
                .first = NULL,
-               .has_unmovable = false,
        };
-       const struct mm_walk_ops *ops = lock_vma ?
+       const struct mm_walk_ops *ops = (flags & MPOL_MF_WRLOCK) ?
                        &queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
  
        err = walk_page_range(mm, start, end, ops, &qp);
  
-       if (qp.has_unmovable)
-               err = 1;
        if (!qp.first)
                /* whole range in hole */
                err = -EFAULT;
  
-       return err;
+       return err ? : qp.nr_failed;
  }
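
queue_pages_range() now folds a hard error and the soft failure count into one long: "err ? : qp.nr_failed" is the GNU shorthand for "err ? err : qp.nr_failed", so a negative errno always takes precedence over a positive count of folios left behind. A minimal model of that convention (the -14 stands for -EFAULT, an assumption for the example):

    #include <stdio.h>

    /* A negative errno wins; otherwise report how many folios could not be queued. */
    static long walk_result(long err, long nr_failed)
    {
            return err ? err : nr_failed;   /* spelled "err ? : qp.nr_failed" above */
    }

    int main(void)
    {
            printf("%ld\n", walk_result(-14, 3));   /* -14: the page walk itself failed */
            printf("%ld\n", walk_result(0, 3));     /*   3: three folios left misplaced */
            printf("%ld\n", walk_result(0, 0));     /*   0: nothing to do or all queued */
            return 0;
    }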
  
  /*
   * This must be called with the mmap_lock held for writing.
   */
  static int vma_replace_policy(struct vm_area_struct *vma,
-                                               struct mempolicy *pol)
+                               struct mempolicy *pol)
  {
        int err;
        struct mempolicy *old;
  
        vma_assert_write_locked(vma);
  
-       pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
-                vma->vm_start, vma->vm_end, vma->vm_pgoff,
-                vma->vm_ops, vma->vm_file,
-                vma->vm_ops ? vma->vm_ops->set_policy : NULL);
        new = mpol_dup(pol);
        if (IS_ERR(new))
                return PTR_ERR(new);
@@@ -815,10 -782,7 +786,7 @@@ static int mbind_range(struct vma_itera
                struct vm_area_struct **prev, unsigned long start,
                unsigned long end, struct mempolicy *new_pol)
  {
-       struct vm_area_struct *merged;
        unsigned long vmstart, vmend;
-       pgoff_t pgoff;
-       int err;
  
        vmend = min(end, vma->vm_end);
        if (start > vma->vm_start) {
                vmstart = vma->vm_start;
        }
  
-       if (mpol_equal(vma_policy(vma), new_pol)) {
+       if (mpol_equal(vma->vm_policy, new_pol)) {
                *prev = vma;
                return 0;
        }
  
-       pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
-       merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags,
-                        vma->anon_vma, vma->vm_file, pgoff, new_pol,
-                        vma->vm_userfaultfd_ctx, anon_vma_name(vma));
-       if (merged) {
-               *prev = merged;
-               return vma_replace_policy(merged, new_pol);
-       }
-       if (vma->vm_start != vmstart) {
-               err = split_vma(vmi, vma, vmstart, 1);
-               if (err)
-                       return err;
-       }
-       if (vma->vm_end != vmend) {
-               err = split_vma(vmi, vma, vmend, 0);
-               if (err)
-                       return err;
-       }
+       vma =  vma_modify_policy(vmi, *prev, vma, vmstart, vmend, new_pol);
+       if (IS_ERR(vma))
+               return PTR_ERR(vma);
  
        *prev = vma;
        return vma_replace_policy(vma, new_pol);
@@@ -900,18 -847,18 +851,18 @@@ out
   *
   * Called with task's alloc_lock held
   */
- static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
+ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
  {
        nodes_clear(*nodes);
-       if (p == &default_policy)
+       if (pol == &default_policy)
                return;
  
-       switch (p->mode) {
+       switch (pol->mode) {
        case MPOL_BIND:
        case MPOL_INTERLEAVE:
        case MPOL_PREFERRED:
        case MPOL_PREFERRED_MANY:
-               *nodes = p->nodes;
+               *nodes = pol->nodes;
                break;
        case MPOL_LOCAL:
                /* return empty node mask for local allocation */
@@@ -958,6 -905,7 +909,7 @@@ static long do_get_mempolicy(int *polic
        }
  
        if (flags & MPOL_F_ADDR) {
+               pgoff_t ilx;            /* ignored here */
                /*
                 * Do NOT fall back to task policy if the
                 * vma/shared policy at addr is NULL.  We
                        mmap_read_unlock(mm);
                        return -EFAULT;
                }
-               if (vma->vm_ops && vma->vm_ops->get_policy)
-                       pol = vma->vm_ops->get_policy(vma, addr);
-               else
-                       pol = vma->vm_policy;
+               pol = __get_vma_policy(vma, addr, &ilx);
        } else if (addr)
                return -EINVAL;
  
  }
  
  #ifdef CONFIG_MIGRATION
- static int migrate_folio_add(struct folio *folio, struct list_head *foliolist,
+ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
                                unsigned long flags)
  {
        /*
-        * We try to migrate only unshared folios. If it is shared it
-        * is likely not worth migrating.
+        * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
+        * Choosing not to migrate a shared folio is not counted as a failure.
         *
         * To check if the folio is shared, ideally we want to make sure
         * every page is mapped to the same process. Doing that is very
-        * expensive, so check the estimated mapcount of the folio instead.
+        * expensive, so check the estimated sharers of the folio instead.
         */
        if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) {
                if (folio_isolate_lru(folio)) {
                        node_stat_mod_folio(folio,
                                NR_ISOLATED_ANON + folio_is_file_lru(folio),
                                folio_nr_pages(folio));
-               } else if (flags & MPOL_MF_STRICT) {
+               } else {
                        /*
                         * Non-movable folio may reach here.  And, there may be
                         * temporary off LRU folios or non-LRU movable folios.
                         * Treat them as unmovable folios since they can't be
-                        * isolated, so they can't be moved at the moment.  It
-                        * should return -EIO for this case too.
+                        * isolated, so they can't be moved at the moment.
                         */
-                       return -EIO;
+                       return false;
                }
        }
-       return 0;
+       return true;
  }
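
migrate_folio_add() now returns a bool, but keeps the heuristic of leaving apparently shared folios alone unless MPOL_MF_MOVE_ALL was requested. The predicate below models just that decision; the flag values and the sharers estimate are stand-ins, not the kernel's definitions.

    #include <stdio.h>
    #include <stdbool.h>

    #define MOVE      0x1   /* stand-in for MPOL_MF_MOVE */
    #define MOVE_ALL  0x2   /* stand-in for MPOL_MF_MOVE_ALL */

    /* Isolate only when MOVE_ALL is set, or when the folio looks unshared. */
    static bool should_isolate(unsigned int flags, int estimated_sharers)
    {
            return (flags & MOVE_ALL) || estimated_sharers == 1;
    }

    int main(void)
    {
            printf("%d\n", should_isolate(MOVE, 1));     /* 1: private folio, migrate */
            printf("%d\n", should_isolate(MOVE, 3));     /* 0: likely shared, skip    */
            printf("%d\n", should_isolate(MOVE_ALL, 3)); /* 1: MOVE_ALL overrides     */
            return 0;
    }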
  
  /*
   * Migrate pages from one node to a target node.
   * Returns error or the number of pages not migrated.
   */
- static int migrate_to_node(struct mm_struct *mm, int source, int dest,
-                          int flags)
+ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
+                           int flags)
  {
        nodemask_t nmask;
        struct vm_area_struct *vma;
        LIST_HEAD(pagelist);
-       int err = 0;
+       long nr_failed;
+       long err = 0;
        struct migration_target_control mtc = {
                .nid = dest,
                .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
        nodes_clear(nmask);
        node_set(source, nmask);
  
+       VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
+       mmap_read_lock(mm);
+       vma = find_vma(mm, 0);
        /*
-        * This does not "check" the range but isolates all pages that
+        * This does not migrate the range, but isolates all pages that
         * need migration.  Between passing in the full user address
-        * space range and MPOL_MF_DISCONTIG_OK, this call can not fail.
+        * space range and MPOL_MF_DISCONTIG_OK, this call cannot fail,
+        * but passes back the count of pages which could not be isolated.
         */
-       vma = find_vma(mm, 0);
-       VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
-       queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
-                       flags | MPOL_MF_DISCONTIG_OK, &pagelist, false);
+       nr_failed = queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
+                                     flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+       mmap_read_unlock(mm);
  
        if (!list_empty(&pagelist)) {
                err = migrate_pages(&pagelist, alloc_migration_target, NULL,
-                               (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL);
+                       (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL);
                if (err)
                        putback_movable_pages(&pagelist);
        }
  
+       if (err >= 0)
+               err += nr_failed;
        return err;
  }
  
  int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
                     const nodemask_t *to, int flags)
  {
-       int busy = 0;
-       int err = 0;
+       long nr_failed = 0;
+       long err = 0;
        nodemask_t tmp;
  
        lru_cache_disable();
  
-       mmap_read_lock(mm);
        /*
         * Find a 'source' bit set in 'tmp' whose corresponding 'dest'
         * bit in 'to' is not also set in 'tmp'.  Clear the found 'source'
                node_clear(source, tmp);
                err = migrate_to_node(mm, source, dest, flags);
                if (err > 0)
-                       busy += err;
+                       nr_failed += err;
                if (err < 0)
                        break;
        }
-       mmap_read_unlock(mm);
  
        lru_cache_enable();
        if (err < 0)
                return err;
-       return busy;
+       return (nr_failed < INT_MAX) ? nr_failed : INT_MAX;
  }
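
do_migrate_pages() now accumulates the per-node results in a long and clamps the total to INT_MAX, since migrate_pages(2) ultimately returns an int. A trivial model of that accumulate-then-clamp step, with invented per-node counts:

    #include <stdio.h>
    #include <limits.h>

    static long total_not_migrated(const long *per_node, int n)
    {
            long nr_failed = 0;

            for (int i = 0; i < n; i++) {
                    if (per_node[i] < 0)
                            return per_node[i];     /* a hard error wins outright */
                    nr_failed += per_node[i];
            }
            return nr_failed < INT_MAX ? nr_failed : INT_MAX;
    }

    int main(void)
    {
            long counts[] = { 0, 3, 1 };
            printf("%ld\n", total_not_migrated(counts, 3));  /* 4 pages not migrated */
            return 0;
    }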
  
  /*
-  * Allocate a new page for page migration based on vma policy.
-  * Start by assuming the page is mapped by the same vma as contains @start.
-  * Search forward from there, if not.  N.B., this assumes that the
-  * list of pages handed to migrate_pages()--which is how we get here--
-  * is in virtual address order.
+  * Allocate a new folio for page migration, according to NUMA mempolicy.
   */
- static struct folio *new_folio(struct folio *src, unsigned long start)
+ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+                                                   unsigned long private)
  {
-       struct vm_area_struct *vma;
-       unsigned long address;
-       VMA_ITERATOR(vmi, current->mm, start);
-       gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
+       struct migration_mpol *mmpol = (struct migration_mpol *)private;
+       struct mempolicy *pol = mmpol->pol;
+       pgoff_t ilx = mmpol->ilx;
+       struct page *page;
+       unsigned int order;
+       int nid = numa_node_id();
+       gfp_t gfp;
  
-       for_each_vma(vmi, vma) {
-               address = page_address_in_vma(&src->page, vma);
-               if (address != -EFAULT)
-                       break;
-       }
+       order = folio_order(src);
+       ilx += src->index >> order;
  
        if (folio_test_hugetlb(src)) {
-               return alloc_hugetlb_folio_vma(folio_hstate(src),
-                               vma, address);
+               nodemask_t *nodemask;
+               struct hstate *h;
+               h = folio_hstate(src);
+               gfp = htlb_alloc_mask(h);
+               nodemask = policy_nodemask(gfp, pol, ilx, &nid);
+               return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
        }
  
        if (folio_test_large(src))
                gfp = GFP_TRANSHUGE;
+       else
+               gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
  
-       /*
-        * if !vma, vma_alloc_folio() will use task or system default policy
-        */
-       return vma_alloc_folio(gfp, folio_order(src), vma, address,
-                       folio_test_large(src));
+       page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
+       return page_rmappable_folio(page);
  }
  #else
  
- static int migrate_folio_add(struct folio *folio, struct list_head *foliolist,
+ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
                                unsigned long flags)
  {
-       return -EIO;
+       return false;
  }
  
  int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
        return -ENOSYS;
  }
  
- static struct folio *new_folio(struct folio *src, unsigned long start)
+ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+                                                   unsigned long private)
  {
        return NULL;
  }
@@@ -1269,10 -1218,11 +1222,11 @@@ static long do_mbind(unsigned long star
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma, *prev;
        struct vma_iterator vmi;
+       struct migration_mpol mmpol;
        struct mempolicy *new;
        unsigned long end;
-       int err;
-       int ret;
+       long err;
+       long nr_failed;
        LIST_HEAD(pagelist);
  
        if (flags & ~(unsigned long)MPOL_MF_VALID)
        if (IS_ERR(new))
                return PTR_ERR(new);
  
-       if (flags & MPOL_MF_LAZY)
-               new->flags |= MPOL_F_MOF;
        /*
         * If we are using the default policy then operation
         * on discontinuous address spaces is okay after all
        if (!new)
                flags |= MPOL_MF_DISCONTIG_OK;
  
-       pr_debug("mbind %lx-%lx mode:%d flags:%d nodes:%lx\n",
-                start, start + len, mode, mode_flags,
-                nmask ? nodes_addr(*nmask)[0] : NUMA_NO_NODE);
-       if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+       if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
                lru_cache_disable();
-       }
        {
                NODEMASK_SCRATCH(scratch);
                if (scratch) {
                goto mpol_out;
  
        /*
-        * Lock the VMAs before scanning for pages to migrate, to ensure we don't
-        * miss a concurrently inserted page.
+        * Lock the VMAs before scanning for pages to migrate,
+        * to ensure we don't miss a concurrently inserted page.
         */
-       ret = queue_pages_range(mm, start, end, nmask,
-                         flags | MPOL_MF_INVERT, &pagelist, true);
+       nr_failed = queue_pages_range(mm, start, end, nmask,
+                       flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
  
-       if (ret < 0) {
-               err = ret;
-               goto up_out;
-       }
-       vma_iter_init(&vmi, mm, start);
-       prev = vma_prev(&vmi);
-       for_each_vma_range(vmi, vma, end) {
-               err = mbind_range(&vmi, vma, &prev, start, end, new);
-               if (err)
-                       break;
+       if (nr_failed < 0) {
+               err = nr_failed;
+               nr_failed = 0;
+       } else {
+               vma_iter_init(&vmi, mm, start);
+               prev = vma_prev(&vmi);
+               for_each_vma_range(vmi, vma, end) {
+                       err = mbind_range(&vmi, vma, &prev, start, end, new);
+                       if (err)
+                               break;
+               }
        }
  
-       if (!err) {
-               int nr_failed = 0;
-               if (!list_empty(&pagelist)) {
-                       WARN_ON_ONCE(flags & MPOL_MF_LAZY);
-                       nr_failed = migrate_pages(&pagelist, new_folio, NULL,
-                               start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL);
-                       if (nr_failed)
-                               putback_movable_pages(&pagelist);
+       if (!err && !list_empty(&pagelist)) {
+               /* Convert MPOL_DEFAULT's NULL to task or default policy */
+               if (!new) {
+                       new = get_task_policy(current);
+                       mpol_get(new);
                }
+               mmpol.pol = new;
+               mmpol.ilx = 0;
  
-               if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
-                       err = -EIO;
-       } else {
- up_out:
-               if (!list_empty(&pagelist))
-                       putback_movable_pages(&pagelist);
+               /*
+                * In the interleaved case, attempt to allocate on exactly the
+                * targeted nodes, for the first VMA to be migrated; for later
+                * VMAs, the nodes will still be interleaved from the targeted
+                * nodemask, but one by one may be selected differently.
+                */
+               if (new->mode == MPOL_INTERLEAVE) {
+                       struct page *page;
+                       unsigned int order;
+                       unsigned long addr = -EFAULT;
+                       list_for_each_entry(page, &pagelist, lru) {
+                               if (!PageKsm(page))
+                                       break;
+                       }
+                       if (!list_entry_is_head(page, &pagelist, lru)) {
+                               vma_iter_init(&vmi, mm, start);
+                               for_each_vma_range(vmi, vma, end) {
+                                       addr = page_address_in_vma(page, vma);
+                                       if (addr != -EFAULT)
+                                               break;
+                               }
+                       }
+                       if (addr != -EFAULT) {
+                               order = compound_order(page);
+                               /* We already know the pol, but not the ilx */
+                               mpol_cond_put(get_vma_policy(vma, addr, order,
+                                                            &mmpol.ilx));
+                               /* Set base from which to increment by index */
+                               mmpol.ilx -= page->index >> order;
+                       }
+               }
        }
  
        mmap_write_unlock(mm);
+       if (!err && !list_empty(&pagelist)) {
+               nr_failed |= migrate_pages(&pagelist,
+                               alloc_migration_target_by_mpol, NULL,
+                               (unsigned long)&mmpol, MIGRATE_SYNC,
+                               MR_MEMPOLICY_MBIND, NULL);
+       }
+       if (nr_failed && (flags & MPOL_MF_STRICT))
+               err = -EIO;
+       if (!list_empty(&pagelist))
+               putback_movable_pages(&pagelist);
  mpol_out:
        mpol_put(new);
        if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
@@@ -1690,7 -1667,6 +1671,6 @@@ out
  out_put:
        put_task_struct(task);
        goto out;
  }
  
  SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
        return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes);
  }
  
  /* Retrieve NUMA policy */
  static int kernel_get_mempolicy(int __user *policy,
                                unsigned long __user *nmask,
@@@ -1767,34 -1742,19 +1746,19 @@@ bool vma_migratable(struct vm_area_stru
  }
  
  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
-                                               unsigned long addr)
+                                  unsigned long addr, pgoff_t *ilx)
  {
-       struct mempolicy *pol = NULL;
-       if (vma) {
-               if (vma->vm_ops && vma->vm_ops->get_policy) {
-                       pol = vma->vm_ops->get_policy(vma, addr);
-               } else if (vma->vm_policy) {
-                       pol = vma->vm_policy;
-                       /*
-                        * shmem_alloc_page() passes MPOL_F_SHARED policy with
-                        * a pseudo vma whose vma->vm_ops=NULL. Take a reference
-                        * count on these policies which will be dropped by
-                        * mpol_cond_put() later
-                        */
-                       if (mpol_needs_cond_ref(pol))
-                               mpol_get(pol);
-               }
-       }
-       return pol;
+       *ilx = 0;
+       return (vma->vm_ops && vma->vm_ops->get_policy) ?
+               vma->vm_ops->get_policy(vma, addr, ilx) : vma->vm_policy;
  }
  
  /*
-  * get_vma_policy(@vma, @addr)
+  * get_vma_policy(@vma, @addr, @order, @ilx)
   * @vma: virtual memory area whose policy is sought
   * @addr: address in @vma for shared policy lookup
+  * @order: 0, or appropriate huge_page_order for interleaving
+  * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
   *
   * Returns effective policy for a VMA at specified address.
   * Falls back to current->mempolicy or system default policy, as necessary.
   * freeing by another task.  It is the caller's responsibility to free the
   * extra reference for shared policies.
   */
- static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-                                               unsigned long addr)
+ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+                                unsigned long addr, int order, pgoff_t *ilx)
  {
-       struct mempolicy *pol = __get_vma_policy(vma, addr);
+       struct mempolicy *pol;
  
+       pol = __get_vma_policy(vma, addr, ilx);
        if (!pol)
                pol = get_task_policy(current);
+       if (pol->mode == MPOL_INTERLEAVE) {
+               *ilx += vma->vm_pgoff >> order;
+               *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
+       }
        return pol;
  }
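
For MPOL_INTERLEAVE, get_vma_policy() now also computes an interleave index from the VMA's file offset and the faulting address, shifted by the allocation order so a huge page advances the index once per huge-page-sized step. A worked example with made-up numbers, assuming PAGE_SHIFT of 12 (4KiB pages):

    #include <stdio.h>

    #define PAGE_SHIFT 12

    /* ilx = (vm_pgoff >> order) + ((addr - vm_start) >> (PAGE_SHIFT + order)) */
    static unsigned long interleave_index(unsigned long vm_pgoff,
                                          unsigned long vm_start,
                                          unsigned long addr, int order)
    {
            return (vm_pgoff >> order) +
                   ((addr - vm_start) >> (PAGE_SHIFT + order));
    }

    int main(void)
    {
            /* Order-0 page: third page into a VMA whose file offset is page 16. */
            printf("%lu\n", interleave_index(16, 0x400000, 0x402000, 0));   /* 16 + 2 = 18 */
            /* Order-9 (2MiB) page: the index advances once per 2MiB. */
            printf("%lu\n", interleave_index(1024, 0x400000, 0x600000, 9)); /*  2 + 1 =  3 */
            return 0;
    }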
  
@@@ -1820,8 -1784,9 +1788,9 @@@ bool vma_policy_mof(struct vm_area_stru
  
        if (vma->vm_ops && vma->vm_ops->get_policy) {
                bool ret = false;
+               pgoff_t ilx;            /* ignored here */
  
-               pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+               pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
                if (pol && (pol->flags & MPOL_F_MOF))
                        ret = true;
                mpol_cond_put(pol);
@@@ -1856,64 -1821,15 +1825,15 @@@ bool apply_policy_zone(struct mempolic
        return zone >= dynamic_policy_zone;
  }
  
- /*
-  * Return a nodemask representing a mempolicy for filtering nodes for
-  * page allocation
-  */
- nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
- {
-       int mode = policy->mode;
-       /* Lower zones don't get a nodemask applied for MPOL_BIND */
-       if (unlikely(mode == MPOL_BIND) &&
-               apply_policy_zone(policy, gfp_zone(gfp)) &&
-               cpuset_nodemask_valid_mems_allowed(&policy->nodes))
-               return &policy->nodes;
-       if (mode == MPOL_PREFERRED_MANY)
-               return &policy->nodes;
-       return NULL;
- }
- /*
-  * Return the  preferred node id for 'prefer' mempolicy, and return
-  * the given id for all other policies.
-  *
-  * policy_node() is always coupled with policy_nodemask(), which
-  * secures the nodemask limit for 'bind' and 'prefer-many' policy.
-  */
- static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
- {
-       if (policy->mode == MPOL_PREFERRED) {
-               nd = first_node(policy->nodes);
-       } else {
-               /*
-                * __GFP_THISNODE shouldn't even be used with the bind policy
-                * because we might easily break the expectation to stay on the
-                * requested node and not break the policy.
-                */
-               WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
-       }
-       if ((policy->mode == MPOL_BIND ||
-            policy->mode == MPOL_PREFERRED_MANY) &&
-           policy->home_node != NUMA_NO_NODE)
-               return policy->home_node;
-       return nd;
- }
  /* Do dynamic interleaving for a process */
- static unsigned interleave_nodes(struct mempolicy *policy)
+ static unsigned int interleave_nodes(struct mempolicy *policy)
  {
-       unsigned next;
-       struct task_struct *me = current;
+       unsigned int nid;
  
-       next = next_node_in(me->il_prev, policy->nodes);
-       if (next < MAX_NUMNODES)
-               me->il_prev = next;
-       return next;
+       nid = next_node_in(current->il_prev, policy->nodes);
+       if (nid < MAX_NUMNODES)
+               current->il_prev = nid;
+       return nid;
  }
  
  /*
@@@ -1964,11 -1880,11 +1884,11 @@@ unsigned int mempolicy_slab_node(void
  }
  
  /*
-  * Do static interleaving for a VMA with known offset @n.  Returns the n'th
-  * node in pol->nodes (starting from n=0), wrapping around if n exceeds the
-  * number of present nodes.
+  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
+  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
+  * exceeds the number of present nodes.
   */
- static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
+ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
  {
        nodemask_t nodemask = pol->nodes;
        unsigned int target, nnodes;
        nnodes = nodes_weight(nodemask);
        if (!nnodes)
                return numa_node_id();
-       target = (unsigned int)n % nnodes;
+       target = ilx % nnodes;
        nid = first_node(nodemask);
        for (i = 0; i < target; i++)
                nid = next_node(nid, nodemask);
        return nid;
  }
  
- /* Determine a node number for interleave */
- static inline unsigned interleave_nid(struct mempolicy *pol,
-                struct vm_area_struct *vma, unsigned long addr, int shift)
+ /*
+  * Return a nodemask representing a mempolicy for filtering nodes for
+  * page allocation, together with preferred node id (or the input node id).
+  */
+ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+                                  pgoff_t ilx, int *nid)
  {
-       if (vma) {
-               unsigned long off;
+       nodemask_t *nodemask = NULL;
  
+       switch (pol->mode) {
+       case MPOL_PREFERRED:
+               /* Override input node id */
+               *nid = first_node(pol->nodes);
+               break;
+       case MPOL_PREFERRED_MANY:
+               nodemask = &pol->nodes;
+               if (pol->home_node != NUMA_NO_NODE)
+                       *nid = pol->home_node;
+               break;
+       case MPOL_BIND:
+               /* Restrict to nodemask (but not on lower zones) */
+               if (apply_policy_zone(pol, gfp_zone(gfp)) &&
+                   cpuset_nodemask_valid_mems_allowed(&pol->nodes))
+                       nodemask = &pol->nodes;
+               if (pol->home_node != NUMA_NO_NODE)
+                       *nid = pol->home_node;
                /*
-                * for small pages, there is no difference between
-                * shift and PAGE_SHIFT, so the bit-shift is safe.
-                * for huge pages, since vm_pgoff is in units of small
-                * pages, we need to shift off the always 0 bits to get
-                * a useful offset.
+                * __GFP_THISNODE shouldn't even be used with the bind policy
+                * because we might easily break the expectation to stay on the
+                * requested node and not break the policy.
                 */
-               BUG_ON(shift < PAGE_SHIFT);
-               off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
-               off += (addr - vma->vm_start) >> shift;
-               return offset_il_node(pol, off);
-       } else
-               return interleave_nodes(pol);
+               WARN_ON_ONCE(gfp & __GFP_THISNODE);
+               break;
+       case MPOL_INTERLEAVE:
+               /* Override input node id */
+               *nid = (ilx == NO_INTERLEAVE_INDEX) ?
+                       interleave_nodes(pol) : interleave_nid(pol, ilx);
+               break;
+       }
+       return nodemask;
  }
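
policy_nodemask() above now also covers what policy_node() used to do: a single call hands the allocator an optional filter nodemask and may override the preferred node id (first node for MPOL_PREFERRED, home_node for bind/prefer-many, the interleave pick for MPOL_INTERLEAVE). A userspace sketch of that contract, illustrative only and not kernel code (the mode names, the plain bitmask and the helpers are stand-ins, and the zone/cpuset checks on the bind path are omitted):

#include <stdio.h>

enum mode { PREFERRED, PREFERRED_MANY, BIND, INTERLEAVE };

struct policy {
        enum mode mode;
        unsigned long nodes;    /* bit n set means node n is allowed */
        int home_node;          /* -1 when unset */
};

static int first_node_of(unsigned long mask)
{
        return __builtin_ctzl(mask);
}

static int pick_interleave(const struct policy *pol, unsigned long ilx)
{
        unsigned long mask = pol->nodes;
        int n = ilx % __builtin_popcountl(mask);

        while (n--)
                mask &= mask - 1;       /* drop the lowest set bit */
        return first_node_of(mask);
}

/* returns the filter mask (or NULL) and may override the preferred *nid */
static const unsigned long *nodemask_and_nid(const struct policy *pol,
                                             unsigned long ilx, int *nid)
{
        const unsigned long *mask = NULL;

        switch (pol->mode) {
        case PREFERRED:
                *nid = first_node_of(pol->nodes);
                break;
        case PREFERRED_MANY:
        case BIND:
                mask = &pol->nodes;
                if (pol->home_node >= 0)
                        *nid = pol->home_node;
                break;
        case INTERLEAVE:
                *nid = pick_interleave(pol, ilx);
                break;
        }
        return mask;
}

int main(void)
{
        struct policy pol = { .mode = BIND, .nodes = 0x6, .home_node = 2 };
        int nid = 0;    /* pretend the local node is 0 */
        const unsigned long *mask = nodemask_and_nid(&pol, 0, &nid);

        printf("allocate on node %d, filter mask %#lx\n",
               nid, mask ? *mask : 0UL);
        return 0;
}
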
  
  #ifdef CONFIG_HUGETLBFS
   * to the struct mempolicy for conditional unref after allocation.
   * If the effective policy is 'bind' or 'prefer-many', returns a pointer
   * to the mempolicy's @nodemask for filtering the zonelist.
-  *
-  * Must be protected by read_mems_allowed_begin()
   */
  int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
-                               struct mempolicy **mpol, nodemask_t **nodemask)
+               struct mempolicy **mpol, nodemask_t **nodemask)
  {
+       pgoff_t ilx;
        int nid;
-       int mode;
-       *mpol = get_vma_policy(vma, addr);
-       *nodemask = NULL;
-       mode = (*mpol)->mode;
  
-       if (unlikely(mode == MPOL_INTERLEAVE)) {
-               nid = interleave_nid(*mpol, vma, addr,
-                                       huge_page_shift(hstate_vma(vma)));
-       } else {
-               nid = policy_node(gfp_flags, *mpol, numa_node_id());
-               if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
-                       *nodemask = &(*mpol)->nodes;
-       }
+       nid = numa_node_id();
+       *mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
+       *nodemask = policy_nodemask(gfp_flags, *mpol, ilx, &nid);
        return nid;
  }
  
@@@ -2126,27 -2052,8 +2056,8 @@@ bool mempolicy_in_oom_domain(struct tas
        return ret;
  }
  
- /* Allocate a page in interleaved policy.
-    Own path because it needs to do special accounting. */
- static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
-                                       unsigned nid)
- {
-       struct page *page;
-       page = __alloc_pages(gfp, order, nid, NULL);
-       /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
-       if (!static_branch_likely(&vm_numa_stat_key))
-               return page;
-       if (page && page_to_nid(page) == nid) {
-               preempt_disable();
-               __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
-               preempt_enable();
-       }
-       return page;
- }
  static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
-                                               int nid, struct mempolicy *pol)
+                                               int nid, nodemask_t *nodemask)
  {
        struct page *page;
        gfp_t preferred_gfp;
         */
        preferred_gfp = gfp | __GFP_NOWARN;
        preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
-       page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
+       page = __alloc_pages(preferred_gfp, order, nid, nodemask);
        if (!page)
                page = __alloc_pages(gfp, order, nid, NULL);
  
  }
  
  /**
-  * vma_alloc_folio - Allocate a folio for a VMA.
+  * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
   * @gfp: GFP flags.
-  * @order: Order of the folio.
-  * @vma: Pointer to VMA or NULL if not available.
-  * @addr: Virtual address of the allocation.  Must be inside @vma.
-  * @hugepage: For hugepages try only the preferred node if possible.
+  * @order: Order of the page allocation.
+  * @pol: Pointer to the NUMA mempolicy.
+  * @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
+  * @nid: Preferred node (usually numa_node_id() but @mpol may override it).
   *
-  * Allocate a folio for a specific address in @vma, using the appropriate
-  * NUMA policy.  When @vma is not NULL the caller must hold the mmap_lock
-  * of the mm_struct of the VMA to prevent it from going away.  Should be
-  * used for all allocations for folios that will be mapped into user space.
-  *
-  * Return: The folio on success or NULL if allocation fails.
+  * Return: The page on success or NULL if allocation fails.
   */
- struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
-               unsigned long addr, bool hugepage)
+ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+               struct mempolicy *pol, pgoff_t ilx, int nid)
  {
-       struct mempolicy *pol;
-       int node = numa_node_id();
-       struct folio *folio;
-       int preferred_nid;
-       nodemask_t *nmask;
-       pol = get_vma_policy(vma, addr);
-       if (pol->mode == MPOL_INTERLEAVE) {
-               struct page *page;
-               unsigned nid;
-               nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-               mpol_cond_put(pol);
-               gfp |= __GFP_COMP;
-               page = alloc_page_interleave(gfp, order, nid);
-               folio = (struct folio *)page;
-               if (folio && order > 1)
-                       folio_prep_large_rmappable(folio);
-               goto out;
-       }
-       if (pol->mode == MPOL_PREFERRED_MANY) {
-               struct page *page;
+       nodemask_t *nodemask;
+       struct page *page;
  
-               node = policy_node(gfp, pol, node);
-               gfp |= __GFP_COMP;
-               page = alloc_pages_preferred_many(gfp, order, node, pol);
-               mpol_cond_put(pol);
-               folio = (struct folio *)page;
-               if (folio && order > 1)
-                       folio_prep_large_rmappable(folio);
-               goto out;
-       }
+       nodemask = policy_nodemask(gfp, pol, ilx, &nid);
  
-       if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-               int hpage_node = node;
+       if (pol->mode == MPOL_PREFERRED_MANY)
+               return alloc_pages_preferred_many(gfp, order, nid, nodemask);
  
+       if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+           /* filter "hugepage" allocation, unless from alloc_pages() */
+           order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) {
                /*
                 * For hugepage allocation and non-interleave policy which
                 * allows the current node (or other explicitly preferred
                 * If the policy is interleave or does not allow the current
                 * node in its nodemask, we allocate the standard way.
                 */
-               if (pol->mode == MPOL_PREFERRED)
-                       hpage_node = first_node(pol->nodes);
-               nmask = policy_nodemask(gfp, pol);
-               if (!nmask || node_isset(hpage_node, *nmask)) {
-                       mpol_cond_put(pol);
+               if (pol->mode != MPOL_INTERLEAVE &&
+                   (!nodemask || node_isset(nid, *nodemask))) {
                        /*
                         * First, try to allocate THP only on local node, but
                         * don't reclaim unnecessarily, just compact.
                         */
-                       folio = __folio_alloc_node(gfp | __GFP_THISNODE |
-                                       __GFP_NORETRY, order, hpage_node);
+                       page = __alloc_pages_node(nid,
+                               gfp | __GFP_THISNODE | __GFP_NORETRY, order);
+                       if (page || !(gfp & __GFP_DIRECT_RECLAIM))
+                               return page;
                        /*
                         * If hugepage allocations are configured to always
                         * synchronous compact or the vma has been madvised
                         * to prefer hugepage backing, retry allowing remote
                         * memory with both reclaim and compact as well.
                         */
-                       if (!folio && (gfp & __GFP_DIRECT_RECLAIM))
-                               folio = __folio_alloc(gfp, order, hpage_node,
-                                                     nmask);
+               }
+       }
  
-                       goto out;
+       page = __alloc_pages(gfp, order, nid, nodemask);
+       if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
+               /* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
+               if (static_branch_likely(&vm_numa_stat_key) &&
+                   page_to_nid(page) == nid) {
+                       preempt_disable();
+                       __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
+                       preempt_enable();
                }
        }
  
-       nmask = policy_nodemask(gfp, pol);
-       preferred_nid = policy_node(gfp, pol, node);
-       folio = __folio_alloc(gfp, order, preferred_nid, nmask);
+       return page;
+ }
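
The hugepage branch above keeps the old fallback order: when the policy allows the preferred node, first try that node alone with __GFP_THISNODE | __GFP_NORETRY (compact locally, don't reclaim), and only if that fails and the caller allowed direct reclaim retry the normal way with the full nodemask. A toy model of that decision, illustrative only and not kernel code (try_alloc() and the flag plumbing are made up):

#include <stdbool.h>
#include <stdio.h>

struct request { bool may_reclaim; };

/* pretend allocator: in this demo the node-restricted first pass always fails */
static bool try_alloc(int nid, bool this_node_only, bool no_retry)
{
        (void)no_retry;
        return !this_node_only && nid >= 0;
}

static bool alloc_thp(const struct request *req, int preferred_nid,
                      bool node_allowed_by_policy)
{
        if (node_allowed_by_policy) {
                /* first pass: preferred node only, compact but don't reclaim */
                if (try_alloc(preferred_nid, true, true))
                        return true;
                if (!req->may_reclaim)
                        return false;   /* caller forbade the heavy path */
        }
        /* second pass: normal allocation, remote nodes permitted */
        return try_alloc(preferred_nid, false, false);
}

int main(void)
{
        struct request req = { .may_reclaim = true };

        printf("THP allocated: %s\n",
               alloc_thp(&req, 1, true) ? "yes" : "no");
        return 0;
}
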
+ /**
+  * vma_alloc_folio - Allocate a folio for a VMA.
+  * @gfp: GFP flags.
+  * @order: Order of the folio.
+  * @vma: Pointer to VMA.
+  * @addr: Virtual address of the allocation.  Must be inside @vma.
+  * @hugepage: Unused (was: For hugepages try only preferred node if possible).
+  *
+  * Allocate a folio for a specific address in @vma, using the appropriate
+  * NUMA policy.  The caller must hold the mmap_lock of the mm_struct of the
+  * VMA to prevent it from going away.  Should be used for all allocations
+  * for folios that will be mapped into user space, excepting hugetlbfs, and
+  * excepting where direct use of alloc_pages_mpol() is more appropriate.
+  *
+  * Return: The folio on success or NULL if allocation fails.
+  */
+ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+               unsigned long addr, bool hugepage)
+ {
+       struct mempolicy *pol;
+       pgoff_t ilx;
+       struct page *page;
+       pol = get_vma_policy(vma, addr, order, &ilx);
+       page = alloc_pages_mpol(gfp | __GFP_COMP, order,
+                               pol, ilx, numa_node_id());
        mpol_cond_put(pol);
- out:
-       return folio;
+       return page_rmappable_folio(page);
  }
  EXPORT_SYMBOL(vma_alloc_folio);
  
   * flags are used.
   * Return: The page on success or NULL if allocation fails.
   */
- struct page *alloc_pages(gfp_t gfp, unsigned order)
+ struct page *alloc_pages(gfp_t gfp, unsigned int order)
  {
        struct mempolicy *pol = &default_policy;
-       struct page *page;
-       if (!in_interrupt() && !(gfp & __GFP_THISNODE))
-               pol = get_task_policy(current);
  
        /*
         * No reference counting needed for current->mempolicy
         * nor system default_policy
         */
-       if (pol->mode == MPOL_INTERLEAVE)
-               page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-       else if (pol->mode == MPOL_PREFERRED_MANY)
-               page = alloc_pages_preferred_many(gfp, order,
-                                 policy_node(gfp, pol, numa_node_id()), pol);
-       else
-               page = __alloc_pages(gfp, order,
-                               policy_node(gfp, pol, numa_node_id()),
-                               policy_nodemask(gfp, pol));
+       if (!in_interrupt() && !(gfp & __GFP_THISNODE))
+               pol = get_task_policy(current);
  
-       return page;
+       return alloc_pages_mpol(gfp, order,
+                               pol, NO_INTERLEAVE_INDEX, numa_node_id());
  }
  EXPORT_SYMBOL(alloc_pages);
  
- struct folio *folio_alloc(gfp_t gfp, unsigned order)
+ struct folio *folio_alloc(gfp_t gfp, unsigned int order)
  {
-       struct page *page = alloc_pages(gfp | __GFP_COMP, order);
-       struct folio *folio = (struct folio *)page;
-       if (folio && order > 1)
-               folio_prep_large_rmappable(folio);
-       return folio;
+       return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order));
  }
  EXPORT_SYMBOL(folio_alloc);
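
After this rework, vma_alloc_folio(), alloc_pages() and folio_alloc() above are thin wrappers around alloc_pages_mpol(); they differ only in where the mempolicy and the interleave index come from (the VMA and faulting address, or the task/default policy with the NO_INTERLEAVE_INDEX sentinel). A userspace sketch of that shape, illustrative only and not kernel code (the types, NO_ILX and the print-as-allocate core are stand-ins):

#include <stdio.h>

#define NO_ILX ((unsigned long)-1)

struct policy { int preferred_node; };

/* the single core allocator: everything NUMA-related arrives as arguments */
static int alloc_core(const struct policy *pol, unsigned long ilx, int nid)
{
        if (pol->preferred_node >= 0)
                nid = pol->preferred_node;
        printf("core alloc on node %d (ilx %ld)\n", nid, (long)ilx);
        return nid;
}

/* vma path: policy and ilx come from the mapping and faulting address */
static int alloc_for_vma(const struct policy *vma_pol,
                         unsigned long addr, unsigned long vm_start)
{
        unsigned long ilx = (addr - vm_start) >> 12;    /* page index */

        return alloc_core(vma_pol, ilx, 0 /* local node */);
}

/* task path: task/default policy, no per-address interleave index */
static int alloc_for_task(const struct policy *task_pol)
{
        return alloc_core(task_pol, NO_ILX, 0 /* local node */);
}

int main(void)
{
        struct policy pol = { .preferred_node = 1 };

        alloc_for_vma(&pol, 0x7000, 0x5000);
        alloc_for_task(&pol);
        return 0;
}
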
  
@@@ -2384,6 -2273,8 +2277,8 @@@ unsigned long alloc_pages_bulk_array_me
                unsigned long nr_pages, struct page **page_array)
  {
        struct mempolicy *pol = &default_policy;
+       nodemask_t *nodemask;
+       int nid;
  
        if (!in_interrupt() && !(gfp & __GFP_THISNODE))
                pol = get_task_policy(current);
                return alloc_pages_bulk_array_preferred_many(gfp,
                                numa_node_id(), pol, nr_pages, page_array);
  
-       return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
-                                 policy_nodemask(gfp, pol), nr_pages, NULL,
-                                 page_array);
+       nid = numa_node_id();
+       nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
+       return __alloc_pages_bulk(gfp, nid, nodemask,
+                                 nr_pages, NULL, page_array);
  }
  
  int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
  {
-       struct mempolicy *pol = mpol_dup(vma_policy(src));
+       struct mempolicy *pol = mpol_dup(src->vm_policy);
  
        if (IS_ERR(pol))
                return PTR_ERR(pol);
@@@ -2488,8 -2380,8 +2384,8 @@@ bool __mpol_equal(struct mempolicy *a, 
   * lookup first element intersecting start-end.  Caller holds sp->lock for
   * reading or for writing
   */
- static struct sp_node *
- sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
+ static struct sp_node *sp_lookup(struct shared_policy *sp,
+                                       pgoff_t start, pgoff_t end)
  {
        struct rb_node *n = sp->root.rb_node;
  
@@@ -2540,13 -2432,11 +2436,11 @@@ static void sp_insert(struct shared_pol
        }
        rb_link_node(&new->nd, parent, p);
        rb_insert_color(&new->nd, &sp->root);
-       pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
-                new->policy ? new->policy->mode : 0);
  }
  
  /* Find shared policy intersecting idx */
- struct mempolicy *
- mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
+                                             pgoff_t idx)
  {
        struct mempolicy *pol = NULL;
        struct sp_node *sn;
@@@ -2570,39 -2460,38 +2464,38 @@@ static void sp_free(struct sp_node *n
  }
  
  /**
-  * mpol_misplaced - check whether current page node is valid in policy
+  * mpol_misplaced - check whether current folio node is valid in policy
   *
-  * @page: page to be checked
-  * @vma: vm area where page mapped
-  * @addr: virtual address where page mapped
+  * @folio: folio to be checked
+  * @vma: vm area where folio mapped
+  * @addr: virtual address in @vma for shared policy lookup and interleave policy
   *
-  * Lookup current policy node id for vma,addr and "compare to" page's
+  * Lookup current policy node id for vma,addr and "compare to" folio's
   * node id.  Policy determination "mimics" alloc_page_vma().
   * Called from fault path where we know the vma and faulting address.
   *
   * Return: NUMA_NO_NODE if the page is in a node that is valid for this
-  * policy, or a suitable node ID to allocate a replacement page from.
+  * policy, or a suitable node ID to allocate a replacement folio from.
   */
- int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+                  unsigned long addr)
  {
        struct mempolicy *pol;
+       pgoff_t ilx;
        struct zoneref *z;
-       int curnid = page_to_nid(page);
-       unsigned long pgoff;
+       int curnid = folio_nid(folio);
        int thiscpu = raw_smp_processor_id();
        int thisnid = cpu_to_node(thiscpu);
        int polnid = NUMA_NO_NODE;
        int ret = NUMA_NO_NODE;
  
-       pol = get_vma_policy(vma, addr);
+       pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
        if (!(pol->flags & MPOL_F_MOF))
                goto out;
  
        switch (pol->mode) {
        case MPOL_INTERLEAVE:
-               pgoff = vma->vm_pgoff;
-               pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
-               polnid = offset_il_node(pol, pgoff);
+               polnid = interleave_nid(pol, ilx);
                break;
  
        case MPOL_PREFERRED:
                BUG();
        }
  
-       /* Migrate the page towards the node whose CPU is referencing it */
+       /* Migrate the folio towards the node whose CPU is referencing it */
        if (pol->flags & MPOL_F_MORON) {
                polnid = thisnid;
  
-               if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
+               if (!should_numa_migrate_memory(current, folio, curnid,
+                                               thiscpu))
                        goto out;
        }
  
@@@ -2678,7 -2568,6 +2572,6 @@@ void mpol_put_task_policy(struct task_s
  
  static void sp_delete(struct shared_policy *sp, struct sp_node *n)
  {
-       pr_debug("deleting %lx-l%lx\n", n->start, n->end);
        rb_erase(&n->nd, &sp->root);
        sp_free(n);
  }
@@@ -2713,8 -2602,8 +2606,8 @@@ static struct sp_node *sp_alloc(unsigne
  }
  
  /* Replace a policy range. */
- static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
-                                unsigned long end, struct sp_node *new)
+ static int shared_policy_replace(struct shared_policy *sp, pgoff_t start,
+                                pgoff_t end, struct sp_node *new)
  {
        struct sp_node *n;
        struct sp_node *n_new = NULL;
@@@ -2797,30 -2686,30 +2690,30 @@@ void mpol_shared_policy_init(struct sha
        rwlock_init(&sp->lock);
  
        if (mpol) {
-               struct vm_area_struct pvma;
-               struct mempolicy *new;
+               struct sp_node *sn;
+               struct mempolicy *npol;
                NODEMASK_SCRATCH(scratch);
  
                if (!scratch)
                        goto put_mpol;
-               /* contextualize the tmpfs mount point mempolicy */
-               new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
-               if (IS_ERR(new))
+               /* contextualize the tmpfs mount point mempolicy to this file */
+               npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
+               if (IS_ERR(npol))
                        goto free_scratch; /* no valid nodemask intersection */
  
                task_lock(current);
-               ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
+               ret = mpol_set_nodemask(npol, &mpol->w.user_nodemask, scratch);
                task_unlock(current);
                if (ret)
-                       goto put_new;
-               /* Create pseudo-vma that contains just the policy */
-               vma_init(&pvma, NULL);
-               pvma.vm_end = TASK_SIZE;        /* policy covers entire file */
-               mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
- put_new:
-               mpol_put(new);                  /* drop initial ref */
+                       goto put_npol;
+               /* alloc node covering entire file; adds ref to file's npol */
+               sn = sp_alloc(0, MAX_LFS_FILESIZE >> PAGE_SHIFT, npol);
+               if (sn)
+                       sp_insert(sp, sn);
+ put_npol:
+               mpol_put(npol); /* drop initial ref on file's npol */
  free_scratch:
                NODEMASK_SCRATCH_FREE(scratch);
  put_mpol:
        }
  }
  
- int mpol_set_shared_policy(struct shared_policy *info,
-                       struct vm_area_struct *vma, struct mempolicy *npol)
+ int mpol_set_shared_policy(struct shared_policy *sp,
+                       struct vm_area_struct *vma, struct mempolicy *pol)
  {
        int err;
        struct sp_node *new = NULL;
        unsigned long sz = vma_pages(vma);
  
-       pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n",
-                vma->vm_pgoff,
-                sz, npol ? npol->mode : -1,
-                npol ? npol->flags : -1,
-                npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE);
-       if (npol) {
-               new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+       if (pol) {
+               new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, pol);
                if (!new)
                        return -ENOMEM;
        }
-       err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+       err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff + sz, new);
        if (err && new)
                sp_free(new);
        return err;
  }
  
  /* Free a backing policy store on inode delete. */
- void mpol_free_shared_policy(struct shared_policy *p)
+ void mpol_free_shared_policy(struct shared_policy *sp)
  {
        struct sp_node *n;
        struct rb_node *next;
  
-       if (!p->root.rb_node)
+       if (!sp->root.rb_node)
                return;
-       write_lock(&p->lock);
-       next = rb_first(&p->root);
+       write_lock(&sp->lock);
+       next = rb_first(&sp->root);
        while (next) {
                n = rb_entry(next, struct sp_node, nd);
                next = rb_next(&n->nd);
-               sp_delete(p, n);
+               sp_delete(sp, n);
        }
-       write_unlock(&p->lock);
+       write_unlock(&sp->lock);
  }
  
  #ifdef CONFIG_NUMA_BALANCING
@@@ -2917,7 -2800,6 +2804,6 @@@ static inline void __init check_numabal
  }
  #endif /* CONFIG_NUMA_BALANCING */
  
- /* assumes fs == KERNEL_DS */
  void __init numa_policy_init(void)
  {
        nodemask_t interleave_nodes;
@@@ -2980,7 -2862,6 +2866,6 @@@ void numa_default_policy(void
  /*
   * Parse and format mempolicy from/to strings
   */
  static const char * const policy_modes[] =
  {
        [MPOL_DEFAULT]    = "default",
        [MPOL_PREFERRED_MANY]  = "prefer (many)",
  };
  
  #ifdef CONFIG_TMPFS
  /**
   * mpol_parse_str - parse string to mempolicy, for tmpfs mpol mount option.
diff --combined mm/mmap.c
index da2e3bd6dba146bc47641a0875c590d3b5cf3e24,984804d77ae1bce93c183405c604dfff7e043f58..1971bfffcc032f3666e28c1855718c31e5de70f3
+++ b/mm/mmap.c
@@@ -107,7 -107,7 +107,7 @@@ void vma_set_page_prot(struct vm_area_s
  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
                struct file *file, struct address_space *mapping)
  {
-       if (vma->vm_flags & VM_SHARED)
+       if (vma_is_shared_maywrite(vma))
                mapping_unmap_writable(mapping);
  
        flush_dcache_mmap_lock(mapping);
@@@ -384,7 -384,7 +384,7 @@@ static unsigned long count_vma_pages_ra
  static void __vma_link_file(struct vm_area_struct *vma,
                            struct address_space *mapping)
  {
-       if (vma->vm_flags & VM_SHARED)
+       if (vma_is_shared_maywrite(vma))
                mapping_allow_writable(mapping);
  
        flush_dcache_mmap_lock(mapping);
@@@ -860,13 -860,13 +860,13 @@@ can_vma_merge_after(struct vm_area_stru
   * **** is not represented - it will be merged and the vma containing the
   *      area is returned, or the function will return NULL
   */
- struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
-                       struct vm_area_struct *prev, unsigned long addr,
-                       unsigned long end, unsigned long vm_flags,
-                       struct anon_vma *anon_vma, struct file *file,
-                       pgoff_t pgoff, struct mempolicy *policy,
-                       struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
-                       struct anon_vma_name *anon_name)
+ static struct vm_area_struct
+ *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
+          struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+          unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file,
+          pgoff_t pgoff, struct mempolicy *policy,
+          struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+          struct anon_vma_name *anon_name)
  {
        struct vm_area_struct *curr, *next, *res;
        struct vm_area_struct *vma, *adjust, *remove, *remove2;
                        vma_start_write(curr);
                        remove = curr;
                        remove2 = next;
+                       /*
+                        * Note that the dup_anon_vma below cannot overwrite err
+                        * since the first caller would do nothing unless next
+                        * has an anon_vma.
+                        */
                        if (!next->anon_vma)
                                err = dup_anon_vma(prev, curr, &anon_dup);
                }
@@@ -1218,7 -1223,7 +1223,7 @@@ unsigned long do_mmap(struct file *file
         * Does the application expect PROT_READ to imply PROT_EXEC?
         *
         * (the exception is when the underlying filesystem is noexec
-        *  mounted, in which case we dont add PROT_EXEC.)
+        *  mounted, in which case we don't add PROT_EXEC.)
         */
        if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
                if (!(file && path_noexec(&file->f_path)))
@@@ -1944,9 -1949,9 +1949,9 @@@ static int acct_stack_growth(struct vm_
        return 0;
  }
  
 -#if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
 +#if defined(CONFIG_STACK_GROWSUP)
  /*
 - * PA-RISC uses this for its stack; IA64 for its Register Backing Store.
 + * PA-RISC uses this for its stack.
   * vma is the last one with address > vma->vm_end.  Have to extend vma.
   */
  static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
        validate_mm(mm);
        return error;
  }
 -#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
 +#endif /* CONFIG_STACK_GROWSUP */
  
  /*
   * vma is the first one with address < vma->vm_start.  Have to extend vma.
@@@ -2179,8 -2184,6 +2184,6 @@@ struct vm_area_struct *find_extend_vma_
  #else
  int expand_stack_locked(struct vm_area_struct *vma, unsigned long address)
  {
-       if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
-               return -EINVAL;
        return expand_downwards(vma, address);
  }
  
@@@ -2343,8 -2346,8 +2346,8 @@@ static void unmap_region(struct mm_stru
   * has already been checked or doesn't make sense to fail.
   * VMA Iterator will point to the end VMA.
   */
- int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
-               unsigned long addr, int new_below)
+ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
+                      unsigned long addr, int new_below)
  {
        struct vma_prepare vp;
        struct vm_area_struct *new;
@@@ -2425,8 -2428,8 +2428,8 @@@ out_free_vma
   * Split a vma into two pieces at address 'addr', a new vma is allocated
   * either for the first part or the tail.
   */
- int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
-             unsigned long addr, int new_below)
+ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
+                    unsigned long addr, int new_below)
  {
        if (vma->vm_mm->map_count >= sysctl_max_map_count)
                return -ENOMEM;
        return __split_vma(vmi, vma, addr, new_below);
  }
  
+ /*
+  * We are about to modify one or multiple of a VMA's flags, policy, userfaultfd
+  * context and anonymous VMA name within the range [start, end).
+  *
+  * As a result, we might be able to merge the newly modified VMA range with an
+  * adjacent VMA with identical properties.
+  *
+  * If no merge is possible and the range does not span the entirety of the VMA,
+  * we then need to split the VMA to accommodate the change.
+  *
+  * The function returns either the merged VMA, the original VMA if a split was
+  * required instead, or an error if the split failed.
+  */
+ struct vm_area_struct *vma_modify(struct vma_iterator *vmi,
+                                 struct vm_area_struct *prev,
+                                 struct vm_area_struct *vma,
+                                 unsigned long start, unsigned long end,
+                                 unsigned long vm_flags,
+                                 struct mempolicy *policy,
+                                 struct vm_userfaultfd_ctx uffd_ctx,
+                                 struct anon_vma_name *anon_name)
+ {
+       pgoff_t pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+       struct vm_area_struct *merged;
+       merged = vma_merge(vmi, vma->vm_mm, prev, start, end, vm_flags,
+                          vma->anon_vma, vma->vm_file, pgoff, policy,
+                          uffd_ctx, anon_name);
+       if (merged)
+               return merged;
+       if (vma->vm_start < start) {
+               int err = split_vma(vmi, vma, start, 1);
+               if (err)
+                       return ERR_PTR(err);
+       }
+       if (vma->vm_end > end) {
+               int err = split_vma(vmi, vma, end, 0);
+               if (err)
+                       return ERR_PTR(err);
+       }
+       return vma;
+ }
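
vma_modify() above centralizes the merge-then-split pattern that callers changing a VMA's flags, policy, userfaultfd context or anon name used to open-code: try to merge the modified range with its neighbours first, and only if that fails split the containing VMA at whichever edges the change does not cover. A toy model of the control flow on plain address ranges, illustrative only and not kernel code (the range type and try_merge() are stand-ins, and error handling is omitted):

#include <stdbool.h>
#include <stdio.h>

struct range { unsigned long start, end; };

/* pretend merge: succeeds only when the change covers the whole range */
static bool try_merge(const struct range *r,
                      unsigned long start, unsigned long end)
{
        return start <= r->start && end >= r->end;
}

static int modify_range(struct range *r, unsigned long start, unsigned long end)
{
        if (try_merge(r, start, end)) {
                printf("merged: nothing to split\n");
                return 0;
        }
        if (r->start < start)
                printf("split at %#lx, head keeps old attributes\n", start);
        if (r->end > end)
                printf("split at %#lx, tail keeps old attributes\n", end);
        printf("apply change to [%#lx, %#lx)\n", start, end);
        return 0;
}

int main(void)
{
        struct range vma = { 0x1000, 0x9000 };

        return modify_range(&vma, 0x3000, 0x5000);
}
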
+ /*
+  * Attempt to merge a newly mapped VMA with those adjacent to it. The caller
+  * must ensure that [start, end) does not overlap any existing VMA.
+  */
+ static struct vm_area_struct
+ *vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev,
+                  struct vm_area_struct *vma, unsigned long start,
+                  unsigned long end, pgoff_t pgoff)
+ {
+       return vma_merge(vmi, vma->vm_mm, prev, start, end, vma->vm_flags,
+                        vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+                        vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+ }
+ /*
+  * Expand vma by delta bytes, potentially merging with an immediately adjacent
+  * VMA with identical properties.
+  */
+ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
+                                       struct vm_area_struct *vma,
+                                       unsigned long delta)
+ {
+       pgoff_t pgoff = vma->vm_pgoff + vma_pages(vma);
+       /* vma is specified as prev, so case 1 or 2 will apply. */
+       return vma_merge(vmi, vma->vm_mm, vma, vma->vm_end, vma->vm_end + delta,
+                        vma->vm_flags, vma->anon_vma, vma->vm_file, pgoff,
+                        vma_policy(vma), vma->vm_userfaultfd_ctx,
+                        anon_vma_name(vma));
+ }
  /*
   * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
   * @vmi: The vma iterator
@@@ -2670,6 -2752,7 +2752,7 @@@ unsigned long mmap_region(struct file *
        unsigned long charged = 0;
        unsigned long end = addr + len;
        unsigned long merge_start = addr, merge_end = end;
+       bool writable_file_mapping = false;
        pgoff_t vm_pgoff;
        int error;
        VMA_ITERATOR(vmi, mm, addr);
@@@ -2764,17 -2847,19 +2847,19 @@@ cannot_expand
        vma->vm_pgoff = pgoff;
  
        if (file) {
-               if (vm_flags & VM_SHARED) {
-                       error = mapping_map_writable(file->f_mapping);
-                       if (error)
-                               goto free_vma;
-               }
                vma->vm_file = get_file(file);
                error = call_mmap(file, vma);
                if (error)
                        goto unmap_and_free_vma;
  
+               if (vma_is_shared_maywrite(vma)) {
+                       error = mapping_map_writable(file->f_mapping);
+                       if (error)
+                               goto close_and_free_vma;
+                       writable_file_mapping = true;
+               }
                /*
                 * Expansion is handled above, merging is handled below.
                 * Drivers should not alter the address of the VMA.
                 * vma again as we may succeed this time.
                 */
                if (unlikely(vm_flags != vma->vm_flags && prev)) {
-                       merge = vma_merge(&vmi, mm, prev, vma->vm_start,
-                                   vma->vm_end, vma->vm_flags, NULL,
-                                   vma->vm_file, vma->vm_pgoff, NULL,
-                                   NULL_VM_UFFD_CTX, NULL);
+                       merge = vma_merge_new_vma(&vmi, prev, vma,
+                                                 vma->vm_start, vma->vm_end,
+                                                 vma->vm_pgoff);
                        if (merge) {
                                /*
                                 * ->mmap() can change vma->vm_file and fput
        mm->map_count++;
        if (vma->vm_file) {
                i_mmap_lock_write(vma->vm_file->f_mapping);
-               if (vma->vm_flags & VM_SHARED)
+               if (vma_is_shared_maywrite(vma))
                        mapping_allow_writable(vma->vm_file->f_mapping);
  
                flush_dcache_mmap_lock(vma->vm_file->f_mapping);
  
        /* Once vma denies write, undo our temporary denial count */
  unmap_writable:
-       if (file && vm_flags & VM_SHARED)
+       if (writable_file_mapping)
                mapping_unmap_writable(file->f_mapping);
        file = vma->vm_file;
        ksm_add_vma(vma);
@@@ -2904,7 -2988,7 +2988,7 @@@ unmap_and_free_vma
                unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
                             vma->vm_end, vma->vm_end, true);
        }
-       if (file && (vm_flags & VM_SHARED))
+       if (writable_file_mapping)
                mapping_unmap_writable(file->f_mapping);
  free_vma:
        vm_area_free(vma);
@@@ -3194,6 -3278,12 +3278,6 @@@ limits_failed
  }
  EXPORT_SYMBOL(vm_brk_flags);
  
 -int vm_brk(unsigned long addr, unsigned long len)
 -{
 -      return vm_brk_flags(addr, len, 0);
 -}
 -EXPORT_SYMBOL(vm_brk);
 -
  /* Release all mmaps. */
  void exit_mmap(struct mm_struct *mm)
  {
@@@ -3292,7 -3382,8 +3376,8 @@@ int insert_vm_struct(struct mm_struct *
        }
  
        if (vma_link(mm, vma)) {
-               vm_unacct_memory(charged);
+               if (vma->vm_flags & VM_ACCOUNT)
+                       vm_unacct_memory(charged);
                return -ENOMEM;
        }
  
@@@ -3327,9 -3418,7 +3412,7 @@@ struct vm_area_struct *copy_vma(struct 
        if (new_vma && new_vma->vm_start < addr + len)
                return NULL;    /* should never get here */
  
-       new_vma = vma_merge(&vmi, mm, prev, addr, addr + len, vma->vm_flags,
-                           vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-                           vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+       new_vma = vma_merge_new_vma(&vmi, prev, vma, addr, addr + len, pgoff);
        if (new_vma) {
                /*
                 * Source vma may have been merged into new_vma
diff --combined mm/nommu.c
index 23c43c208f2b3257fc260f4e9afa6c4780dc2295,fc4afe924ad5ff58f50a540f91a1a79d999c2d53..b6dc558d31440831e51ff12ec68e2f2f69df1633
@@@ -1305,8 -1305,8 +1305,8 @@@ SYSCALL_DEFINE1(old_mmap, struct mmap_a
   * split a vma into two pieces at address 'addr', a new vma is allocated either
   * for the first part or the tail.
   */
- int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
-             unsigned long addr, int new_below)
+ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
+                    unsigned long addr, int new_below)
  {
        struct vm_area_struct *new;
        struct vm_region *region;
@@@ -1531,6 -1531,11 +1531,6 @@@ void exit_mmap(struct mm_struct *mm
        mmap_write_unlock(mm);
  }
  
 -int vm_brk(unsigned long addr, unsigned long len)
 -{
 -      return -ENOMEM;
 -}
 -
  /*
   * expand (or shrink) an existing mapping, potentially moving it at the same
   * time (controlled by the MREMAP_MAYMOVE flag and available VM space)
@@@ -1646,8 -1651,8 +1646,8 @@@ vm_fault_t filemap_map_pages(struct vm_
  }
  EXPORT_SYMBOL(filemap_map_pages);
  
- int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
-                      int len, unsigned int gup_flags)
+ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
+                             void *buf, int len, unsigned int gup_flags)
  {
        struct vm_area_struct *vma;
        int write = gup_flags & FOLL_WRITE;
diff --combined mm/percpu.c
index 60ed078e4cd06ae59e90f4a29f04dc4008d51234,f53ba692d67a6deb8a3f2bed750fef71e4af2e8b..7b97d31df76766f830c1da97427d54d81cea319b
@@@ -1628,14 -1628,12 +1628,12 @@@ static bool pcpu_memcg_pre_alloc_hook(s
        if (!memcg_kmem_online() || !(gfp & __GFP_ACCOUNT))
                return true;
  
-       objcg = get_obj_cgroup_from_current();
+       objcg = current_obj_cgroup();
        if (!objcg)
                return true;
  
-       if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size))) {
-               obj_cgroup_put(objcg);
+       if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
                return false;
-       }
  
        *objcgp = objcg;
        return true;
@@@ -1649,6 -1647,7 +1647,7 @@@ static void pcpu_memcg_post_alloc_hook(
                return;
  
        if (likely(chunk && chunk->obj_cgroups)) {
+               obj_cgroup_get(objcg);
                chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
  
                rcu_read_lock();
                rcu_read_unlock();
        } else {
                obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
-               obj_cgroup_put(objcg);
        }
  }
  
@@@ -2244,37 -2242,6 +2242,37 @@@ static void pcpu_balance_workfn(struct 
        mutex_unlock(&pcpu_alloc_mutex);
  }
  
 +/**
 + * pcpu_alloc_size - the size of the dynamic percpu area
 + * @ptr: pointer to the dynamic percpu area
 + *
 + * Returns the size of the @ptr allocation.  This is undefined for statically
 + * defined percpu variables as there is no corresponding chunk->bound_map.
 + *
 + * RETURNS:
 + * The size of the dynamic percpu area.
 + *
 + * CONTEXT:
 + * Can be called from atomic context.
 + */
 +size_t pcpu_alloc_size(void __percpu *ptr)
 +{
 +      struct pcpu_chunk *chunk;
 +      unsigned long bit_off, end;
 +      void *addr;
 +
 +      if (!ptr)
 +              return 0;
 +
 +      addr = __pcpu_ptr_to_addr(ptr);
 +      /* No pcpu_lock here: ptr has not been freed, so chunk is still alive */
 +      chunk = pcpu_chunk_addr_search(addr);
 +      bit_off = (addr - chunk->base_addr) / PCPU_MIN_ALLOC_SIZE;
 +      end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
 +                          bit_off + 1);
 +      return (end - bit_off) * PCPU_MIN_ALLOC_SIZE;
 +}
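
pcpu_alloc_size() above reads the size of a dynamic allocation straight out of chunk->bound_map: every allocation has its first unit's bit set, so the size is the distance from the allocation's start bit to the next set bit, in PCPU_MIN_ALLOC_SIZE units. A small userspace model of that walk, illustrative only and not kernel code (the bitmap, the constants and find_next_set() are made up):

#include <stdio.h>

#define MIN_ALLOC 4             /* pretend the minimum allocation unit is 4 bytes */
#define MAP_BITS  32

static int find_next_set(unsigned long map, int from)
{
        for (int i = from; i < MAP_BITS; i++)
                if (map & (1UL << i))
                        return i;
        return MAP_BITS;
}

static unsigned long alloc_size(unsigned long bound_map, int bit_off)
{
        int end = find_next_set(bound_map, bit_off + 1);

        return (unsigned long)(end - bit_off) * MIN_ALLOC;
}

int main(void)
{
        /* allocations start at bits 0, 3 and 8 */
        unsigned long bound_map = (1UL << 0) | (1UL << 3) | (1UL << 8);

        printf("allocation at bit 3 is %lu bytes\n", alloc_size(bound_map, 3));
        return 0;
}
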
 +
  /**
   * free_percpu - free percpu area
   * @ptr: pointer to area to free
@@@ -2298,10 -2265,12 +2296,10 @@@ void free_percpu(void __percpu *ptr
        kmemleak_free_percpu(ptr);
  
        addr = __pcpu_ptr_to_addr(ptr);
 -
 -      spin_lock_irqsave(&pcpu_lock, flags);
 -
        chunk = pcpu_chunk_addr_search(addr);
        off = addr - chunk->base_addr;
  
 +      spin_lock_irqsave(&pcpu_lock, flags);
        size = pcpu_free_area(chunk, off);
  
        pcpu_memcg_free_hook(chunk, off, size);
diff --combined mm/shmem.c
index 6b102965d355f6865693c7ea91a5f6386970388b,a314a25aea8cceea80b920eebd1d89734941f333..71b8d957b63bec8384feb8c369289afcd95d65b4
@@@ -146,9 -146,8 +146,8 @@@ static unsigned long shmem_default_max_
  #endif
  
  static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
-                            struct folio **foliop, enum sgp_type sgp,
-                            gfp_t gfp, struct vm_area_struct *vma,
-                            vm_fault_t *fault_type);
+                       struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
+                       struct mm_struct *fault_mm, vm_fault_t *fault_type);
  
  static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
  {
@@@ -189,10 -188,10 +188,10 @@@ static inline int shmem_reacct_size(uns
  /*
   * ... whereas tmpfs objects are accounted incrementally as
   * pages are allocated, in order to allow large sparse files.
-  * shmem_get_folio reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
+  * shmem_get_folio reports shmem_acct_blocks failure as -ENOSPC not -ENOMEM,
   * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
   */
- static inline int shmem_acct_block(unsigned long flags, long pages)
+ static inline int shmem_acct_blocks(unsigned long flags, long pages)
  {
        if (!(flags & VM_NORESERVE))
                return 0;
@@@ -207,26 -206,26 +206,26 @@@ static inline void shmem_unacct_blocks(
                vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
  }
  
- static int shmem_inode_acct_block(struct inode *inode, long pages)
+ static int shmem_inode_acct_blocks(struct inode *inode, long pages)
  {
        struct shmem_inode_info *info = SHMEM_I(inode);
        struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
        int err = -ENOSPC;
  
-       if (shmem_acct_block(info->flags, pages))
+       if (shmem_acct_blocks(info->flags, pages))
                return err;
  
        might_sleep();  /* when quotas */
        if (sbinfo->max_blocks) {
-               if (percpu_counter_compare(&sbinfo->used_blocks,
-                                          sbinfo->max_blocks - pages) > 0)
+               if (!percpu_counter_limited_add(&sbinfo->used_blocks,
+                                               sbinfo->max_blocks, pages))
                        goto unacct;
  
                err = dquot_alloc_block_nodirty(inode, pages);
-               if (err)
+               if (err) {
+                       percpu_counter_sub(&sbinfo->used_blocks, pages);
                        goto unacct;
-               percpu_counter_add(&sbinfo->used_blocks, pages);
+               }
        } else {
                err = dquot_alloc_block_nodirty(inode, pages);
                if (err)
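
The hunk above switches shmem block accounting to percpu_counter_limited_add(), which adds the pages only if the result stays within max_blocks, in one operation, instead of the old compare-then-add pair (and percpu_counter_sub() now undoes the add if the quota step fails). A plain mutex-protected stand-in showing the intended semantics, not the kernel's per-CPU implementation (build with -pthread):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct limited_counter {
        pthread_mutex_t lock;
        long count;
};

/* add amount only if the result stays within limit; report whether it did */
static bool limited_add(struct limited_counter *c, long limit, long amount)
{
        bool ok;

        pthread_mutex_lock(&c->lock);
        ok = c->count + amount <= limit;
        if (ok)
                c->count += amount;
        pthread_mutex_unlock(&c->lock);
        return ok;
}

int main(void)
{
        struct limited_counter used = { PTHREAD_MUTEX_INITIALIZER, 95 };

        printf("add 4: %s\n", limited_add(&used, 100, 4) ? "ok" : "ENOSPC");
        printf("add 4: %s\n", limited_add(&used, 100, 4) ? "ok" : "ENOSPC");
        return 0;
}
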
@@@ -447,7 -446,7 +446,7 @@@ bool shmem_charge(struct inode *inode, 
  {
        struct address_space *mapping = inode->i_mapping;
  
-       if (shmem_inode_acct_block(inode, pages))
+       if (shmem_inode_acct_blocks(inode, pages))
                return false;
  
        /* nrpages adjustment first, then shmem_recalc_inode() when balanced */
@@@ -756,16 -755,14 +755,14 @@@ static unsigned long shmem_unused_huge_
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
  
  /*
-  * Like filemap_add_folio, but error if expected item has gone.
+  * Somewhat like filemap_add_folio, but error if expected item has gone.
   */
  static int shmem_add_to_page_cache(struct folio *folio,
                                   struct address_space *mapping,
-                                  pgoff_t index, void *expected, gfp_t gfp,
-                                  struct mm_struct *charge_mm)
+                                  pgoff_t index, void *expected, gfp_t gfp)
  {
        XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
        long nr = folio_nr_pages(folio);
-       int error;
  
        VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
        folio->mapping = mapping;
        folio->index = index;
  
-       if (!folio_test_swapcache(folio)) {
-               error = mem_cgroup_charge(folio, charge_mm, gfp);
-               if (error) {
-                       if (folio_test_pmd_mappable(folio)) {
-                               count_vm_event(THP_FILE_FALLBACK);
-                               count_vm_event(THP_FILE_FALLBACK_CHARGE);
-                       }
-                       goto error;
-               }
-       }
+       gfp &= GFP_RECLAIM_MASK;
        folio_throttle_swaprate(folio, gfp);
  
        do {
                xas_store(&xas, folio);
                if (xas_error(&xas))
                        goto unlock;
-               if (folio_test_pmd_mappable(folio)) {
-                       count_vm_event(THP_FILE_ALLOC);
+               if (folio_test_pmd_mappable(folio))
                        __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr);
-               }
-               mapping->nrpages += nr;
                __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
                __lruvec_stat_mod_folio(folio, NR_SHMEM, nr);
+               mapping->nrpages += nr;
  unlock:
                xas_unlock_irq(&xas);
        } while (xas_nomem(&xas, gfp));
  
        if (xas_error(&xas)) {
-               error = xas_error(&xas);
-               goto error;
+               folio->mapping = NULL;
+               folio_ref_sub(folio, nr);
+               return xas_error(&xas);
        }
  
        return 0;
- error:
-       folio->mapping = NULL;
-       folio_ref_sub(folio, nr);
-       return error;
  }
  
  /*
-  * Like delete_from_page_cache, but substitutes swap for @folio.
+  * Somewhat like filemap_remove_folio, but substitutes swap for @folio.
   */
  static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
  {
@@@ -887,7 -870,6 +870,6 @@@ unsigned long shmem_partial_swap_usage(
                        cond_resched_rcu();
                }
        }
        rcu_read_unlock();
  
        return swapped << PAGE_SHIFT;
@@@ -1112,7 -1094,7 +1094,7 @@@ whole_folios
  void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
  {
        shmem_undo_range(inode, lstart, lend, false);
 -      inode->i_mtime = inode_set_ctime_current(inode);
 +      inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
        inode_inc_iversion(inode);
  }
  EXPORT_SYMBOL_GPL(shmem_truncate_range);
@@@ -1213,7 -1195,6 +1195,6 @@@ static int shmem_setattr(struct mnt_idm
        if (i_uid_needs_update(idmap, attr, inode) ||
            i_gid_needs_update(idmap, attr, inode)) {
                error = dquot_transfer(idmap, inode, attr);
                if (error)
                        return error;
        }
        if (!error && update_ctime) {
                inode_set_ctime_current(inode);
                if (update_mtime)
 -                      inode->i_mtime = inode_get_ctime(inode);
 +                      inode_set_mtime_to_ts(inode, inode_get_ctime(inode));
                inode_inc_iversion(inode);
        }
        return error;
@@@ -1326,10 -1307,8 +1307,8 @@@ static int shmem_unuse_swap_entries(str
  
                if (!xa_is_value(folio))
                        continue;
-               error = shmem_swapin_folio(inode, indices[i],
-                                         &folio, SGP_CACHE,
-                                         mapping_gfp_mask(mapping),
-                                         NULL, NULL);
+               error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE,
+                                       mapping_gfp_mask(mapping), NULL, NULL);
                if (error == 0) {
                        folio_unlock(folio);
                        folio_put(folio);
@@@ -1565,38 -1544,20 +1544,20 @@@ static inline struct mempolicy *shmem_g
        return NULL;
  }
  #endif /* CONFIG_NUMA && CONFIG_TMPFS */
- #ifndef CONFIG_NUMA
- #define vm_policy vm_private_data
- #endif
  
- static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
-               struct shmem_inode_info *info, pgoff_t index)
- {
-       /* Create a pseudo vma that just contains the policy */
-       vma_init(vma, NULL);
-       /* Bias interleave by inode number to distribute better across nodes */
-       vma->vm_pgoff = index + info->vfs_inode.i_ino;
-       vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
- }
- static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
- {
-       /* Drop reference taken by mpol_shared_policy_lookup() */
-       mpol_cond_put(vma->vm_policy);
- }
+ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
+                       pgoff_t index, unsigned int order, pgoff_t *ilx);
  
- static struct folio *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
                        struct shmem_inode_info *info, pgoff_t index)
  {
-       struct vm_area_struct pvma;
+       struct mempolicy *mpol;
+       pgoff_t ilx;
        struct page *page;
-       struct vm_fault vmf = {
-               .vma = &pvma,
-       };
  
-       shmem_pseudo_vma_init(&pvma, info, index);
-       page = swap_cluster_readahead(swap, gfp, &vmf);
-       shmem_pseudo_vma_destroy(&pvma);
+       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+       page = swap_cluster_readahead(swap, gfp, mpol, ilx);
+       mpol_cond_put(mpol);
  
        if (!page)
                return NULL;
@@@ -1630,67 -1591,126 +1591,126 @@@ static gfp_t limit_gfp_mask(gfp_t huge_
  static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
                struct shmem_inode_info *info, pgoff_t index)
  {
-       struct vm_area_struct pvma;
-       struct address_space *mapping = info->vfs_inode.i_mapping;
-       pgoff_t hindex;
-       struct folio *folio;
+       struct mempolicy *mpol;
+       pgoff_t ilx;
+       struct page *page;
  
-       hindex = round_down(index, HPAGE_PMD_NR);
-       if (xa_find(&mapping->i_pages, &hindex, hindex + HPAGE_PMD_NR - 1,
-                                                               XA_PRESENT))
-               return NULL;
+       mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx);
+       page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id());
+       mpol_cond_put(mpol);
  
-       shmem_pseudo_vma_init(&pvma, info, hindex);
-       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, &pvma, 0, true);
-       shmem_pseudo_vma_destroy(&pvma);
-       if (!folio)
-               count_vm_event(THP_FILE_FALLBACK);
-       return folio;
+       return page_rmappable_folio(page);
  }
  
  static struct folio *shmem_alloc_folio(gfp_t gfp,
-                       struct shmem_inode_info *info, pgoff_t index)
+               struct shmem_inode_info *info, pgoff_t index)
  {
-       struct vm_area_struct pvma;
-       struct folio *folio;
+       struct mempolicy *mpol;
+       pgoff_t ilx;
+       struct page *page;
  
-       shmem_pseudo_vma_init(&pvma, info, index);
-       folio = vma_alloc_folio(gfp, 0, &pvma, 0, false);
-       shmem_pseudo_vma_destroy(&pvma);
+       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+       page = alloc_pages_mpol(gfp, 0, mpol, ilx, numa_node_id());
+       mpol_cond_put(mpol);
  
-       return folio;
+       return (struct folio *)page;
  }
  
- static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode,
-               pgoff_t index, bool huge)
+ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
+               struct inode *inode, pgoff_t index,
+               struct mm_struct *fault_mm, bool huge)
  {
+       struct address_space *mapping = inode->i_mapping;
        struct shmem_inode_info *info = SHMEM_I(inode);
        struct folio *folio;
-       int nr;
-       int err;
+       long pages;
+       int error;
  
        if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
                huge = false;
-       nr = huge ? HPAGE_PMD_NR : 1;
  
-       err = shmem_inode_acct_block(inode, nr);
-       if (err)
-               goto failed;
+       if (huge) {
+               pages = HPAGE_PMD_NR;
+               index = round_down(index, HPAGE_PMD_NR);
+               /*
+                * Check for conflict before waiting on a huge allocation.
+                * Conflict might be that a huge page has just been allocated
+                * and added to page cache by a racing thread, or that there
+                * is already at least one small page in the huge extent.
+                * Be careful to retry when appropriate, but not forever!
+                * Elsewhere -EEXIST would be the right code, but not here.
+                */
+               if (xa_find(&mapping->i_pages, &index,
+                               index + HPAGE_PMD_NR - 1, XA_PRESENT))
+                       return ERR_PTR(-E2BIG);
  
-       if (huge)
                folio = shmem_alloc_hugefolio(gfp, info, index);
-       else
+               if (!folio)
+                       count_vm_event(THP_FILE_FALLBACK);
+       } else {
+               pages = 1;
                folio = shmem_alloc_folio(gfp, info, index);
-       if (folio) {
-               __folio_set_locked(folio);
-               __folio_set_swapbacked(folio);
-               return folio;
        }
+       if (!folio)
+               return ERR_PTR(-ENOMEM);
  
-       err = -ENOMEM;
-       shmem_inode_unacct_blocks(inode, nr);
- failed:
-       return ERR_PTR(err);
+       __folio_set_locked(folio);
+       __folio_set_swapbacked(folio);
+       gfp &= GFP_RECLAIM_MASK;
+       error = mem_cgroup_charge(folio, fault_mm, gfp);
+       if (error) {
+               if (xa_find(&mapping->i_pages, &index,
+                               index + pages - 1, XA_PRESENT)) {
+                       error = -EEXIST;
+               } else if (huge) {
+                       count_vm_event(THP_FILE_FALLBACK);
+                       count_vm_event(THP_FILE_FALLBACK_CHARGE);
+               }
+               goto unlock;
+       }
+       error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
+       if (error)
+               goto unlock;
+       error = shmem_inode_acct_blocks(inode, pages);
+       if (error) {
+               struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+               long freed;
+               /*
+                * Try to reclaim some space by splitting a few
+                * large folios beyond i_size on the filesystem.
+                */
+               shmem_unused_huge_shrink(sbinfo, NULL, 2);
+               /*
+                * And do a shmem_recalc_inode() to account for freed pages:
+                * except our folio is there in cache, so not quite balanced.
+                */
+               spin_lock(&info->lock);
+               freed = pages + info->alloced - info->swapped -
+                       READ_ONCE(mapping->nrpages);
+               if (freed > 0)
+                       info->alloced -= freed;
+               spin_unlock(&info->lock);
+               if (freed > 0)
+                       shmem_inode_unacct_blocks(inode, freed);
+               error = shmem_inode_acct_blocks(inode, pages);
+               if (error) {
+                       filemap_remove_folio(folio);
+                       goto unlock;
+               }
+       }
+       shmem_recalc_inode(inode, pages, 0);
+       folio_add_lru(folio);
+       return folio;
+ unlock:
+       folio_unlock(folio);
+       folio_put(folio);
+       return ERR_PTR(error);
  }
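
/*
 * Editor's note: illustrative sketch, not part of the patch above.
 * Both conflict checks in shmem_alloc_and_add_folio() use xa_find(),
 * which returns the first entry present in [*indexp, max] and advances
 * *indexp to it.  A minimal helper showing that idiom (the helper name
 * is hypothetical, not from the patch):
 */
static bool shmem_range_is_occupied(struct address_space *mapping,
                                    pgoff_t start, unsigned long nr)
{
        pgoff_t index = start;

        /* Non-NULL means a page or swap entry already sits in the range */
        return xa_find(&mapping->i_pages, &index,
                       start + nr - 1, XA_PRESENT) != NULL;
}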
  
  /*
@@@ -1812,12 -1832,11 +1832,11 @@@ static void shmem_set_folio_swapin_erro
   */
  static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                             struct folio **foliop, enum sgp_type sgp,
-                            gfp_t gfp, struct vm_area_struct *vma,
+                            gfp_t gfp, struct mm_struct *fault_mm,
                             vm_fault_t *fault_type)
  {
        struct address_space *mapping = inode->i_mapping;
        struct shmem_inode_info *info = SHMEM_I(inode);
-       struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
        struct swap_info_struct *si;
        struct folio *folio = NULL;
        swp_entry_t swap;
                if (fault_type) {
                        *fault_type |= VM_FAULT_MAJOR;
                        count_vm_event(PGMAJFAULT);
-                       count_memcg_event_mm(charge_mm, PGMAJFAULT);
+                       count_memcg_event_mm(fault_mm, PGMAJFAULT);
                }
                /* Here we actually start the io */
-               folio = shmem_swapin(swap, gfp, info, index);
+               folio = shmem_swapin_cluster(swap, gfp, info, index);
                if (!folio) {
                        error = -ENOMEM;
                        goto failed;
        }
  
        error = shmem_add_to_page_cache(folio, mapping, index,
-                                       swp_to_radix_entry(swap), gfp,
-                                       charge_mm);
+                                       swp_to_radix_entry(swap), gfp);
        if (error)
                goto failed;
  
@@@ -1921,37 -1939,29 +1939,29 @@@ unlock
   * vm. If we swap it in we mark it dirty since we also free the swap
   * entry since a page cannot live in both the swap and page cache.
   *
-  * vma, vmf, and fault_type are only supplied by shmem_fault:
-  * otherwise they are NULL.
+  * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL.
   */
  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
                struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
-               struct vm_area_struct *vma, struct vm_fault *vmf,
-               vm_fault_t *fault_type)
+               struct vm_fault *vmf, vm_fault_t *fault_type)
  {
-       struct address_space *mapping = inode->i_mapping;
-       struct shmem_inode_info *info = SHMEM_I(inode);
-       struct shmem_sb_info *sbinfo;
-       struct mm_struct *charge_mm;
+       struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+       struct mm_struct *fault_mm;
        struct folio *folio;
-       pgoff_t hindex;
-       gfp_t huge_gfp;
        int error;
-       int once = 0;
-       int alloced = 0;
+       bool alloced;
  
        if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
                return -EFBIG;
  repeat:
        if (sgp <= SGP_CACHE &&
-           ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
+           ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode))
                return -EINVAL;
-       }
  
-       sbinfo = SHMEM_SB(inode->i_sb);
-       charge_mm = vma ? vma->vm_mm : NULL;
+       alloced = false;
+       fault_mm = vma ? vma->vm_mm : NULL;
  
-       folio = filemap_get_entry(mapping, index);
+       folio = filemap_get_entry(inode->i_mapping, index);
        if (folio && vma && userfaultfd_minor(vma)) {
                if (!xa_is_value(folio))
                        folio_put(folio);
  
        if (xa_is_value(folio)) {
                error = shmem_swapin_folio(inode, index, &folio,
-                                         sgp, gfp, vma, fault_type);
+                                          sgp, gfp, fault_mm, fault_type);
                if (error == -EEXIST)
                        goto repeat;
  
                folio_lock(folio);
  
                /* Has the folio been truncated or swapped out? */
-               if (unlikely(folio->mapping != mapping)) {
+               if (unlikely(folio->mapping != inode->i_mapping)) {
                        folio_unlock(folio);
                        folio_put(folio);
                        goto repeat;
                return 0;
        }
  
-       if (!shmem_is_huge(inode, index, false,
-                          vma ? vma->vm_mm : NULL, vma ? vma->vm_flags : 0))
-               goto alloc_nohuge;
+       if (shmem_is_huge(inode, index, false, fault_mm,
+                         vma ? vma->vm_flags : 0)) {
+               gfp_t huge_gfp;
  
-       huge_gfp = vma_thp_gfp_mask(vma);
-       huge_gfp = limit_gfp_mask(huge_gfp, gfp);
-       folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true);
-       if (IS_ERR(folio)) {
- alloc_nohuge:
-               folio = shmem_alloc_and_acct_folio(gfp, inode, index, false);
+               huge_gfp = vma_thp_gfp_mask(vma);
+               huge_gfp = limit_gfp_mask(huge_gfp, gfp);
+               folio = shmem_alloc_and_add_folio(huge_gfp,
+                               inode, index, fault_mm, true);
+               if (!IS_ERR(folio)) {
+                       count_vm_event(THP_FILE_ALLOC);
+                       goto alloced;
+               }
+               if (PTR_ERR(folio) == -EEXIST)
+                       goto repeat;
        }
-       if (IS_ERR(folio)) {
-               int retry = 5;
  
+       folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+       if (IS_ERR(folio)) {
                error = PTR_ERR(folio);
+               if (error == -EEXIST)
+                       goto repeat;
                folio = NULL;
-               if (error != -ENOSPC)
-                       goto unlock;
-               /*
-                * Try to reclaim some space by splitting a large folio
-                * beyond i_size on the filesystem.
-                */
-               while (retry--) {
-                       int ret;
-                       ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
-                       if (ret == SHRINK_STOP)
-                               break;
-                       if (ret)
-                               goto alloc_nohuge;
-               }
                goto unlock;
        }
  
-       hindex = round_down(index, folio_nr_pages(folio));
-       if (sgp == SGP_WRITE)
-               __folio_set_referenced(folio);
-       error = shmem_add_to_page_cache(folio, mapping, hindex,
-                                       NULL, gfp & GFP_RECLAIM_MASK,
-                                       charge_mm);
-       if (error)
-               goto unacct;
-       folio_add_lru(folio);
-       shmem_recalc_inode(inode, folio_nr_pages(folio), 0);
+ alloced:
        alloced = true;
        if (folio_test_pmd_mappable(folio) &&
            DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
                                        folio_next_index(folio) - 1) {
+               struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+               struct shmem_inode_info *info = SHMEM_I(inode);
                /*
                 * Part of the large folio is beyond i_size: subject
                 * to shrink under memory pressure.
                spin_unlock(&sbinfo->shrinklist_lock);
        }
  
+       if (sgp == SGP_WRITE)
+               folio_set_referenced(folio);
        /*
         * Let SGP_FALLOC use the SGP_WRITE optimization on a new folio.
         */
@@@ -2100,11 -2092,6 +2092,6 @@@ clear
        /* Perhaps the file has been truncated since we checked */
        if (sgp <= SGP_CACHE &&
            ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
-               if (alloced) {
-                       folio_clear_dirty(folio);
-                       filemap_remove_folio(folio);
-                       shmem_recalc_inode(inode, 0, 0);
-               }
                error = -EINVAL;
                goto unlock;
        }
        /*
         * Error recovery.
         */
- unacct:
-       shmem_inode_unacct_blocks(inode, folio_nr_pages(folio));
-       if (folio_test_large(folio)) {
-               folio_unlock(folio);
-               folio_put(folio);
-               goto alloc_nohuge;
-       }
  unlock:
+       if (alloced)
+               filemap_remove_folio(folio);
+       shmem_recalc_inode(inode, 0, 0);
        if (folio) {
                folio_unlock(folio);
                folio_put(folio);
        }
-       if (error == -ENOSPC && !once++) {
-               shmem_recalc_inode(inode, 0, 0);
-               goto repeat;
-       }
-       if (error == -EEXIST)
-               goto repeat;
        return error;
  }
  
@@@ -2141,7 -2117,7 +2117,7 @@@ int shmem_get_folio(struct inode *inode
                enum sgp_type sgp)
  {
        return shmem_get_folio_gfp(inode, index, foliop, sgp,
-                       mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
+                       mapping_gfp_mask(inode->i_mapping), NULL, NULL);
  }
  
  /*
   * entry unconditionally - even if something else had already woken the
   * target.
   */
- static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+ static int synchronous_wake_function(wait_queue_entry_t *wait,
+                       unsigned int mode, int sync, void *key)
  {
        int ret = default_wake_function(wait, mode, sync, key);
        list_del_init(&wait->entry);
        return ret;
  }
  
+ /*
+  * Trinity finds that probing a hole which tmpfs is punching can
+  * prevent the hole-punch from ever completing: which in turn
+  * locks writers out with its hold on i_rwsem.  So refrain from
+  * faulting pages into the hole while it's being punched.  Although
+  * shmem_undo_range() does remove the additions, it may be unable to
+  * keep up, as each new page needs its own unmap_mapping_range() call,
+  * and the i_mmap tree grows ever slower to scan if new vmas are added.
+  *
+  * It does not matter if we sometimes reach this check just before the
+  * hole-punch begins, so that one fault then races with the punch:
+  * we just need to make racing faults a rare case.
+  *
+  * The implementation below would be much simpler if we just used a
+  * standard mutex or completion: but we cannot take i_rwsem in fault,
+  * and bloating every shmem inode for this unlikely case would be sad.
+  */
+ static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
+ {
+       struct shmem_falloc *shmem_falloc;
+       struct file *fpin = NULL;
+       vm_fault_t ret = 0;
+       spin_lock(&inode->i_lock);
+       shmem_falloc = inode->i_private;
+       if (shmem_falloc &&
+           shmem_falloc->waitq &&
+           vmf->pgoff >= shmem_falloc->start &&
+           vmf->pgoff < shmem_falloc->next) {
+               wait_queue_head_t *shmem_falloc_waitq;
+               DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
+               ret = VM_FAULT_NOPAGE;
+               fpin = maybe_unlock_mmap_for_io(vmf, NULL);
+               shmem_falloc_waitq = shmem_falloc->waitq;
+               prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
+                               TASK_UNINTERRUPTIBLE);
+               spin_unlock(&inode->i_lock);
+               schedule();
+               /*
+                * shmem_falloc_waitq points into the shmem_fallocate()
+                * stack of the hole-punching task: shmem_falloc_waitq
+                * is usually invalid by the time we reach here, but
+                * finish_wait() does not dereference it in that case;
+                * though i_lock needed lest racing with wake_up_all().
+                */
+               spin_lock(&inode->i_lock);
+               finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
+       }
+       spin_unlock(&inode->i_lock);
+       if (fpin) {
+               fput(fpin);
+               ret = VM_FAULT_RETRY;
+       }
+       return ret;
+ }
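
/*
 * Editor's note: illustrative sketch, not part of the patch above.
 * The other half of this protocol lives in shmem_fallocate()'s hole-punch
 * path (unchanged by this patch): the puncher publishes a stack-local
 * shmem_falloc with a waitqueue through inode->i_private, and wakes all
 * waiters under i_lock once the punch is done.  Roughly, paraphrased
 * rather than quoted from mm/shmem.c:
 */
static void shmem_punch_hole_sketch(struct inode *inode,
                                    pgoff_t start, pgoff_t next)
{
        DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
        struct shmem_falloc shmem_falloc;

        shmem_falloc.waitq = &shmem_falloc_waitq;
        shmem_falloc.start = start;
        shmem_falloc.next  = next;

        spin_lock(&inode->i_lock);
        inode->i_private = &shmem_falloc;       /* visible to shmem_fault() */
        spin_unlock(&inode->i_lock);

        /* ... unmap_mapping_range() + shmem_truncate_range() happen here ... */

        spin_lock(&inode->i_lock);
        inode->i_private = NULL;
        wake_up_all(&shmem_falloc_waitq);       /* releases shmem_falloc_wait() */
        spin_unlock(&inode->i_lock);
}
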
  static vm_fault_t shmem_fault(struct vm_fault *vmf)
  {
-       struct vm_area_struct *vma = vmf->vma;
-       struct inode *inode = file_inode(vma->vm_file);
+       struct inode *inode = file_inode(vmf->vma->vm_file);
        gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
        struct folio *folio = NULL;
+       vm_fault_t ret = 0;
        int err;
-       vm_fault_t ret = VM_FAULT_LOCKED;
  
        /*
         * Trinity finds that probing a hole which tmpfs is punching can
-        * prevent the hole-punch from ever completing: which in turn
-        * locks writers out with its hold on i_rwsem.  So refrain from
-        * faulting pages into the hole while it's being punched.  Although
-        * shmem_undo_range() does remove the additions, it may be unable to
-        * keep up, as each new page needs its own unmap_mapping_range() call,
-        * and the i_mmap tree grows ever slower to scan if new vmas are added.
-        *
-        * It does not matter if we sometimes reach this check just before the
-        * hole-punch begins, so that one fault then races with the punch:
-        * we just need to make racing faults a rare case.
-        *
-        * The implementation below would be much simpler if we just used a
-        * standard mutex or completion: but we cannot take i_rwsem in fault,
-        * and bloating every shmem inode for this unlikely case would be sad.
+        * prevent the hole-punch from ever completing: noted in i_private.
         */
        if (unlikely(inode->i_private)) {
-               struct shmem_falloc *shmem_falloc;
-               spin_lock(&inode->i_lock);
-               shmem_falloc = inode->i_private;
-               if (shmem_falloc &&
-                   shmem_falloc->waitq &&
-                   vmf->pgoff >= shmem_falloc->start &&
-                   vmf->pgoff < shmem_falloc->next) {
-                       struct file *fpin;
-                       wait_queue_head_t *shmem_falloc_waitq;
-                       DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
-                       ret = VM_FAULT_NOPAGE;
-                       fpin = maybe_unlock_mmap_for_io(vmf, NULL);
-                       if (fpin)
-                               ret = VM_FAULT_RETRY;
-                       shmem_falloc_waitq = shmem_falloc->waitq;
-                       prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
-                                       TASK_UNINTERRUPTIBLE);
-                       spin_unlock(&inode->i_lock);
-                       schedule();
-                       /*
-                        * shmem_falloc_waitq points into the shmem_fallocate()
-                        * stack of the hole-punching task: shmem_falloc_waitq
-                        * is usually invalid by the time we reach here, but
-                        * finish_wait() does not dereference it in that case;
-                        * though i_lock needed lest racing with wake_up_all().
-                        */
-                       spin_lock(&inode->i_lock);
-                       finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
-                       spin_unlock(&inode->i_lock);
-                       if (fpin)
-                               fput(fpin);
+               ret = shmem_falloc_wait(vmf, inode);
+               if (ret)
                        return ret;
-               }
-               spin_unlock(&inode->i_lock);
        }
  
+       WARN_ON_ONCE(vmf->page != NULL);
        err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
-                                 gfp, vma, vmf, &ret);
+                                 gfp, vmf, &ret);
        if (err)
                return vmf_error(err);
-       if (folio)
+       if (folio) {
                vmf->page = folio_file_page(folio, vmf->pgoff);
+               ret |= VM_FAULT_LOCKED;
+       }
        return ret;
  }
  
@@@ -2330,15 -2318,41 +2318,41 @@@ static int shmem_set_policy(struct vm_a
  }
  
  static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
-                                         unsigned long addr)
+                                         unsigned long addr, pgoff_t *ilx)
  {
        struct inode *inode = file_inode(vma->vm_file);
        pgoff_t index;
  
+       /*
+        * Bias interleave by inode number to distribute better across nodes;
+        * but this interface is independent of which page order is used, so
+        * supplies only that bias, letting caller apply the offset (adjusted
+        * by page order, as in shmem_get_pgoff_policy() and get_vma_policy()).
+        */
+       *ilx = inode->i_ino;
        index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
        return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
  }
- #endif
+ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
+                       pgoff_t index, unsigned int order, pgoff_t *ilx)
+ {
+       struct mempolicy *mpol;
+       /* Bias interleave by inode number to distribute better across nodes */
+       *ilx = info->vfs_inode.i_ino + (index >> order);
+       mpol = mpol_shared_policy_lookup(&info->policy, index);
+       return mpol ? mpol : get_task_policy(current);
+ }
+ #else
+ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
+                       pgoff_t index, unsigned int order, pgoff_t *ilx)
+ {
+       *ilx = 0;
+       return NULL;
+ }
+ #endif /* CONFIG_NUMA */
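
/*
 * Editor's note: illustrative sketch, not part of the patch above.
 * A caller of shmem_get_pgoff_policy() is expected to hand the returned
 * policy and interleave index to the page allocator, then drop the
 * conditional reference.  alloc_pages_mpol() comes from the mempolicy
 * rework elsewhere in this merge; treat its exact signature as an
 * assumption, and the helper name as hypothetical:
 */
static struct folio *shmem_alloc_by_policy_sketch(gfp_t gfp, unsigned int order,
                struct shmem_inode_info *info, pgoff_t index)
{
        struct mempolicy *mpol;
        pgoff_t ilx;
        struct page *page;

        mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
        page = alloc_pages_mpol(gfp, order, mpol, ilx, numa_node_id());
        mpol_cond_put(mpol);

        return page ? page_folio(page) : NULL;
}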
  
  int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
  {
@@@ -2374,7 -2388,7 +2388,7 @@@ static int shmem_mmap(struct file *file
        struct shmem_inode_info *info = SHMEM_I(inode);
        int ret;
  
-       ret = seal_check_future_write(info->seals, vma);
+       ret = seal_check_write(info->seals, vma);
        if (ret)
                return ret;
  
@@@ -2445,7 -2459,6 +2459,6 @@@ static struct inode *__shmem_get_inode(
        if (err)
                return ERR_PTR(err);
  
        inode = new_inode(sb);
        if (!inode) {
                shmem_free_inode(sb, 0);
        inode->i_ino = ino;
        inode_init_owner(idmap, inode, dir, mode);
        inode->i_blocks = 0;
 -      inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
 +      simple_inode_init_ts(inode);
        inode->i_generation = get_random_u32();
        info = SHMEM_I(inode);
        memset(info, 0, (char *)inode - (char *)info);
        atomic_set(&info->stop_eviction, 0);
        info->seals = F_SEAL_SEAL;
        info->flags = flags & VM_NORESERVE;
 -      info->i_crtime = inode->i_mtime;
 +      info->i_crtime = inode_get_mtime(inode);
        info->fsflags = (dir == NULL) ? 0 :
                SHMEM_I(dir)->fsflags & SHMEM_FL_INHERITED;
        if (info->fsflags)
                shmem_set_inode_flags(inode, info->fsflags);
        INIT_LIST_HEAD(&info->shrinklist);
        INIT_LIST_HEAD(&info->swaplist);
-       INIT_LIST_HEAD(&info->swaplist);
-       if (sbinfo->noswap)
-               mapping_set_unevictable(inode->i_mapping);
        simple_xattrs_init(&info->xattrs);
        cache_no_acl(inode);
+       if (sbinfo->noswap)
+               mapping_set_unevictable(inode->i_mapping);
        mapping_set_large_folios(inode->i_mapping);
  
        switch (mode & S_IFMT) {
@@@ -2565,7 -2577,7 +2577,7 @@@ int shmem_mfill_atomic_pte(pmd_t *dst_p
        int ret;
        pgoff_t max_off;
  
-       if (shmem_inode_acct_block(inode, 1)) {
+       if (shmem_inode_acct_blocks(inode, 1)) {
                /*
                 * We may have got a page, returned -ENOENT triggering a retry,
                 * and now we find ourselves with -ENOMEM. Release the page, to
        if (unlikely(pgoff >= max_off))
                goto out_release;
  
-       ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL,
-                                     gfp & GFP_RECLAIM_MASK, dst_vma->vm_mm);
+       ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
+       if (ret)
+               goto out_release;
+       ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
        if (ret)
                goto out_release;
  
@@@ -2686,7 -2700,6 +2700,6 @@@ shmem_write_begin(struct file *file, st
        }
  
        ret = shmem_get_folio(inode, index, &folio, SGP_WRITE);
        if (ret)
                return ret;
  
@@@ -3218,8 -3231,7 +3231,7 @@@ shmem_mknod(struct mnt_idmap *idmap, st
        error = simple_acl_create(dir, inode);
        if (error)
                goto out_iput;
-       error = security_inode_init_security(inode, dir,
-                                            &dentry->d_name,
+       error = security_inode_init_security(inode, dir, &dentry->d_name,
                                             shmem_initxattrs, NULL);
        if (error && error != -EOPNOTSUPP)
                goto out_iput;
                goto out_iput;
  
        dir->i_size += BOGO_DIRENT_SIZE;
 -      dir->i_mtime = inode_set_ctime_current(dir);
 +      inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
        inode_inc_iversion(dir);
        d_instantiate(dentry, inode);
        dget(dentry); /* Extra count - pin the dentry in core */
@@@ -3248,14 -3260,11 +3260,11 @@@ shmem_tmpfile(struct mnt_idmap *idmap, 
        int error;
  
        inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, 0, VM_NORESERVE);
        if (IS_ERR(inode)) {
                error = PTR_ERR(inode);
                goto err_out;
        }
-       error = security_inode_init_security(inode, dir,
-                                            NULL,
+       error = security_inode_init_security(inode, dir, NULL,
                                             shmem_initxattrs, NULL);
        if (error && error != -EOPNOTSUPP)
                goto out_iput;
@@@ -3292,7 -3301,8 +3301,8 @@@ static int shmem_create(struct mnt_idma
  /*
   * Link a file..
   */
- static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
+ static int shmem_link(struct dentry *old_dentry, struct inode *dir,
+                     struct dentry *dentry)
  {
        struct inode *inode = d_inode(old_dentry);
        int ret = 0;
        }
  
        dir->i_size += BOGO_DIRENT_SIZE;
 -      dir->i_mtime = inode_set_ctime_to_ts(dir,
 -                                           inode_set_ctime_current(inode));
 +      inode_set_mtime_to_ts(dir,
 +                            inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode)));
        inode_inc_iversion(dir);
        inc_nlink(inode);
        ihold(inode);   /* New dentry reference */
-       dget(dentry);           /* Extra pinning count for the created dentry */
+       dget(dentry);   /* Extra pinning count for the created dentry */
        d_instantiate(dentry, inode);
  out:
        return ret;
@@@ -3339,11 -3349,11 +3349,11 @@@ static int shmem_unlink(struct inode *d
        simple_offset_remove(shmem_get_offset_ctx(dir), dentry);
  
        dir->i_size -= BOGO_DIRENT_SIZE;
 -      dir->i_mtime = inode_set_ctime_to_ts(dir,
 -                                           inode_set_ctime_current(inode));
 +      inode_set_mtime_to_ts(dir,
 +                            inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode)));
        inode_inc_iversion(dir);
        drop_nlink(inode);
-       dput(dentry);   /* Undo the count from "create" - this does all the work */
+       dput(dentry);   /* Undo the count from "create" - does all the work */
        return 0;
  }
  
@@@ -3453,7 -3463,6 +3463,6 @@@ static int shmem_symlink(struct mnt_idm
  
        inode = shmem_get_inode(idmap, dir->i_sb, dir, S_IFLNK | 0777, 0,
                                VM_NORESERVE);
        if (IS_ERR(inode))
                return PTR_ERR(inode);
  
                folio_put(folio);
        }
        dir->i_size += BOGO_DIRENT_SIZE;
 -      dir->i_mtime = inode_set_ctime_current(dir);
 +      inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
        inode_inc_iversion(dir);
        d_instantiate(dentry, inode);
        dget(dentry);
@@@ -3507,8 -3516,7 +3516,7 @@@ static void shmem_put_link(void *arg
        folio_put(arg);
  }
  
- static const char *shmem_get_link(struct dentry *dentry,
-                                 struct inode *inode,
+ static const char *shmem_get_link(struct dentry *dentry, struct inode *inode,
                                  struct delayed_call *done)
  {
        struct folio *folio = NULL;
@@@ -3582,8 -3590,7 +3590,7 @@@ static int shmem_fileattr_set(struct mn
   * Callback for security_inode_init_security() for acquiring xattrs.
   */
  static int shmem_initxattrs(struct inode *inode,
-                           const struct xattr *xattr_array,
-                           void *fs_info)
+                           const struct xattr *xattr_array, void *fs_info)
  {
        struct shmem_inode_info *info = SHMEM_I(inode);
        struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
@@@ -3714,7 -3721,7 +3721,7 @@@ static const struct xattr_handler shmem
        .set = shmem_xattr_handler_set,
  };
  
 -static const struct xattr_handler *shmem_xattr_handlers[] = {
 +static const struct xattr_handler * const shmem_xattr_handlers[] = {
        &shmem_security_xattr_handler,
        &shmem_trusted_xattr_handler,
        &shmem_user_xattr_handler,
@@@ -3767,7 -3774,6 +3774,6 @@@ static struct dentry *shmem_find_alias(
        return alias ?: d_find_any_alias(inode);
  }
  
  static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
                struct fid *fid, int fh_len, int fh_type)
  {
@@@ -4351,8 -4357,8 +4357,8 @@@ static int shmem_fill_super(struct supe
        }
  #endif /* CONFIG_TMPFS_QUOTA */
  
-       inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, S_IFDIR | sbinfo->mode, 0,
-                               VM_NORESERVE);
+       inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL,
+                               S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
        if (IS_ERR(inode)) {
                error = PTR_ERR(inode);
                goto failed;
@@@ -4585,11 -4591,7 +4591,7 @@@ static struct file_system_type shmem_fs
        .parameters     = shmem_fs_parameters,
  #endif
        .kill_sb        = kill_litter_super,
- #ifdef CONFIG_SHMEM
        .fs_flags       = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
- #else
-       .fs_flags       = FS_USERNS_MOUNT,
- #endif
  };
  
  void __init shmem_init(void)
@@@ -4655,11 -4657,9 +4657,9 @@@ static ssize_t shmem_enabled_show(struc
  
        for (i = 0; i < ARRAY_SIZE(values); i++) {
                len += sysfs_emit_at(buf, len,
-                                    shmem_huge == values[i] ? "%s[%s]" : "%s%s",
-                                    i ? " " : "",
-                                    shmem_format_huge(values[i]));
+                               shmem_huge == values[i] ? "%s[%s]" : "%s%s",
+                               i ? " " : "", shmem_format_huge(values[i]));
        }
        len += sysfs_emit_at(buf, len, "\n");
  
        return len;
@@@ -4756,8 -4756,9 +4756,9 @@@ EXPORT_SYMBOL_GPL(shmem_truncate_range)
  #define shmem_acct_size(flags, size)          0
  #define shmem_unacct_size(flags, size)                do {} while (0)
  
- static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir,
-                                           umode_t mode, dev_t dev, unsigned long flags)
+ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
+                               struct super_block *sb, struct inode *dir,
+                               umode_t mode, dev_t dev, unsigned long flags)
  {
        struct inode *inode = ramfs_get_inode(sb, dir, mode, dev);
        return inode ? inode : ERR_PTR(-ENOSPC);
  
  /* common code */
  
- static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, loff_t size,
-                                      unsigned long flags, unsigned int i_flags)
+ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name,
+                       loff_t size, unsigned long flags, unsigned int i_flags)
  {
        struct inode *inode;
        struct file *res;
  
        inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL,
                                S_IFREG | S_IRWXUGO, 0, flags);
        if (IS_ERR(inode)) {
                shmem_unacct_size(flags, size);
                return ERR_CAST(inode);
@@@ -4897,7 -4897,7 +4897,7 @@@ struct folio *shmem_read_folio_gfp(stru
  
        BUG_ON(!shmem_mapping(mapping));
        error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE,
-                                 gfp, NULL, NULL, NULL);
+                                   gfp, NULL, NULL);
        if (error)
                return ERR_PTR(error);
  
diff --combined mm/util.c
index 6eddd891198eec0f68f6dd1764e3f7a6e7308e5f,eefa0336d38c2a84d24695e3a0896512209a434d..aa01f6ea5a75b7add33836dbe1d66c98dbd2c2e6
+++ b/mm/util.c
@@@ -799,6 -799,7 +799,7 @@@ void folio_copy(struct folio *dst, stru
                cond_resched();
        }
  }
+ EXPORT_SYMBOL(folio_copy);
  
  int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
  int sysctl_overcommit_ratio __read_mostly = 50;
@@@ -1060,8 -1061,10 +1061,8 @@@ void mem_dump_obj(void *object
  {
        const char *type;
  
 -      if (kmem_valid_obj(object)) {
 -              kmem_dump_obj(object);
 +      if (kmem_dump_obj(object))
                return;
 -      }
  
        if (vmalloc_dump_obj(object))
                return;
diff --combined net/sunrpc/auth.c
index 814b0169f972304e105e1a63dfcf6ce59f3428cd,c9c270eececc42153dd503a8d3573fab7e7f8c86..7bfe7d9a32aa601d352038b105dbcfe3604a0534
@@@ -769,14 -769,9 +769,14 @@@ int rpcauth_wrap_req(struct rpc_task *t
   * @task: controlling RPC task
   * @xdr: xdr_stream containing RPC Reply header
   *
 - * On success, @xdr is updated to point past the verifier and
 - * zero is returned. Otherwise, @xdr is in an undefined state
 - * and a negative errno is returned.
 + * Return values:
 + *   %0: Verifier is valid. @xdr now points past the verifier.
 + *   %-EIO: Verifier is corrupted or message ended early.
 + *   %-EACCES: Verifier is intact but not valid.
 + *   %-EPROTONOSUPPORT: Server does not support the requested auth type.
 + *
 + * When a negative errno is returned, @xdr is left in an undefined
 + * state.
   */
  int
  rpcauth_checkverf(struct rpc_task *task, struct xdr_stream *xdr)
@@@ -866,11 -861,7 +866,7 @@@ rpcauth_uptodatecred(struct rpc_task *t
                test_bit(RPCAUTH_CRED_UPTODATE, &cred->cr_flags) != 0;
  }
  
- static struct shrinker rpc_cred_shrinker = {
-       .count_objects = rpcauth_cache_shrink_count,
-       .scan_objects = rpcauth_cache_shrink_scan,
-       .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *rpc_cred_shrinker;
  
  int __init rpcauth_init_module(void)
  {
        err = rpc_init_authunix();
        if (err < 0)
                goto out1;
-       err = register_shrinker(&rpc_cred_shrinker, "sunrpc_cred");
-       if (err < 0)
+       rpc_cred_shrinker = shrinker_alloc(0, "sunrpc_cred");
+       if (!rpc_cred_shrinker) {
+               err = -ENOMEM;
                goto out2;
+       }
+       rpc_cred_shrinker->count_objects = rpcauth_cache_shrink_count;
+       rpc_cred_shrinker->scan_objects = rpcauth_cache_shrink_scan;
+       shrinker_register(rpc_cred_shrinker);
        return 0;
  out2:
        rpc_destroy_authunix();
@@@ -892,5 -891,5 +896,5 @@@ out1
  void rpcauth_remove_module(void)
  {
        rpc_destroy_authunix();
-       unregister_shrinker(&rpc_cred_shrinker);
+       shrinker_free(rpc_cred_shrinker);
  }
index 9429d361059e0a07cf0d34c872c99903bdb347be,1c61e3c022cb84e3b6f1b49bafd70af12fa5239d..3c9bf0cd82a80dfe4189e273efbd9693f4f61b22
@@@ -7,7 -7,6 +7,7 @@@
  #include <inttypes.h>
  #include <linux/types.h>
  #include <linux/sched.h>
 +#include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
@@@ -104,8 -103,8 +104,8 @@@ static int call_clone3(uint64_t flags, 
        return 0;
  }
  
 -static void test_clone3(uint64_t flags, size_t size, int expected,
 -                     enum test_mode test_mode)
 +static bool test_clone3(uint64_t flags, size_t size, int expected,
 +                      enum test_mode test_mode)
  {
        int ret;
  
        ret = call_clone3(flags, size, test_mode);
        ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
                        getpid(), ret, expected);
 -      if (ret != expected)
 -              ksft_test_result_fail(
 +      if (ret != expected) {
 +              ksft_print_msg(
                        "[%d] Result (%d) is different than expected (%d)\n",
                        getpid(), ret, expected);
 -      else
 -              ksft_test_result_pass(
 -                      "[%d] Result (%d) matches expectation (%d)\n",
 -                      getpid(), ret, expected);
 -}
 -
 -int main(int argc, char *argv[])
 -{
 -      uid_t uid = getuid();
 -
 -      ksft_print_header();
 -      ksft_set_plan(19);
 -      test_clone3_supported();
 -
 -      /* Just a simple clone3() should return 0.*/
 -      test_clone3(0, 0, 0, CLONE3_ARGS_NO_TEST);
 -
 -      /* Do a clone3() in a new PID NS.*/
 -      if (uid == 0)
 -              test_clone3(CLONE_NEWPID, 0, 0, CLONE3_ARGS_NO_TEST);
 -      else
 -              ksft_test_result_skip("Skipping clone3() with CLONE_NEWPID\n");
 +              return false;
 +      }
  
 -      /* Do a clone3() with CLONE_ARGS_SIZE_VER0. */
 -      test_clone3(0, CLONE_ARGS_SIZE_VER0, 0, CLONE3_ARGS_NO_TEST);
 +      return true;
 +}
  
 -      /* Do a clone3() with CLONE_ARGS_SIZE_VER0 - 8 */
 -      test_clone3(0, CLONE_ARGS_SIZE_VER0 - 8, -EINVAL, CLONE3_ARGS_NO_TEST);
 +typedef bool (*filter_function)(void);
 +typedef size_t (*size_function)(void);
  
 -      /* Do a clone3() with sizeof(struct clone_args) + 8 */
 -      test_clone3(0, sizeof(struct __clone_args) + 8, 0, CLONE3_ARGS_NO_TEST);
 +static bool not_root(void)
 +{
 +      if (getuid() != 0) {
 +              ksft_print_msg("Not running as root\n");
 +              return true;
 +      }
  
 -      /* Do a clone3() with exit_signal having highest 32 bits non-zero */
 -      test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_BIG);
 +      return false;
 +}
  
 -      /* Do a clone3() with negative 32-bit exit_signal */
 -      test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG);
++static bool no_timenamespace(void)
++{
++      if (not_root())
++              return true;
 -      /* Do a clone3() with exit_signal not fitting into CSIGNAL mask */
 -      test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG);
++      if (!access("/proc/self/ns/time", F_OK))
++              return false;
 -      /* Do a clone3() with NSIG < exit_signal < CSIG */
 -      test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG);
++      ksft_print_msg("Time namespaces are not supported\n");
++      return true;
++}
 -      test_clone3(0, sizeof(struct __clone_args) + 8, 0, CLONE3_ARGS_ALL_0);
 +static size_t page_size_plus_8(void)
 +{
 +      return getpagesize() + 8;
 +}
  
 -      test_clone3(0, sizeof(struct __clone_args) + 16, -E2BIG,
 -                      CLONE3_ARGS_ALL_0);
 +struct test {
 +      const char *name;
 +      uint64_t flags;
 +      size_t size;
 +      size_function size_function;
 +      int expected;
 +      enum test_mode test_mode;
 +      filter_function filter;
 +};
  
 -      test_clone3(0, sizeof(struct __clone_args) * 2, -E2BIG,
 -                      CLONE3_ARGS_ALL_0);
 +static const struct test tests[] = {
 +      {
 +              .name = "simple clone3()",
 +              .flags = 0,
 +              .size = 0,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "clone3() in a new PID_NS",
 +              .flags = CLONE_NEWPID,
 +              .size = 0,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +              .filter = not_root,
 +      },
 +      {
 +              .name = "CLONE_ARGS_SIZE_VER0",
 +              .flags = 0,
 +              .size = CLONE_ARGS_SIZE_VER0,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "CLONE_ARGS_SIZE_VER0 - 8",
 +              .flags = 0,
 +              .size = CLONE_ARGS_SIZE_VER0 - 8,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "sizeof(struct clone_args) + 8",
 +              .flags = 0,
 +              .size = sizeof(struct __clone_args) + 8,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "exit_signal with highest 32 bits non-zero",
 +              .flags = 0,
 +              .size = 0,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_BIG,
 +      },
 +      {
 +              .name = "negative 32-bit exit_signal",
 +              .flags = 0,
 +              .size = 0,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
 +      },
 +      {
 +              .name = "exit_signal not fitting into CSIGNAL mask",
 +              .flags = 0,
 +              .size = 0,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
 +      },
 +      {
 +              .name = "NSIG < exit_signal < CSIG",
 +              .flags = 0,
 +              .size = 0,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
 +      },
 +      {
 +              .name = "Arguments sizeof(struct clone_args) + 8",
 +              .flags = 0,
 +              .size = sizeof(struct __clone_args) + 8,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_ALL_0,
 +      },
 +      {
 +              .name = "Arguments sizeof(struct clone_args) + 16",
 +              .flags = 0,
 +              .size = sizeof(struct __clone_args) + 16,
 +              .expected = -E2BIG,
 +              .test_mode = CLONE3_ARGS_ALL_0,
 +      },
 +      {
 +              .name = "Arguments sizeof(struct clone_arg) * 2",
 +              .flags = 0,
 +              .size = sizeof(struct __clone_args) + 16,
 +              .expected = -E2BIG,
 +              .test_mode = CLONE3_ARGS_ALL_0,
 +      },
 +      {
 +              .name = "Arguments > page size",
 +              .flags = 0,
 +              .size_function = page_size_plus_8,
 +              .expected = -E2BIG,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "CLONE_ARGS_SIZE_VER0 in a new PID NS",
 +              .flags = CLONE_NEWPID,
 +              .size = CLONE_ARGS_SIZE_VER0,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +              .filter = not_root,
 +      },
 +      {
 +              .name = "CLONE_ARGS_SIZE_VER0 - 8 in a new PID NS",
 +              .flags = CLONE_NEWPID,
 +              .size = CLONE_ARGS_SIZE_VER0 - 8,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "sizeof(struct clone_args) + 8 in a new PID NS",
 +              .flags = CLONE_NEWPID,
 +              .size = sizeof(struct __clone_args) + 8,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +              .filter = not_root,
 +      },
 +      {
 +              .name = "Arguments > page size in a new PID NS",
 +              .flags = CLONE_NEWPID,
 +              .size_function = page_size_plus_8,
 +              .expected = -E2BIG,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +      {
 +              .name = "New time NS",
 +              .flags = CLONE_NEWTIME,
 +              .size = 0,
 +              .expected = 0,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
++              .filter = no_timenamespace,
 +      },
 +      {
 +              .name = "exit signal (SIGCHLD) in flags",
 +              .flags = SIGCHLD,
 +              .size = 0,
 +              .expected = -EINVAL,
 +              .test_mode = CLONE3_ARGS_NO_TEST,
 +      },
 +};
  
 -      /* Do a clone3() with > page size */
 -      test_clone3(0, getpagesize() + 8, -E2BIG, CLONE3_ARGS_NO_TEST);
 +int main(int argc, char *argv[])
 +{
 +      size_t size;
 +      int i;
  
 -      /* Do a clone3() with CLONE_ARGS_SIZE_VER0 in a new PID NS. */
 -      if (uid == 0)
 -              test_clone3(CLONE_NEWPID, CLONE_ARGS_SIZE_VER0, 0,
 -                              CLONE3_ARGS_NO_TEST);
 -      else
 -              ksft_test_result_skip("Skipping clone3() with CLONE_NEWPID\n");
 +      ksft_print_header();
 +      ksft_set_plan(ARRAY_SIZE(tests));
 +      test_clone3_supported();
  
 -      /* Do a clone3() with CLONE_ARGS_SIZE_VER0 - 8 in a new PID NS */
 -      test_clone3(CLONE_NEWPID, CLONE_ARGS_SIZE_VER0 - 8, -EINVAL,
 -                      CLONE3_ARGS_NO_TEST);
 +      for (i = 0; i < ARRAY_SIZE(tests); i++) {
 +              if (tests[i].filter && tests[i].filter()) {
 +                      ksft_test_result_skip("%s\n", tests[i].name);
 +                      continue;
 +              }
  
 -      /* Do a clone3() with sizeof(struct clone_args) + 8 in a new PID NS */
 -      if (uid == 0)
 -              test_clone3(CLONE_NEWPID, sizeof(struct __clone_args) + 8, 0,
 -                              CLONE3_ARGS_NO_TEST);
 -      else
 -              ksft_test_result_skip("Skipping clone3() with CLONE_NEWPID\n");
 +              if (tests[i].size_function)
 +                      size = tests[i].size_function();
 +              else
 +                      size = tests[i].size;
  
 -      /* Do a clone3() with > page size in a new PID NS */
 -      test_clone3(CLONE_NEWPID, getpagesize() + 8, -E2BIG,
 -                      CLONE3_ARGS_NO_TEST);
 +              ksft_print_msg("Running test '%s'\n", tests[i].name);
  
 -      /* Do a clone3() in a new time namespace */
 -      if (access("/proc/self/ns/time", F_OK) == 0) {
 -              test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
 -      } else {
 -              ksft_print_msg("Time namespaces are not supported\n");
 -              ksft_test_result_skip("Skipping clone3() with CLONE_NEWTIME\n");
 +              ksft_test_result(test_clone3(tests[i].flags, size,
 +                                           tests[i].expected,
 +                                           tests[i].test_mode),
 +                               "%s\n", tests[i].name);
        }
  
 -      /* Do a clone3() with exit signal (SIGCHLD) in flags */
 -      test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST);
 -
        ksft_finished();
  }
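
/*
 * Editor's note: illustrative sketch, not part of the patch above.
 * Extending the table-driven runner is just a matter of appending an
 * entry; the entry below is hypothetical.  A case may combine a
 * runtime-computed size with a skip filter:
 *
 *      {
 *              .name = "hypothetical: > page size, root only",
 *              .flags = 0,
 *              .size_function = page_size_plus_8,
 *              .expected = -E2BIG,
 *              .test_mode = CLONE3_ARGS_NO_TEST,
 *              .filter = not_root,
 *      },
 */
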
index 60a9a305aef071ee7aad5210fe456701d240c28a,56f0230a8b92d37a4eb499ba69062f16393b02d4..56f0230a8b92d37a4eb499ba69062f16393b02d4
mode 100755,100644..100755
@@@ -175,6 -175,7 +175,7 @@@ test_scheme(
        ensure_dir "$scheme_dir" "exist"
        ensure_file "$scheme_dir/action" "exist" "600"
        test_access_pattern "$scheme_dir/access_pattern"
+       ensure_file "$scheme_dir/apply_interval_us" "exist" "600"
        test_quotas "$scheme_dir/quotas"
        test_watermarks "$scheme_dir/watermarks"
        test_filters "$scheme_dir/filters"
index 1dbfcf6df2558558399354292a4470c5897f940f,1f836e670a37087890197bda13084537d9c77c4f..1d4c1589c3055d3bb22eebe2c02fa7b015e4a665
  #define VALIDATION_NO_THRESHOLD 0     /* Verify the entire region */
  
  #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
+ #define SIZE_MB(m) ((size_t)m * (1024 * 1024))
+ #define SIZE_KB(k) ((size_t)k * 1024)
  
  struct config {
        unsigned long long src_alignment;
        unsigned long long dest_alignment;
        unsigned long long region_size;
        int overlapping;
+       int dest_preamble_size;
  };
  
  struct test {
@@@ -44,6 -47,7 +47,7 @@@ enum 
        _1MB = 1ULL << 20,
        _2MB = 2ULL << 20,
        _4MB = 4ULL << 20,
+       _5MB = 5ULL << 20,
        _1GB = 1ULL << 30,
        _2GB = 2ULL << 30,
        PMD = _2MB,
@@@ -145,6 -149,60 +149,60 @@@ static bool is_range_mapped(FILE *maps_
        return success;
  }
  
+ /*
+  * Returns the start address of the mapping on success, else returns
+  * NULL on failure.
+  */
+ static void *get_source_mapping(struct config c)
+ {
+       unsigned long long addr = 0ULL;
+       void *src_addr = NULL;
+       unsigned long long mmap_min_addr;
+       mmap_min_addr = get_mmap_min_addr();
+       /*
+        * For some tests, we need to not have any mappings below the
+        * source mapping. Add some headroom to mmap_min_addr for this.
+        */
+       mmap_min_addr += 10 * _4MB;
+ retry:
+       addr += c.src_alignment;
+       if (addr < mmap_min_addr)
+               goto retry;
+       src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
+                                       MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+                                       -1, 0);
+       if (src_addr == MAP_FAILED) {
+               if (errno == EPERM || errno == EEXIST)
+                       goto retry;
+               goto error;
+       }
+       /*
+        * Check that the address is aligned to the specified alignment.
+        * Addresses which have alignments that are multiples of that
+        * specified are not considered valid. For instance, 1GB address is
+        * 2MB-aligned, however it will not be considered valid for a
+        * requested alignment of 2MB. This is done to reduce coincidental
+        * alignment in the tests.
+        */
+       if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
+                       !((unsigned long long) src_addr & c.src_alignment)) {
+               munmap(src_addr, c.region_size);
+               goto retry;
+       }
+       if (!src_addr)
+               goto error;
+       return src_addr;
+ error:
+       ksft_print_msg("Failed to map source region: %s\n",
+                       strerror(errno));
+       return NULL;
+ }
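
/*
 * Editor's note: illustrative sketch, not part of the patch above.
 * The alignment check in get_source_mapping() rejects addresses that are
 * "too aligned".  For a requested 2MB alignment, 0x40000000 (1GB) is
 * rejected: its low 21 bits are zero but bit 21 is also clear, so it is
 * 4MB-aligned (and more).  0x40200000 (1GB + 2MB) is accepted: 2MB-aligned
 * with bit 21 set, hence not 4MB-aligned.  The predicate, pulled out as a
 * hypothetical helper:
 */
static int is_validly_aligned(unsigned long long addr, unsigned long long align)
{
        return !(addr & (align - 1)) &&         /* aligned to 'align'... */
                (addr & align);                 /* ...but not to 2 * align */
}
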
  /*
   * This test validates that merge is called when expanding a mapping.
   * Mapping containing three pages is created, middle page is unmapped
@@@ -225,59 -283,83 +283,83 @@@ out
  }
  
  /*
-  * Returns the start address of the mapping on success, else returns
-  * NULL on failure.
+  * Verify that an mremap within a range does not cause corruption
+  * of unrelated part of range.
+  *
+  * Consider the following range which is 2MB aligned and is
+  * a part of a larger 20MB range which is not shown. Each
+  * character is 256KB below making the source and destination
+  * 2MB each. The lower case letters are moved (s to d) and the
+  * upper case letters are not moved. The below test verifies
+  * that the upper case S letters are not corrupted by the
+  * adjacent mremap.
+  *
+  * |DDDDddddSSSSssss|
   */
- static void *get_source_mapping(struct config c)
+ static void mremap_move_within_range(char pattern_seed)
  {
-       unsigned long long addr = 0ULL;
-       void *src_addr = NULL;
-       unsigned long long mmap_min_addr;
+       char *test_name = "mremap mremap move within range";
+       void *src, *dest;
+       int i, success = 1;
+       size_t size = SIZE_MB(20);
+       void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+       if (ptr == MAP_FAILED) {
+               perror("mmap");
+               success = 0;
+               goto out;
+       }
+       memset(ptr, 0, size);
  
-       mmap_min_addr = get_mmap_min_addr();
+       src = ptr + SIZE_MB(6);
+       src = (void *)((unsigned long)src & ~(SIZE_MB(2) - 1));
  
- retry:
-       addr += c.src_alignment;
-       if (addr < mmap_min_addr)
-               goto retry;
+       /* Set byte pattern for source block. */
+       srand(pattern_seed);
+       for (i = 0; i < SIZE_MB(2); i++) {
+               ((char *)src)[i] = (char) rand();
+       }
  
-       src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-                                       MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
-                                       -1, 0);
-       if (src_addr == MAP_FAILED) {
-               if (errno == EPERM || errno == EEXIST)
-                       goto retry;
-               goto error;
+       dest = src - SIZE_MB(2);
+       void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+                                                  MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+       if (new_ptr == MAP_FAILED) {
+               perror("mremap");
+               success = 0;
+               goto out;
        }
-       /*
-        * Check that the address is aligned to the specified alignment.
-        * Addresses which have alignments that are multiples of that
-        * specified are not considered valid. For instance, 1GB address is
-        * 2MB-aligned, however it will not be considered valid for a
-        * requested alignment of 2MB. This is done to reduce coincidental
-        * alignment in the tests.
-        */
-       if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
-                       !((unsigned long long) src_addr & c.src_alignment)) {
-               munmap(src_addr, c.region_size);
-               goto retry;
+       /* Verify byte pattern after remapping */
+       srand(pattern_seed);
+       for (i = 0; i < SIZE_MB(1); i++) {
+               char c = (char) rand();
+               if (((char *)src)[i] != c) {
+                       ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+                                      i);
+                       ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+                                       ((char *) src)[i] & 0xff);
+                       success = 0;
+               }
        }
  
-       if (!src_addr)
-               goto error;
+ out:
+       if (munmap(ptr, size) == -1)
+               perror("munmap");
  
-       return src_addr;
- error:
-       ksft_print_msg("Failed to map source region: %s\n",
-                       strerror(errno));
-       return NULL;
+       if (success)
+               ksft_test_result_pass("%s\n", test_name);
+       else
+               ksft_test_result_fail("%s\n", test_name);
  }
  
  /* Returns the time taken for the remap on success else returns -1. */
  static long long remap_region(struct config c, unsigned int threshold_mb,
                              char pattern_seed)
  {
-       void *addr, *src_addr, *dest_addr;
+       void *addr, *src_addr, *dest_addr, *dest_preamble_addr;
        unsigned long long i;
        struct timespec t_start = {0, 0}, t_end = {0, 0};
        long long  start_ns, end_ns, align_mask, ret, offset;
                goto out;
        }
  
-       /* Set byte pattern */
+       /* Set byte pattern for source block. */
        srand(pattern_seed);
        for (i = 0; i < threshold; i++)
                memset((char *) src_addr + i, (char) rand(), 1);
        addr = (void *) (((unsigned long long) src_addr + c.region_size
                          + offset) & align_mask);
  
+       /* Remap after the destination block preamble. */
+       addr += c.dest_preamble_size;
        /* See comment in get_source_mapping() */
        if (!((unsigned long long) addr & c.dest_alignment))
                addr = (void *) ((unsigned long long) addr | c.dest_alignment);
                if (addr + c.dest_alignment < addr) {
                        ksft_print_msg("Couldn't find a valid region to remap to\n");
                        ret = -1;
-                       goto out;
+                       goto clean_up_src;
                }
                addr += c.dest_alignment;
        }
  
+       if (c.dest_preamble_size) {
+               dest_preamble_addr = mmap((void *) addr - c.dest_preamble_size, c.dest_preamble_size,
+                                         PROT_READ | PROT_WRITE,
+                                         MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+                                                       -1, 0);
+               if (dest_preamble_addr == MAP_FAILED) {
+                       ksft_print_msg("Failed to map dest preamble region: %s\n",
+                                       strerror(errno));
+                       ret = -1;
+                       goto clean_up_src;
+               }
+               /* Set byte pattern for the dest preamble block. */
+               srand(pattern_seed);
+               for (i = 0; i < c.dest_preamble_size; i++)
+                       memset((char *) dest_preamble_addr + i, (char) rand(), 1);
+       }
        clock_gettime(CLOCK_MONOTONIC, &t_start);
        dest_addr = mremap(src_addr, c.region_size, c.region_size,
                                          MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr);
        if (dest_addr == MAP_FAILED) {
                ksft_print_msg("mremap failed: %s\n", strerror(errno));
                ret = -1;
-               goto clean_up_src;
+               goto clean_up_dest_preamble;
        }
  
        /* Verify byte pattern after remapping */
                char c = (char) rand();
  
                if (((char *) dest_addr)[i] != c) {
 -                      ksft_print_msg("Data after remap doesn't match at offset %d\n",
 +                      ksft_print_msg("Data after remap doesn't match at offset %llu\n",
                                       i);
                        ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
                                        ((char *) dest_addr)[i] & 0xff);
                }
        }
  
+       /* Verify the dest preamble byte pattern after remapping */
+       if (c.dest_preamble_size) {
+               srand(pattern_seed);
+               for (i = 0; i < c.dest_preamble_size; i++) {
+                       char c = (char) rand();
+                       if (((char *) dest_preamble_addr)[i] != c) {
+                               ksft_print_msg("Preamble data after remap doesn't match at offset %d\n",
+                                              i);
+                               ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+                                              ((char *) dest_preamble_addr)[i] & 0xff);
+                               ret = -1;
+                               goto clean_up_dest;
+                       }
+               }
+       }
        start_ns = t_start.tv_sec * NS_PER_SEC + t_start.tv_nsec;
        end_ns = t_end.tv_sec * NS_PER_SEC + t_end.tv_nsec;
        ret = end_ns - start_ns;
   */
  clean_up_dest:
        munmap(dest_addr, c.region_size);
+ clean_up_dest_preamble:
+       if (c.dest_preamble_size && dest_preamble_addr)
+               munmap(dest_preamble_addr, c.dest_preamble_size);
  clean_up_src:
        munmap(src_addr, c.region_size);
  out:
        return ret;
  }
  
+ /*
+  * Verify that an mremap aligning down does not destroy
+  * the beginning of the mapping just because the aligned
+  * down address landed on a mapping that maybe does not exist.
+  */
+ static void mremap_move_1mb_from_start(char pattern_seed)
+ {
+       char *test_name = "mremap move 1mb from start at 1MB+256KB aligned src";
+       void *src = NULL, *dest = NULL;
+       int i, success = 1;
+       /* Config to reuse get_source_mapping() to do an aligned mmap. */
+       struct config c = {
+               .src_alignment = SIZE_MB(1) + SIZE_KB(256),
+               .region_size = SIZE_MB(6)
+       };
+       src = get_source_mapping(c);
+       if (!src) {
+               success = 0;
+               goto out;
+       }
+       c.src_alignment = SIZE_MB(1) + SIZE_KB(256);
+       dest = get_source_mapping(c);
+       if (!dest) {
+               success = 0;
+               goto out;
+       }
+       /* Set byte pattern for source block. */
+       srand(pattern_seed);
+       for (i = 0; i < SIZE_MB(2); i++) {
+               ((char *)src)[i] = (char) rand();
+       }
+       /*
+        * Unmap the beginning of dest so that the aligned address
+        * falls on no mapping.
+        */
+       munmap(dest, SIZE_MB(1));
+       void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+                                                  MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+       if (new_ptr == MAP_FAILED) {
+               perror("mremap");
+               success = 0;
+               goto out;
+       }
+       /* Verify byte pattern after remapping */
+       srand(pattern_seed);
+       for (i = 0; i < SIZE_MB(1); i++) {
+               char c = (char) rand();
+               if (((char *)src)[i] != c) {
+                       ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+                                      i);
+                       ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+                                       ((char *) src)[i] & 0xff);
+                       success = 0;
+               }
+       }
+ out:
+       if (src && munmap(src, c.region_size) == -1)
+               perror("munmap src");
+       if (dest && munmap(dest, c.region_size) == -1)
+               perror("munmap dest");
+       if (success)
+               ksft_test_result_pass("%s\n", test_name);
+       else
+               ksft_test_result_fail("%s\n", test_name);
+ }
  static void run_mremap_test_case(struct test test_case, int *failures,
                                 unsigned int threshold_mb,
                                 unsigned int pattern_seed)
@@@ -434,7 -634,7 +634,7 @@@ static int parse_args(int argc, char **
        return 0;
  }
  
- #define MAX_TEST 13
+ #define MAX_TEST 15
  #define MAX_PERF_TEST 3
  int main(int argc, char **argv)
  {
        unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
        unsigned int pattern_seed;
        int num_expand_tests = 2;
-       struct test test_cases[MAX_TEST];
+       int num_misc_tests = 2;
+       struct test test_cases[MAX_TEST] = {};
        struct test perf_test_cases[MAX_PERF_TEST];
        int page_size;
        time_t t;
        test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
                                   "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
  
+       /* Src and Dest addr 1MB aligned. 5MB mremap. */
+       test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+                                 "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+       /* Src and Dest addr 1MB aligned. 5MB mremap with a 40MB destination preamble. */
+       test_cases[14] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+                                 "5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble");
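+       /*
+        * A non-zero dest_preamble_size makes remap_region() pattern-fill a
+        * separate destination preamble mapping and verify that it is still
+        * intact after the mremap.
+        */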
+       test_cases[14].config.dest_preamble_size = 10 * _4MB;
        perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
                                        "1GB mremap - Source PTE-aligned, Destination PTE-aligned");
        /*
                                (threshold_mb * _1MB >= _1GB);
  
        ksft_set_plan(ARRAY_SIZE(test_cases) + (run_perf_tests ?
-                     ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests);
+                     ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests + num_misc_tests);
  
        for (i = 0; i < ARRAY_SIZE(test_cases); i++)
                run_mremap_test_case(test_cases[i], &failures, threshold_mb,
                                     pattern_seed);
  
        fclose(maps_fp);
  
+       mremap_move_within_range(pattern_seed);
+       mremap_move_1mb_from_start(pattern_seed);
        if (run_perf_tests) {
                ksft_print_msg("\n%s\n",
                 "mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:");