relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
+ memory_hugetlb_accounting
+ Count HugeTLB memory usage towards the cgroup's overall
+ memory usage for the memory controller (for the purpose of
+ statistics reporting and memory protetion). This is a new
+ behavior that could regress existing setups, so it must be
+ explicitly opted in with this mount option.
+
+ A few caveats to keep in mind:
+
+ * There is no HugeTLB pool management involved in the memory
+ controller. The pre-allocated pool does not belong to anyone.
+ Specifically, when a new HugeTLB folio is allocated to
+ the pool, it is not accounted for from the perspective of the
+ memory controller. It is only charged to a cgroup when it is
+ actually used (for e.g at page fault time). Host memory
+ overcommit management has to consider this when configuring
+ hard limits. In general, HugeTLB pool management should be
+ done via other mechanisms (such as the HugeTLB controller).
+ * Failure to charge a HugeTLB folio to the memory controller
+ results in SIGBUS. This could happen even if the HugeTLB pool
+ still has pages available (but the cgroup limit is hit and
+ reclaim attempt fails).
+ * Charging HugeTLB memory towards the memory controller affects
+ memory protection and reclaim dynamics. Any userspace tuning
+ (of low, min limits for e.g) needs to take this into account.
+ * HugeTLB pages utilized while this option is not selected
+ will not be tracked by the memory controller (even if cgroup
+ v2 is remounted later on).
+
Organizing Processes and Threads
--------------------------------
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.
+Currently, the following controllers are threaded and can be enabled
+in a threaded cgroup::
+
+- cpu
+- cpuset
+- perf_event
+- pids
[Un]populated Notification
--------------------------
collapsing an existing range of pages. This counter is not
present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+ thp_swpout (npn)
+ Number of transparent hugepages which are swapout in one piece
+ without splitting.
+
+ thp_swpout_fallback (npn)
+ Number of transparent hugepages which were split before swapout.
+ Usually because failed to allocate some continuous swap space
+ for the huge page.
+
memory.numa_stat
A read-only nested-keyed file which exists on non-root cgroups.
~~~~~~~~~~~
A single attribute controls the behavior of the I/O priority cgroup policy,
-namely the blkio.prio.class attribute. The following values are accepted for
+namely the io.prio.class attribute. The following values are accepted for
that attribute:
no-change
+----------------+---+
| no-change | 0 |
+----------------+---+
-| rt-to-be | 2 |
+| promote-to-rt | 1 |
++----------------+---+
+| restrict-to-be | 2 |
+----------------+---+
-| all-to-idle | 3 |
+| idle | 3 |
+----------------+---+
The numerical value that corresponds to each I/O priority class is as follows:
- If I/O priority class policy is promote-to-rt, change the request I/O
priority class to IOPRIO_CLASS_RT and change the request I/O priority
level to 4.
-- If I/O priorityt class is not promote-to-rt, translate the I/O priority
+- If I/O priority class policy is not promote-to-rt, translate the I/O priority
class policy into a number, then change the request I/O priority class
into the maximum of the I/O priority class policy number and the numerical
I/O priority class.
Its value will be affected by memory nodes hotplug events.
+ cpuset.cpus.exclusive
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists all the exclusive CPUs that are allowed to be used
+ to create a new cpuset partition. Its value is not used
+ unless the cgroup becomes a valid partition root. See the
+ "cpuset.cpus.partition" section below for a description of what
+ a cpuset partition is.
+
+ When the cgroup becomes a partition root, the actual exclusive
+ CPUs that are allocated to that partition are listed in
+ "cpuset.cpus.exclusive.effective" which may be different
+ from "cpuset.cpus.exclusive". If "cpuset.cpus.exclusive"
+ has previously been set, "cpuset.cpus.exclusive.effective"
+ is always a subset of it.
+
+ Users can manually set it to a value that is different from
+ "cpuset.cpus". The only constraint in setting it is that the
+ list of CPUs must be exclusive with respect to its sibling.
+
+ For a parent cgroup, any one of its exclusive CPUs can only
+ be distributed to at most one of its child cgroups. Having an
+ exclusive CPU appearing in two or more of its child cgroups is
+ not allowed (the exclusivity rule). A value that violates the
+ exclusivity rule will be rejected with a write error.
+
+ The root cgroup is a partition root and all its available CPUs
+ are in its exclusive CPU set.
+
+ cpuset.cpus.exclusive.effective
+ A read-only multiple values file which exists on all non-root
+ cpuset-enabled cgroups.
+
+ This file shows the effective set of exclusive CPUs that
+ can be used to create a partition root. The content of this
+ file will always be a subset of "cpuset.cpus" and its parent's
+ "cpuset.cpus.exclusive.effective" if its parent is not the root
+ cgroup. It will also be a subset of "cpuset.cpus.exclusive"
+ if it is set. If "cpuset.cpus.exclusive" is not set, it is
+ treated to have an implicit value of "cpuset.cpus" in the
+ formation of local partition.
+
cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
"isolated" Partition root without load balancing
========== =====================================
- The root cgroup is always a partition root and its state
- cannot be changed. All other non-root cgroups start out as
- "member".
+ A cpuset partition is a collection of cpuset-enabled cgroups with
+ a partition root at the top of the hierarchy and its descendants
+ except those that are separate partition roots themselves and
+ their descendants. A partition has exclusive access to the
+ set of exclusive CPUs allocated to it. Other cgroups outside
+ of that partition cannot use any CPUs in that set.
+
+ There are two types of partitions - local and remote. A local
+ partition is one whose parent cgroup is also a valid partition
+ root. A remote partition is one whose parent cgroup is not a
+ valid partition root itself. Writing to "cpuset.cpus.exclusive"
+ is optional for the creation of a local partition as its
+ "cpuset.cpus.exclusive" file will assume an implicit value that
+ is the same as "cpuset.cpus" if it is not set. Writing the
+ proper "cpuset.cpus.exclusive" values down the cgroup hierarchy
+ before the target partition root is mandatory for the creation
+ of a remote partition.
+
+ Currently, a remote partition cannot be created under a local
+ partition. All the ancestors of a remote partition root except
+ the root cgroup cannot be a partition root.
+
+ The root cgroup is always a partition root and its state cannot
+ be changed. All other non-root cgroups start out as "member".
When set to "root", the current cgroup is the root of a new
- partition or scheduling domain that comprises itself and all
- its descendants except those that are separate partition roots
- themselves and their descendants.
+ partition or scheduling domain. The set of exclusive CPUs is
+ determined by the value of its "cpuset.cpus.exclusive.effective".
- When set to "isolated", the CPUs in that partition root will
+ When set to "isolated", the CPUs in that partition will
be in an isolated state without any load balancing from the
scheduler. Tasks placed in such a partition with multiple
CPUs should be carefully distributed and bound to each of the
individual CPUs for optimal performance.
- The value shown in "cpuset.cpus.effective" of a partition root
- is the CPUs that the partition root can dedicate to a potential
- new child partition root. The new child subtracts available
- CPUs from its parent "cpuset.cpus.effective".
-
A partition root ("root" or "isolated") can be in one of the
two possible states - valid or invalid. An invalid partition
root is in a degraded state where some state information may
In the case of an invalid partition root, a descriptive string on
why the partition is invalid is included within parentheses.
- For a partition root to become valid, the following conditions
+ For a local partition root to be valid, the following conditions
must be met.
- 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
- are not shared by any of its siblings (exclusivity rule).
- 2) The parent cgroup is a valid partition root.
- 3) The "cpuset.cpus" is not empty and must contain at least
- one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
- 4) The "cpuset.cpus.effective" cannot be empty unless there is
+ 1) The parent cgroup is a valid partition root.
+ 2) The "cpuset.cpus.exclusive.effective" file cannot be empty,
+ though it may contain offline CPUs.
+ 3) The "cpuset.cpus.effective" cannot be empty unless there is
no task associated with this partition.
- External events like hotplug or changes to "cpuset.cpus" can
- cause a valid partition root to become invalid and vice versa.
- Note that a task cannot be moved to a cgroup with empty
- "cpuset.cpus.effective".
+ For a remote partition root to be valid, all the above conditions
+ except the first one must be met.
- For a valid partition root with the sibling cpu exclusivity
- rule enabled, changes made to "cpuset.cpus" that violate the
- exclusivity rule will invalidate the partition as well as its
- sibling partitions with conflicting cpuset.cpus values. So
- care must be taking in changing "cpuset.cpus".
+ External events like hotplug or changes to "cpuset.cpus" or
+ "cpuset.cpus.exclusive" can cause a valid partition root to
+ become invalid and vice versa. Note that a task cannot be
+ moved to a cgroup with empty "cpuset.cpus.effective".
A valid non-root parent partition may distribute out all its CPUs
- to its child partitions when there is no task associated with it.
+ to its child local partitions when there is no task associated
+ with it.
- Care must be taken to change a valid partition root to
- "member" as all its child partitions, if present, will become
+ Care must be taken to change a valid partition root to "member"
+ as all its child local partitions, if present, will become
invalid causing disruption to tasks running in those child
partitions. These inactivated partitions could be recovered if
their parent is switched back to a partition root with a proper
- set of "cpuset.cpus".
+ value in "cpuset.cpus" or "cpuset.cpus.exclusive".
Poll and inotify events are triggered whenever the state of
"cpuset.cpus.partition" changes. That includes changes caused
to "cpuset.cpus.partition" without the need to do continuous
polling.
+ A user can pre-configure certain CPUs to an isolated state
+ with load balancing disabled at boot time with the "isolcpus"
+ kernel boot command line option. If those CPUs are to be put
+ into a partition, they have to be used in an isolated partition.
+
Device controller
-----------------
F: include/linux/acpi_viot.h
ACPI WMI DRIVER
-S: Orphan
+S: Maintained
F: Documentation/driver-api/wmi.rst
F: Documentation/wmi/
F: drivers/platform/x86/wmi.c
ADM8211 WIRELESS DRIVER
S: Orphan
-W: https://wireless.wiki.kernel.org/
F: drivers/net/wireless/admtek/adm8211.*
ADP1653 FLASH CONTROLLER DRIVER
F: include/linux/ccp.h
AMD CRYPTOGRAPHIC COPROCESSOR (CCP) DRIVER - SEV SUPPORT
-M: Brijesh Singh <brijesh.singh@amd.com>
+M: Ashish Kalra <ashish.kalra@amd.com>
S: Supported
APPLETALK NETWORK LAYER
S: Odd fixes
-F: drivers/net/appletalk/
F: include/linux/atalk.h
F: include/uapi/linux/atalk.h
F: net/appletalk/
F: arch/arm64/include/asm/arch_timer.h
F: drivers/clocksource/arm_arch_timer.c
+ARM GENERIC INTERRUPT CONTROLLER DRIVERS
+S: Maintained
+F: Documentation/devicetree/bindings/interrupt-controller/arm,gic*
+F: arch/arm/include/asm/arch_gicv3.h
+F: arch/arm64/include/asm/arch_gicv3.h
+F: drivers/irqchip/irq-gic*.[ch]
+F: include/linux/irqchip/arm-gic*.h
+F: include/linux/irqchip/arm-vgic-info.h
+
ARM HDLCD DRM DRIVER
S: Supported
F: drivers/gpu/drm/arm/display/komeda/
ARM MALI PANFROST DRM DRIVER
S: Supported
T: git git://anongit.freedesktop.org/drm/drm-misc
+F: Documentation/gpu/panfrost.rst
F: drivers/gpu/drm/panfrost/
F: include/uapi/drm/panfrost_drm.h
F: drivers/mmc/host/owl-mmc.c
F: drivers/net/ethernet/actions/
F: drivers/pinctrl/actions/*
-F: drivers/soc/actions/
+F: drivers/pmdomain/actions/
F: include/dt-bindings/power/owl-*
F: include/dt-bindings/reset/actions,*
F: include/linux/soc/actions/
N: sun[x456789]i
N: sun[25]0i
+ARM/AMD PENSANDO ARM64 ARCHITECTURE
+S: Supported
+F: Documentation/devicetree/bindings/*/amd,pensando*
+F: arch/arm64/boot/dts/amd/elba*
+
ARM/Amlogic Meson SoC CLOCK FRAMEWORK
ARM/INTEL IXP4XX ARM ARCHITECTURE
S: Maintained
F: Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml
-F: Documentation/devicetree/bindings/gpio/intel,ixp4xx-gpio.txt
+F: Documentation/devicetree/bindings/gpio/intel,ixp4xx-gpio.yaml
F: Documentation/devicetree/bindings/interrupt-controller/intel,ixp4xx-interrupt.yaml
F: Documentation/devicetree/bindings/memory-controllers/intel,ixp4xx-expansion*
+F: Documentation/devicetree/bindings/rng/intel,ixp46x-rng.yaml
F: Documentation/devicetree/bindings/timer/intel,ixp4xx-timer.yaml
F: arch/arm/boot/dts/intel/ixp/
F: arch/arm/mach-ixp4xx/
F: drivers/bus/intel-ixp4xx-eb.c
+F: drivers/char/hw_random/ixp4xx-rng.c
F: drivers/clocksource/timer-ixp4xx.c
F: drivers/crypto/intel/ixp4xx/ixp4xx_crypto.c
F: drivers/gpio/gpio-ixp4xx.c
F: drivers/irqchip/irq-ixp4xx.c
+F: drivers/net/ethernet/xscale/ixp4xx_eth.c
+F: drivers/net/wan/ixp4xx_hss.c
+F: drivers/soc/ixp4xx/ixp4xx-npe.c
+F: drivers/soc/ixp4xx/ixp4xx-qmgr.c
+F: include/linux/soc/ixp4xx/npe.h
+F: include/linux/soc/ixp4xx/qmgr.h
ARM/INTEL KEEMBAY ARCHITECTURE
ARM/Mediatek SoC support
C: irc://irc.oftc.net/bcache
F: drivers/md/bcache/
+BCACHEFS
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: fs/bcachefs/
+
BDISP ST MEDIA DRIVER
F: drivers/iio/accel/bma400*
BPF JIT for ARM
-S: Odd Fixes
+S: Maintained
F: arch/arm/net/
BPF JIT for ARM64
S: Odd Fixes
K: (?:\b|_)bpf(?:\b|_)
+BPF [NETKIT] (BPF-programmable network device)
+S: Supported
+F: drivers/net/netkit.c
+F: include/net/netkit.h
+
BPF [NETWORKING] (struct_ops, reuseport)
F: drivers/net/ethernet/broadcom/unimac.h
BROADCOM TG3 GIGABIT ETHERNET DRIVER
S: Supported
X: drivers/char/random.c
X: drivers/char/tpm/
+CHARGERLAB POWER-Z HARDWARE MONITOR DRIVER
+S: Maintained
+F: Documentation/hwmon/powerz.rst
+F: drivers/hwmon/powerz.c
+
CHECKPATCH
F: include/dt-bindings/sound/cs*
F: include/linux/mfd/cs42l43*
F: include/sound/cs*
+F: sound/pci/hda/cirrus*
F: sound/pci/hda/cs*
F: sound/pci/hda/hda_cs_dsp_ctl.*
F: sound/soc/codecs/cs*
F: Documentation/devicetree/bindings/timer/
F: drivers/clocksource/
+CLOSURES
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: include/linux/closure.h
+F: lib/closure.c
+
CMPC ACPI DRIVER
W: http://accessrunner.sourceforge.net/
F: drivers/usb/atm/cxacru.c
+CONFIDENTIAL COMPUTING THREAT MODEL FOR X86 VIRTUALIZATION (SNP/TDX)
+S: Maintained
+F: Documentation/security/snp-tdx-threat-model.rst
+
CONFIGFS
F: mm/memcontrol.c
F: mm/swap_cgroup.c
F: tools/testing/selftests/cgroup/memcg_protection.m
+ F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
F: tools/testing/selftests/cgroup/test_kmem.c
F: tools/testing/selftests/cgroup/test_memcontrol.c
S: Maintained
F: drivers/i2c/busses/i2c-cp2615.c
-CPMAC ETHERNET DRIVER
-S: Maintained
-F: drivers/net/ethernet/ti/cpmac.c
-
CPU FREQUENCY DRIVERS - VEXPRESS SPC ARM BIG LITTLE
S: Supported
F: Documentation/ABI/testing/sysfs-class-cxl
-F: Documentation/powerpc/cxl.rst
+F: Documentation/arch/powerpc/cxl.rst
F: arch/powerpc/platforms/powernv/pci-cxl.c
F: drivers/misc/cxl/
F: include/misc/cxl*
S: Supported
-F: Documentation/powerpc/cxlflash.rst
+F: Documentation/arch/powerpc/cxlflash.rst
F: drivers/scsi/cxlflash/
F: include/uapi/scsi/cxlflash_ioctl.h
DEVICE-MAPPER (LVM)
S: Maintained
W: http://sources.redhat.com/dm
Q: http://patchwork.kernel.org/project/dm-devel/list/
F: include/video/udlfb.h
DISTRIBUTED LOCK MANAGER (DLM)
-M: Christine Caulfield <ccaulfie@redhat.com>
+M: Alexander Aring <aahringo@redhat.com>
S: Supported
-W: http://sources.redhat.com/cluster/
+W: https://pagure.io/dlm
T: git git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git
F: fs/dlm/
S: Maintained
T: git git://anongit.freedesktop.org/drm/drm-misc
F: Documentation/driver-api/dma-buf.rst
+F: Documentation/userspace-api/dma-buf-alloc-exchange.rst
F: drivers/dma-buf/
F: include/linux/*fence.h
F: include/linux/dma-buf.h
F: drivers/net/ethernet/freescale/dpaa2/dpaa2-switch*
F: drivers/net/ethernet/freescale/dpaa2/dpsw*
+DPLL SUBSYSTEM
+S: Supported
+F: Documentation/driver-api/dpll.rst
+F: drivers/dpll/*
+F: include/linux/dpll.h
+F: include/uapi/linux/dpll.h
+
DRBD DRIVER
B: https://gitlab.freedesktop.org/drm/msm/-/issues
T: git https://gitlab.freedesktop.org/drm/msm.git
F: Documentation/devicetree/bindings/display/msm/
+F: drivers/gpu/drm/ci/xfails/msm*
F: drivers/gpu/drm/msm/
F: include/uapi/drm/msm_drm.h
S: Maintained
T: git git://anongit.freedesktop.org/drm/drm-misc
-F: Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml
+F: Documentation/devicetree/bindings/display/solomon,ssd-common.yaml
+F: Documentation/devicetree/bindings/display/solomon,ssd13*.yaml
F: drivers/gpu/drm/solomon/ssd130x*
DRM DRIVER FOR ST-ERICSSON MCDE
S: Maintained
W: https://01.org/linuxgraphics/gfx-docs/maintainer-tools/drm-misc.html
T: git git://anongit.freedesktop.org/drm/drm-misc
+F: Documentation/devicetree/bindings/display/
+F: Documentation/devicetree/bindings/gpu/
F: Documentation/gpu/
-F: drivers/gpu/drm/*
+F: drivers/gpu/drm/
F: drivers/gpu/vga/
-F: include/drm/drm*
+F: include/drm/drm
F: include/linux/vga*
-F: include/uapi/drm/drm*
+F: include/uapi/drm/
+X: drivers/gpu/drm/amd/
+X: drivers/gpu/drm/armada/
+X: drivers/gpu/drm/etnaviv/
+X: drivers/gpu/drm/exynos/
+X: drivers/gpu/drm/i915/
+X: drivers/gpu/drm/kmb/
+X: drivers/gpu/drm/mediatek/
+X: drivers/gpu/drm/msm/
+X: drivers/gpu/drm/nouveau/
+X: drivers/gpu/drm/radeon/
+X: drivers/gpu/drm/renesas/
+X: drivers/gpu/drm/tegra/
DRM DRIVERS FOR ALLWINNER A10
F: Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml
F: Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml
F: Documentation/gpu/meson.rst
+F: drivers/gpu/drm/ci/xfails/meson*
F: drivers/gpu/drm/meson/
DRM DRIVERS FOR ATMEL HLCDC
F: Documentation/devicetree/bindings/display/bridge/
F: drivers/gpu/drm/bridge/
F: drivers/gpu/drm/drm_bridge.c
+F: drivers/gpu/drm/drm_bridge_connector.c
F: include/drm/drm_bridge.h
+F: include/drm/drm_bridge_connector.h
DRM DRIVERS FOR EXYNOS
F: Documentation/devicetree/bindings/display/fsl,tcon.txt
F: drivers/gpu/drm/fsl-dcu/
-DRM DRIVERS FOR FREESCALE IMX
+DRM DRIVERS FOR FREESCALE IMX 5/6
S: Maintained
+T: git git://anongit.freedesktop.org/drm/drm-misc
+T: git git://git.pengutronix.de/git/pza/linux
F: Documentation/devicetree/bindings/display/imx/
F: drivers/gpu/drm/imx/ipuv3/
F: drivers/gpu/ipu-v3/
S: Maintained
-T: git git://github.com/patjak/drm-gma500
+T: git git://anongit.freedesktop.org/drm/drm-misc
F: drivers/gpu/drm/gma500/
DRM DRIVERS FOR HISILICON
S: Supported
F: Documentation/devicetree/bindings/display/mediatek/
+F: drivers/gpu/drm/ci/xfails/mediatek*
F: drivers/gpu/drm/mediatek/
F: drivers/phy/mediatek/phy-mtk-dp.c
F: drivers/phy/mediatek/phy-mtk-hdmi*
S: Maintained
T: git git://anongit.freedesktop.org/drm/drm-misc
F: Documentation/devicetree/bindings/display/rockchip/
+F: drivers/gpu/drm/ci/xfails/rockchip*
F: drivers/gpu/drm/rockchip/
DRM DRIVERS FOR STI
F: drivers/gpu/drm/xlnx/
DRM GPU SCHEDULER
-M: Luben Tuikov <luben.tuikov@amd.com>
+M: Luben Tuikov <ltuikov89@gmail.com>
S: Maintained
T: git git://anongit.freedesktop.org/drm/drm-misc
DRM PANEL DRIVERS
S: Maintained
FIRMWARE LOADER (request_firmware)
S: Maintained
F: Documentation/firmware_class/
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
-F: sound/usb/mixer_scarlett_gen2.c
+F: sound/usb/mixer_scarlett2.c
FORCEDETH GIGABIT ETHERNET DRIVER
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/hardening
F: Documentation/kbuild/gcc-plugins.rst
+F: include/linux/stackleak.h
+F: kernel/stackleak.c
F: scripts/Makefile.gcc-plugins
F: scripts/gcc-plugins/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm.git
F: drivers/pmdomain/
+GENERIC RADIX TREE
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: include/linux/generic-radix-tree.h
+F: lib/generic-radix-tree.c
+
GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
F: Documentation/ABI/testing/debugfs-driver-habanalabs
F: Documentation/ABI/testing/sysfs-driver-habanalabs
F: drivers/accel/habanalabs/
+F: include/linux/habanalabs/
F: include/trace/events/habanalabs.h
F: include/uapi/drm/habanalabs_accel.h
F: drivers/iio/pressure/mprls0025pa.c
HOST AP DRIVER
S: Obsolete
-W: http://w1.fi/hostap-driver.html
F: drivers/net/wireless/intersil/hostap/
HP BIOSCFG DRIVER
F: mm/hugetlb.c
F: mm/hugetlb_vmemmap.c
F: mm/hugetlb_vmemmap.h
+ F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
HVA ST MEDIA DRIVER
F: drivers/i3c/
F: include/linux/i3c/
-IA64 (Itanium) PLATFORM
-S: Orphan
-F: Documentation/arch/ia64/
-F: arch/ia64/
-
IBM Operation Panel Input Driver
INTEL BIOS SAR INT1092 DRIVER
S: Maintained
F: drivers/platform/x86/intel/int1092/
T: git git://anongit.freedesktop.org/drm-intel
F: Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
F: Documentation/gpu/i915.rst
+F: drivers/gpu/drm/ci/xfails/i915*
F: drivers/gpu/drm/i915/
F: include/drm/i915*
F: include/uapi/drm/i915_drm.h
S: Maintained
F: drivers/crypto/intel/ixp4xx/ixp4xx_crypto.c
-INTEL IXP4XX QMGR, NPE, ETHERNET and HSS SUPPORT
-S: Maintained
-F: drivers/net/ethernet/xscale/ixp4xx_eth.c
-F: drivers/net/wan/ixp4xx_hss.c
-F: drivers/soc/ixp4xx/ixp4xx-npe.c
-F: drivers/soc/ixp4xx/ixp4xx-qmgr.c
-F: include/linux/soc/ixp4xx/npe.h
-F: include/linux/soc/ixp4xx/qmgr.h
-
-INTEL IXP4XX RANDOM NUMBER GENERATOR SUPPORT
-S: Maintained
-F: Documentation/devicetree/bindings/rng/intel,ixp46x-rng.yaml
-F: drivers/char/hw_random/ixp4xx-rng.c
-
INTEL KEEM BAY DRM DRIVER
F: include/linux/mfd/intel-m10-bmc.h
INTEL MAX10 BMC SECURE UPDATES
-M: Russ Weight <russell.h.weight@intel.com>
+M: Peter Colberg <peter.colberg@intel.com>
S: Maintained
F: Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-sec-update
INTEL WWAN IOSM DRIVER
S: Maintained
F: drivers/net/wwan/iosm/
F: sound/soc/codecs/sma*
IRQ DOMAINS (IRQ NUMBER MAPPING LIBRARY)
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
F: Documentation/core-api/irq/irq-domain.rst
IRQCHIP DRIVERS
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
S: Supported
-W: http://www.linux-iscsi.org
T: git git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git master
F: drivers/infiniband/ulp/isert
KERNEL HARDENING (not covered by other areas)
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/hardening
F: Documentation/ABI/testing/sysfs-kernel-oops_count
F: Documentation/ABI/testing/sysfs-kernel-warn_count
+F: arch/*/configs/hardening.config
F: include/linux/overflow.h
F: include/linux/randomize_kstack.h
+F: kernel/configs/hardening.config
F: mm/usercopy.c
K: \b(add|choose)_random_kstack_offset\b
K: \b__check_(object_size|heap_object)\b
+K: \b__counted_by\b
KERNEL JANITORS
F: tools/testing/selftests/kvm/*/aarch64/
F: tools/testing/selftests/kvm/aarch64/
+KERNEL VIRTUAL MACHINE FOR LOONGARCH (KVM/LoongArch)
+S: Maintained
+T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
+F: arch/loongarch/include/asm/kvm*
+F: arch/loongarch/include/uapi/asm/kvm*
+F: arch/loongarch/kvm/
+
KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips)
F: arch/riscv/include/uapi/asm/kvm*
F: arch/riscv/kvm/
F: tools/testing/selftests/kvm/*/riscv/
+F: tools/testing/selftests/kvm/riscv/
KERNEL VIRTUAL MACHINE for s390 (KVM/s390)
F: Documentation/devicetree/bindings/i2c/i2c-opal.txt
F: Documentation/devicetree/bindings/powerpc/
F: Documentation/devicetree/bindings/rtc/rtc-opal.txt
-F: Documentation/powerpc/
+F: Documentation/arch/powerpc/
F: arch/powerpc/
F: drivers/*/*/*pasemi*
F: drivers/*/*pasemi*
F: drivers/hwmon/ltc2947-spi.c
F: drivers/hwmon/ltc2947.h
+LTC2991 HARDWARE MONITOR DRIVER
+S: Supported
+W: https://ez.analog.com/linux-software-drivers
+F: Documentation/devicetree/bindings/hwmon/adi,ltc2991.yaml
+F: drivers/hwmon/ltc2991.c
+
LTC2983 IIO TEMPERATURE DRIVER
MEDIATEK T7XX 5G WWAN MODEM DRIVER
MEGACHIPS STDPXXXX-GE-B850V3-FW LVDS/DP++ BRIDGES
S: Maintained
F: Documentation/devicetree/bindings/display/bridge/megachips-stdpxxxx-ge-b850v3-fw.txt
S: Maintained
F: drivers/staging/media/meson/vdec/
METHODE UDPU SUPPORT
-M: Vladimir Vid <vladimir.vid@sartura.hr>
+M: Robert Marko <robert.marko@sartura.hr>
S: Maintained
-F: arch/arm64/boot/dts/marvell/armada-3720-uDPU.dts
+F: arch/arm64/boot/dts/marvell/armada-3720-eDPU.dts
+F: arch/arm64/boot/dts/marvell/armada-3720-uDPU.*
MHI BUS
F: drivers/iio/adc/mcp3911.c
MICROCHIP MMC/SD/SDIO MCI DRIVER
S: Maintained
F: drivers/mmc/host/atmel-mci.c
S: Maintained
+F: Documentation/devicetree/bindings/*/loongson,ls1*.yaml
F: arch/mips/include/asm/mach-loongson32/
F: arch/mips/loongson32/
F: drivers/*/*loongson1*
+F: drivers/net/ethernet/stmicro/stmmac/dwmac-loongson1.c
MIPS/LOONGSON2EF ARCHITECTURE
T: git git://linuxtv.org/media_tree.git
F: drivers/media/radio/radio-miropcm20*
+MITSUMI MM8013 FG DRIVER
+F: Documentation/devicetree/bindings/power/supply/mitsumi,mm8013.yaml
+F: drivers/power/supply/mm8013.c
+
MMP SUPPORT
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git modules-next
F: include/linux/kmod.h
-F: include/linux/module.h
+F: include/linux/module*.h
F: kernel/module/
F: lib/test_kmod.c
F: scripts/module*
K: \bmdo_
NETWORKING [MPTCP]
B: https://github.com/multipath-tcp/mptcp_net-next/issues
T: git https://github.com/multipath-tcp/mptcp_net-next.git export-net
T: git https://github.com/multipath-tcp/mptcp_net-next.git export
+F: Documentation/netlink/specs/mptcp.yaml
F: Documentation/networking/mptcp-sysctl.rst
F: include/net/mptcp.h
F: include/trace/events/mptcp.h
-F: include/uapi/linux/mptcp.h
+F: include/uapi/linux/mptcp*.h
F: net/mptcp/
F: tools/testing/selftests/bpf/*/*mptcp*.c
F: tools/testing/selftests/net/mptcp/
S: Maintained
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc.git
F: tools/include/nolibc/
F: tools/testing/selftests/nolibc/
S: Maintained
+T: git git://anongit.freedesktop.org/drm/drm-misc
F: Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml
F: drivers/gpu/drm/imx/dcss/
S: Maintained
-F: Documentation/devicetree/bindings/sound/tfa9879.txt
+F: Documentation/devicetree/bindings/sound/nxp,tfa9879.yaml
F: sound/soc/codecs/tfa9879*
NXP-NCI NFC DRIVER
F: lib/objagg.c
F: lib/test_objagg.c
+OBJPOOL
+S: Supported
+F: include/linux/objpool.h
+F: lib/objpool.c
+F: lib/test_objpool.c
+
OBJTOOL
F: drivers/of/
F: include/linux/of*.h
F: scripts/dtc/
+F: tools/testing/selftests/dt/
K: of_overlay_notifier_
K: of_overlay_fdt_apply
K: of_overlay_remove
S: Maintained
F: drivers/i2c/muxes/i2c-mux-pca9541.c
-PCDP - PRIMARY CONSOLE AND DEBUG PORT
-S: Maintained
-F: drivers/firmware/pcdp.*
-
PCI DRIVER FOR AARDVARK (Marvell Armada 3700)
S: Maintained
F: Documentation/devicetree/bindings/pci/*rcar*
F: drivers/pci/controller/*rcar*
+F: drivers/pci/controller/dwc/*rcar*
PCI DRIVER FOR SAMSUNG EXYNOS
S: Supported
F: Documentation/PCI/pci-error-recovery.rst
-F: Documentation/powerpc/eeh-pci-error-recovery.rst
+F: Documentation/arch/powerpc/eeh-pci-error-recovery.rst
F: arch/powerpc/include/*/eeh*.h
F: arch/powerpc/kernel/eeh*.c
F: arch/powerpc/platforms/*/eeh*.c
S: Supported
+W: https://wireless.wiki.kernel.org/en/users/Drivers/ath12k
T: git git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git
F: drivers/net/wireless/ath/ath12k/
F: Documentation/devicetree/bindings/mtd/qcom,nandc.yaml
F: drivers/mtd/nand/raw/qcom_nandc.c
+QUALCOMM QSEECOM DRIVER
+S: Maintained
+F: drivers/firmware/qcom/qcom_qseecom.c
+
+QUALCOMM QSEECOM UEFISECAPP DRIVER
+S: Maintained
+F: drivers/firmware/qcom/qcom_qseecom_uefisecapp.c
+
QUALCOMM RMNET DRIVER
T: git https://gitlab.freedesktop.org/agd5f/linux.git
F: Documentation/gpu/amdgpu/
F: drivers/gpu/drm/amd/
+F: drivers/gpu/drm/ci/xfails/amd*
F: drivers/gpu/drm/radeon/
F: include/uapi/drm/amdgpu_drm.h
F: include/uapi/drm/radeon_drm.h
RALINK RT2X00 WIRELESS LAN DRIVER
S: Maintained
F: drivers/net/wireless/ralink/rt2x00/
S: Maintained
-W: https://wireless.wiki.kernel.org/
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git
F: drivers/net/wireless/realtek/rtlwifi/
REALTEK WIRELESS DRIVER (rtw88)
S: Supported
Q: https://patchwork.kernel.org/project/linux-riscv/list/
C: irc://irc.libera.chat/riscv
-P: Documentation/riscv/patch-acceptance.rst
+P: Documentation/arch/riscv/patch-acceptance.rst
T: git git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git
F: arch/riscv/
N: riscv
RTL8180 WIRELESS DRIVER
S: Orphan
-W: https://wireless.wiki.kernel.org/
F: drivers/net/wireless/realtek/rtl818x/rtl8180/
RTL8187 WIRELESS DRIVER
S: Maintained
-W: https://wireless.wiki.kernel.org/
F: drivers/net/wireless/realtek/rtl818x/rtl8187/
RTL8XXXU WIRELESS DRIVER (rtl8xxxu)
S: Maintained
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/jes/linux.git rtl8xxxu-devel
F: drivers/net/wireless/realtek/rtl8xxxu/
RTRS TRANSPORT DRIVERS
S: Supported
-W: https://github.com/Rust-for-Linux/linux
+W: https://rust-for-linux.com
B: https://github.com/Rust-for-Linux/linux/issues
C: zulip://rust-for-linux.zulipchat.com
+P: https://rust-for-linux.com/contributing
T: git https://github.com/Rust-for-Linux/linux.git rust-next
F: Documentation/rust/
F: rust/
S: Supported
-W: http://www.linux-iscsi.org
Q: https://patchwork.kernel.org/project/target-devel/list/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
F: Documentation/target/
F: drivers/mmc/host/sdhci*
SECURE DIGITAL HOST CONTROLLER INTERFACE (SDHCI) MICROCHIP DRIVER
S: Supported
F: drivers/mmc/host/sdhci-of-at91.c
SFCTEMP HWMON DRIVER
S: Maintained
F: Documentation/devicetree/bindings/hwmon/starfive,jh71x0-temp.yaml
F: drivers/platform/x86/sony-laptop.c
F: include/linux/sony-laptop.h
+SOPHGO DEVICETREES
+S: Maintained
+F: arch/riscv/boot/dts/sophgo/
+F: Documentation/devicetree/bindings/riscv/sophgo.yaml
+
SOUND
S: Maintained
W: http://www.alsa-project.org/
Q: http://patchwork.kernel.org/project/alsa-devel/list/
SOUND - ALSA SELFTESTS
S: Supported
F: tools/testing/selftests/alsa
SOUND - SOC LAYER / DYNAMIC AUDIO POWER MANAGEMENT (ASoC)
S: Supported
W: http://alsa-project.org/main/index.php/ASoC
T: git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
F: Documentation/sound/soc/
F: include/dt-bindings/sound/
F: include/sound/soc*
+F: include/sound/sof.h
+F: include/sound/sof/
+F: include/trace/events/sof*.h
+F: include/uapi/sound/asoc.h
F: sound/soc/
SOUND - SOUND OPEN FIRMWARE (SOF) DRIVERS
F: Documentation/devicetree/bindings/clock/starfive,jh7110-pll.yaml
F: drivers/clk/starfive/clk-starfive-jh7110-pll.c
+STARFIVE JH7110 PWMDAC DRIVER
+S: Supported
+F: Documentation/devicetree/bindings/sound/starfive,jh7110-pwmdac.yaml
+F: sound/soc/starfive/jh7110_pwmdac.c
+
STARFIVE JH7110 SYSCON
STARFIVE JH71X0 PINCTRL DRIVERS
S: Maintained
F: Documentation/devicetree/bindings/pinctrl/starfive,jh71*.yaml
STARFIVE JH71XX PMU CONTROLLER DRIVER
S: Supported
F: Documentation/devicetree/bindings/power/starfive*
-F: drivers/pmdomain/starfive/jh71xx-pmu.c
+F: drivers/pmdomain/starfive/
F: include/dt-bindings/power/starfive,jh7110-pmu.h
STARFIVE SOC DRIVERS
S: Maintained
T: git https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git/
F: Documentation/devicetree/bindings/soc/starfive/
-F: drivers/soc/starfive/
STARFIVE TRNG DRIVER
F: drivers/cpufreq/sc[mp]i-cpufreq.c
F: drivers/firmware/arm_scmi/
F: drivers/firmware/arm_scpi.c
+F: drivers/pmdomain/arm/
F: drivers/powercap/arm_scmi_powercap.c
F: drivers/regulator/scmi-regulator.c
F: drivers/reset/reset-scmi.c
THERMAL
S: Supported
Q: https://patchwork.kernel.org/project/linux-pm/list/
S: Orphan
W: https://wireless.wiki.kernel.org/en/users/Drivers/wl12xx
W: https://wireless.wiki.kernel.org/en/users/Drivers/wl1251
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/luca/wl12xx.git
F: drivers/net/wireless/ti/
TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER
F: arch/arm/boot/dts/imx*mba*.dts*
F: arch/arm/boot/dts/imx*tqma*.dts*
F: arch/arm/boot/dts/mba*.dtsi
+F: arch/arm64/boot/dts/freescale/fsl-*tqml*.dts*
F: arch/arm64/boot/dts/freescale/imx*mba*.dts*
F: arch/arm64/boot/dts/freescale/imx*tqma*.dts*
F: arch/arm64/boot/dts/freescale/mba*.dtsi
+F: arch/arm64/boot/dts/freescale/tqml*.dts*
F: drivers/gpio/gpio-tqmx86.c
F: drivers/mfd/tqmx86.c
F: drivers/watchdog/tqmx86_wdt.c
S: Maintained
T: git git://anongit.freedesktop.org/drm/drm-misc
+F: drivers/gpu/drm/ci/xfails/virtio*
F: drivers/gpu/drm/virtio/
F: include/uapi/linux/virtio_gpu.h
VIRTUAL PCM TEST DRIVER
-L: alsa-devel@alsa-project.org
S: Maintained
F: Documentation/sound/cards/pcmtest.rst
F: sound/drivers/pcmtest.c
F: drivers/scsi/vmw_pvscsi.h
VMWARE VIRTUAL PTP CLOCK DRIVER
-M: Deep Shah <sdeep@vmware.com>
+M: Jeff Sipek <jsipek@vmware.com>
F: drivers/gpio/gpio-xilinx.c
F: drivers/gpio/gpio-zynq.c
+XILINX LL TEMAC ETHERNET DRIVER
+S: Orphan
+F: drivers/net/ethernet/xilinx/ll_temac*
+
XILINX PWM DRIVER
S: Maintained
F: drivers/media/platform/xilinx/
F: include/uapi/linux/xilinx-v4l2-controls.h
+XILINX VERSAL EDAC DRIVER
+S: Maintained
+F: Documentation/devicetree/bindings/memory-controllers/xlnx,versal-ddrmc-edac.yaml
+F: drivers/edac/versal_edac.c
+
XILINX WATCHDOG DRIVER
EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
#endif
-void mte_sync_tags(pte_t pte)
+void mte_sync_tags(pte_t pte, unsigned int nr_pages)
{
struct page *page = pte_page(pte);
- long i, nr_pages = compound_nr(page);
+ unsigned int i;
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
struct page *page = get_user_page_vma_remote(mm, addr,
gup_flags, &vma);
- if (IS_ERR_OR_NULL(page)) {
- err = page == NULL ? -EIO : PTR_ERR(page);
+ if (IS_ERR(page)) {
+ err = PTR_ERR(page);
break;
}
asm volatile(__ASM_SIZE(btr) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
}
- static __always_inline bool
- arch_clear_bit_unlock_is_negative_byte(long nr, volatile unsigned long *addr)
+ static __always_inline bool arch_xor_unlock_is_negative_byte(unsigned long mask,
+ volatile unsigned long *addr)
{
bool negative;
- asm volatile(LOCK_PREFIX "andb %2,%1"
+ asm volatile(LOCK_PREFIX "xorb %2,%1"
CC_SET(s)
: CC_OUT(s) (negative), WBYTE_ADDR(addr)
- : "ir" ((char) ~(1 << nr)) : "memory");
+ : "iq" ((char)mask) : "memory");
return negative;
}
- #define arch_clear_bit_unlock_is_negative_byte \
- arch_clear_bit_unlock_is_negative_byte
+ #define arch_xor_unlock_is_negative_byte arch_xor_unlock_is_negative_byte
static __always_inline void
arch___clear_bit_unlock(long nr, volatile unsigned long *addr)
*/
static __always_inline unsigned long __fls(unsigned long word)
{
+ if (__builtin_constant_p(word))
+ return BITS_PER_LONG - 1 - __builtin_clzl(word);
+
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
{
int r;
+ if (__builtin_constant_p(x))
+ return x ? 32 - __builtin_clz(x) : 0;
+
#ifdef CONFIG_X86_64
/*
* AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
static __always_inline int fls64(__u64 x)
{
int bitpos = -1;
+
+ if (__builtin_constant_p(x))
+ return x ? 64 - __builtin_clzll(x) : 0;
/*
* AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
* dest reg is undefined if x==0, but their CPU architect says its
{
struct kvm_mmu_page *sp;
int ret = RET_PF_INVALID;
- u64 spte = 0ull;
- u64 *sptep = NULL;
+ u64 spte;
+ u64 *sptep;
uint retry_count = 0;
if (!page_fault_can_be_fast(fault))
else
sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
+ /*
+ * It's entirely possible for the mapping to have been zapped
+ * by a different task, but the root page should always be
+ * available as the vCPU holds a reference to its root(s).
+ */
+ if (WARN_ON_ONCE(!sptep))
+ spte = REMOVED_SPTE;
+
if (!is_shadow_present_pte(spte))
break;
}
#endif
-int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
{
/*
- * If the guest's MTRRs may be used to compute the "real" memtype,
- * restrict the mapping level to ensure KVM uses a consistent memtype
- * across the entire mapping. If the host MTRRs are ignored by TDP
- * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
- * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
- * from the guest's MTRRs so that guest accesses to memory that is
- * DMA'd aren't cached against the guest's wishes.
+ * If host MTRRs are ignored (shadow_memtype_mask is non-zero), and the
+ * VM has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is
+ * to honor the memtype from the guest's MTRRs so that guest accesses
+ * to memory that is DMA'd aren't cached against the guest's wishes.
*
* Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
* e.g. KVM will force UC memtype for host MMIO.
*/
- if (shadow_memtype_mask && kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
+ return vm_has_noncoherent_dma && shadow_memtype_mask;
+}
+
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ /*
+ * If the guest's MTRRs may be used to compute the "real" memtype,
+ * restrict the mapping level to ensure KVM uses a consistent memtype
+ * across the entire mapping.
+ */
+ if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
gfn_t base = gfn_round_for_level(fault->gfn,
return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
}
- static struct shrinker mmu_shrinker = {
- .count_objects = mmu_shrink_count,
- .scan_objects = mmu_shrink_scan,
- .seeks = DEFAULT_SEEKS * 10,
- };
+ static struct shrinker *mmu_shrinker;
static void mmu_destroy_caches(void)
{
if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
goto out;
- ret = register_shrinker(&mmu_shrinker, "x86-mmu");
- if (ret)
+ mmu_shrinker = shrinker_alloc(0, "x86-mmu");
+ if (!mmu_shrinker)
goto out_shrinker;
+ mmu_shrinker->count_objects = mmu_shrink_count;
+ mmu_shrinker->scan_objects = mmu_shrink_scan;
+ mmu_shrinker->seeks = DEFAULT_SEEKS * 10;
+
+ shrinker_register(mmu_shrinker);
+
return 0;
out_shrinker:
{
mmu_destroy_caches();
percpu_counter_destroy(&kvm_total_used_mmu_pages);
- unregister_shrinker(&mmu_shrinker);
+ shrinker_free(mmu_shrinker);
}
/*
#include <linux/slab.h>
#include <linux/acpi.h>
#include <linux/perf_event.h>
+#include <linux/platform_device.h>
#include <asm/mwait.h>
#include <xen/xen.h>
for_each_cpu(cpu, pad_busy_cpus)
cpumask_or(tmp, tmp, topology_sibling_cpumask(cpu));
cpumask_andnot(tmp, cpu_online_mask, tmp);
- /* avoid HT sibilings if possible */
+ /* avoid HT siblings if possible */
if (cpumask_empty(tmp))
cpumask_andnot(tmp, cpu_online_mask, pad_busy_cpus);
if (cpumask_empty(tmp)) {
static DEVICE_ATTR_RW(idlecpus);
-static int acpi_pad_add_sysfs(struct acpi_device *device)
-{
- int result;
-
- result = device_create_file(&device->dev, &dev_attr_idlecpus);
- if (result)
- return -ENODEV;
- result = device_create_file(&device->dev, &dev_attr_idlepct);
- if (result) {
- device_remove_file(&device->dev, &dev_attr_idlecpus);
- return -ENODEV;
- }
- result = device_create_file(&device->dev, &dev_attr_rrtime);
- if (result) {
- device_remove_file(&device->dev, &dev_attr_idlecpus);
- device_remove_file(&device->dev, &dev_attr_idlepct);
- return -ENODEV;
- }
- return 0;
-}
+static struct attribute *acpi_pad_attrs[] = {
+ &dev_attr_idlecpus.attr,
+ &dev_attr_idlepct.attr,
+ &dev_attr_rrtime.attr,
+ NULL
+};
-static void acpi_pad_remove_sysfs(struct acpi_device *device)
-{
- device_remove_file(&device->dev, &dev_attr_idlecpus);
- device_remove_file(&device->dev, &dev_attr_idlepct);
- device_remove_file(&device->dev, &dev_attr_rrtime);
-}
+ATTRIBUTE_GROUPS(acpi_pad);
/*
* Query firmware how many CPUs should be idle
static void acpi_pad_notify(acpi_handle handle, u32 event,
void *data)
{
- struct acpi_device *device = data;
+ struct acpi_device *adev = data;
switch (event) {
case ACPI_PROCESSOR_AGGREGATOR_NOTIFY:
acpi_pad_handle_notify(handle);
- acpi_bus_generate_netlink_event(device->pnp.device_class,
- dev_name(&device->dev), event, 0);
+ acpi_bus_generate_netlink_event(adev->pnp.device_class,
+ dev_name(&adev->dev), event, 0);
break;
default:
pr_warn("Unsupported event [0x%x]\n", event);
}
}
-static int acpi_pad_add(struct acpi_device *device)
+static int acpi_pad_probe(struct platform_device *pdev)
{
+ struct acpi_device *adev = ACPI_COMPANION(&pdev->dev);
acpi_status status;
- strcpy(acpi_device_name(device), ACPI_PROCESSOR_AGGREGATOR_DEVICE_NAME);
- strcpy(acpi_device_class(device), ACPI_PROCESSOR_AGGREGATOR_CLASS);
+ strcpy(acpi_device_name(adev), ACPI_PROCESSOR_AGGREGATOR_DEVICE_NAME);
+ strcpy(acpi_device_class(adev), ACPI_PROCESSOR_AGGREGATOR_CLASS);
- if (acpi_pad_add_sysfs(device))
- return -ENODEV;
+ status = acpi_install_notify_handler(adev->handle,
+ ACPI_DEVICE_NOTIFY, acpi_pad_notify, adev);
- status = acpi_install_notify_handler(device->handle,
- ACPI_DEVICE_NOTIFY, acpi_pad_notify, device);
- if (ACPI_FAILURE(status)) {
- acpi_pad_remove_sysfs(device);
+ if (ACPI_FAILURE(status))
return -ENODEV;
- }
return 0;
}
-static void acpi_pad_remove(struct acpi_device *device)
+static void acpi_pad_remove(struct platform_device *pdev)
{
+ struct acpi_device *adev = ACPI_COMPANION(&pdev->dev);
+
mutex_lock(&isolated_cpus_lock);
acpi_pad_idle_cpus(0);
mutex_unlock(&isolated_cpus_lock);
- acpi_remove_notify_handler(device->handle,
+ acpi_remove_notify_handler(adev->handle,
ACPI_DEVICE_NOTIFY, acpi_pad_notify);
- acpi_pad_remove_sysfs(device);
}
static const struct acpi_device_id pad_device_ids[] = {
};
MODULE_DEVICE_TABLE(acpi, pad_device_ids);
-static struct acpi_driver acpi_pad_driver = {
- .name = "processor_aggregator",
- .class = ACPI_PROCESSOR_AGGREGATOR_CLASS,
- .ids = pad_device_ids,
- .ops = {
- .add = acpi_pad_add,
- .remove = acpi_pad_remove,
+static struct platform_driver acpi_pad_driver = {
+ .probe = acpi_pad_probe,
+ .remove_new = acpi_pad_remove,
+ .driver = {
+ .dev_groups = acpi_pad_groups,
+ .name = "processor_aggregator",
+ .acpi_match_table = pad_device_ids,
},
};
if (power_saving_mwait_eax == 0)
return -EINVAL;
- return acpi_bus_register_driver(&acpi_pad_driver);
+ return platform_driver_register(&acpi_pad_driver);
}
static void __exit acpi_pad_exit(void)
{
- acpi_bus_unregister_driver(&acpi_pad_driver);
+ platform_driver_unregister(&acpi_pad_driver);
}
module_init(acpi_pad_init);
#include <linux/efi.h>
#include <linux/memblock.h>
#include <linux/spinlock.h>
+ #include <linux/crash_dump.h>
#include <asm/unaccepted_memory.h>
-/* Protects unaccepted memory bitmap */
+/* Protects unaccepted memory bitmap and accepting_list */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
+struct accept_range {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+};
+
+static LIST_HEAD(accepting_list);
+
/*
* accept_memory() -- Consult bitmap and accept the memory if needed.
*
{
struct efi_unaccepted_memory *unaccepted;
unsigned long range_start, range_end;
+ struct accept_range range, *entry;
unsigned long flags;
u64 unit_size;
if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
end = unaccepted->size * unit_size * BITS_PER_BYTE;
- range_start = start / unit_size;
-
+ range.start = start / unit_size;
+ range.end = DIV_ROUND_UP(end, unit_size);
+retry:
spin_lock_irqsave(&unaccepted_memory_lock, flags);
+
+ /*
+ * Check if anybody works on accepting the same range of the memory.
+ *
+ * The check is done with unit_size granularity. It is crucial to catch
+ * all accept requests to the same unit_size block, even if they don't
+ * overlap on physical address level.
+ */
+ list_for_each_entry(entry, &accepting_list, list) {
+ if (entry->end < range.start)
+ continue;
+ if (entry->start >= range.end)
+ continue;
+
+ /*
+ * Somebody else accepting the range. Or at least part of it.
+ *
+ * Drop the lock and retry until it is complete.
+ */
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+ goto retry;
+ }
+
+ /*
+ * Register that the range is about to be accepted.
+ * Make sure nobody else will accept it.
+ */
+ list_add(&range.list, &accepting_list);
+
+ range_start = range.start;
for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
- DIV_ROUND_UP(end, unit_size)) {
+ range.end) {
unsigned long phys_start, phys_end;
unsigned long len = range_end - range_start;
phys_start = range_start * unit_size + unaccepted->phys_base;
phys_end = range_end * unit_size + unaccepted->phys_base;
+ /*
+ * Keep interrupts disabled until the accept operation is
+ * complete in order to prevent deadlocks.
+ *
+ * Enabling interrupts before calling arch_accept_memory()
+ * creates an opportunity for an interrupt handler to request
+ * acceptance for the same memory. The handler will continuously
+ * spin with interrupts disabled, preventing other task from
+ * making progress with the acceptance process.
+ */
+ spin_unlock(&unaccepted_memory_lock);
+
arch_accept_memory(phys_start, phys_end);
+
+ spin_lock(&unaccepted_memory_lock);
bitmap_clear(unaccepted->bitmap, range_start, len);
}
+
+ list_del(&range.list);
spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
}
return ret;
}
+
+ #ifdef CONFIG_PROC_VMCORE
+ static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
+ unsigned long pfn)
+ {
+ return !pfn_is_unaccepted_memory(pfn);
+ }
+
+ static struct vmcore_cb vmcore_cb = {
+ .pfn_is_ram = unaccepted_memory_vmcore_pfn_is_ram,
+ };
+
+ static int __init unaccepted_memory_init_kdump(void)
+ {
+ register_vmcore_cb(&vmcore_cb);
+ return 0;
+ }
+ core_initcall(unaccepted_memory_init_kdump);
+ #endif /* CONFIG_PROC_VMCORE */
#include <linux/vmalloc.h>
#include "gt/intel_gt_requests.h"
+#include "gt/intel_gt.h"
#include "i915_trace.h"
intel_wakeref_t wakeref = 0;
unsigned long count = 0;
unsigned long scanned = 0;
- int err = 0;
+ int err = 0, i = 0;
+ struct intel_gt *gt;
/* CHV + VTD workaround use stop_machine(); need to trylock vm->mutex */
bool trylock_vm = !ww && intel_vm_no_concurrent_access_wa(i915);
* what we can do is give them a kick so that we do not keep idle
* contexts around longer than is necessary.
*/
- if (shrink & I915_SHRINK_ACTIVE)
- /* Retire requests to unpin all idle contexts */
- intel_gt_retire_requests(to_gt(i915));
+ if (shrink & I915_SHRINK_ACTIVE) {
+ for_each_gt(gt, i915, i)
+ /* Retire requests to unpin all idle contexts */
+ intel_gt_retire_requests(gt);
+ }
/*
* As we may completely rewrite the (un)bound list whilst unbinding
static unsigned long
i915_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
{
- struct drm_i915_private *i915 =
- container_of(shrinker, struct drm_i915_private, mm.shrinker);
+ struct drm_i915_private *i915 = shrinker->private_data;
unsigned long num_objects;
unsigned long count;
if (num_objects) {
unsigned long avg = 2 * count / num_objects;
- i915->mm.shrinker.batch =
- max((i915->mm.shrinker.batch + avg) >> 1,
+ i915->mm.shrinker->batch =
+ max((i915->mm.shrinker->batch + avg) >> 1,
128ul /* default SHRINK_BATCH */);
}
static unsigned long
i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
- struct drm_i915_private *i915 =
- container_of(shrinker, struct drm_i915_private, mm.shrinker);
+ struct drm_i915_private *i915 = shrinker->private_data;
unsigned long freed;
sc->nr_scanned = 0;
struct i915_vma *vma, *next;
unsigned long freed_pages = 0;
intel_wakeref_t wakeref;
+ struct intel_gt *gt;
+ int i;
with_intel_runtime_pm(&i915->runtime_pm, wakeref)
freed_pages += i915_gem_shrink(NULL, i915, -1UL, NULL,
I915_SHRINK_VMAPS);
/* We also want to clear any cached iomaps as they wrap vmap */
- mutex_lock(&to_gt(i915)->ggtt->vm.mutex);
- list_for_each_entry_safe(vma, next,
- &to_gt(i915)->ggtt->vm.bound_list, vm_link) {
- unsigned long count = i915_vma_size(vma) >> PAGE_SHIFT;
- struct drm_i915_gem_object *obj = vma->obj;
-
- if (!vma->iomap || i915_vma_is_active(vma))
- continue;
+ for_each_gt(gt, i915, i) {
+ mutex_lock(>->ggtt->vm.mutex);
+ list_for_each_entry_safe(vma, next,
+ >->ggtt->vm.bound_list, vm_link) {
+ unsigned long count = i915_vma_size(vma) >> PAGE_SHIFT;
+ struct drm_i915_gem_object *obj = vma->obj;
+
+ if (!vma->iomap || i915_vma_is_active(vma))
+ continue;
- if (!i915_gem_object_trylock(obj, NULL))
- continue;
+ if (!i915_gem_object_trylock(obj, NULL))
+ continue;
- if (__i915_vma_unbind(vma) == 0)
- freed_pages += count;
+ if (__i915_vma_unbind(vma) == 0)
+ freed_pages += count;
- i915_gem_object_unlock(obj);
+ i915_gem_object_unlock(obj);
+ }
+ mutex_unlock(>->ggtt->vm.mutex);
}
- mutex_unlock(&to_gt(i915)->ggtt->vm.mutex);
*(unsigned long *)ptr += freed_pages;
return NOTIFY_DONE;
void i915_gem_driver_register__shrinker(struct drm_i915_private *i915)
{
- i915->mm.shrinker.scan_objects = i915_gem_shrinker_scan;
- i915->mm.shrinker.count_objects = i915_gem_shrinker_count;
- i915->mm.shrinker.seeks = DEFAULT_SEEKS;
- i915->mm.shrinker.batch = 4096;
- drm_WARN_ON(&i915->drm, register_shrinker(&i915->mm.shrinker,
- "drm-i915_gem"));
+ i915->mm.shrinker = shrinker_alloc(0, "drm-i915_gem");
+ if (!i915->mm.shrinker) {
+ drm_WARN_ON(&i915->drm, 1);
+ } else {
+ i915->mm.shrinker->scan_objects = i915_gem_shrinker_scan;
+ i915->mm.shrinker->count_objects = i915_gem_shrinker_count;
+ i915->mm.shrinker->batch = 4096;
+ i915->mm.shrinker->private_data = i915;
+
+ shrinker_register(i915->mm.shrinker);
+ }
i915->mm.oom_notifier.notifier_call = i915_gem_shrinker_oom;
drm_WARN_ON(&i915->drm, register_oom_notifier(&i915->mm.oom_notifier));
unregister_vmap_purge_notifier(&i915->mm.vmap_notifier));
drm_WARN_ON(&i915->drm,
unregister_oom_notifier(&i915->mm.oom_notifier));
- unregister_shrinker(&i915->mm.shrinker);
+ shrinker_free(i915->mm.shrinker);
}
void i915_gem_shrinker_taints_mutex(struct drm_i915_private *i915,
struct notifier_block oom_notifier;
struct notifier_block vmap_notifier;
- struct shrinker shrinker;
+ struct shrinker *shrinker;
#ifdef CONFIG_MMU_NOTIFIER
/**
bool mchbar_need_disable;
} gmch;
- struct rb_root uabi_engines;
+ /*
+ * Chaining user engines happens in multiple stages, starting with a
+ * simple lock-less linked list created by intel_engine_add_user(),
+ * which later gets sorted and converted to an intermediate regular
+ * list, just to be converted once again to its final rb tree structure
+ * in intel_engines_driver_register().
+ *
+ * Make sure to use the right iterator helper, depending on if the code
+ * in question runs before or after intel_engines_driver_register() --
+ * for_each_uabi_engine() can only be used afterwards!
+ */
+ union {
+ struct llist_head uabi_engines_llist;
+ struct list_head uabi_engines_list;
+ struct rb_root uabi_engines;
+ };
unsigned int engine_uabi_class_count[I915_LAST_UABI_ENGINE_CLASS + 1];
/* protects the irq masks */
struct i915_hwmon *hwmon;
- /* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
- struct intel_gt gt0;
-
- /*
- * i915->gt[0] == &i915->gt0
- */
struct intel_gt *gt[I915_MAX_GT];
struct kobject *sysfs_gt;
return pci_get_drvdata(pdev);
}
-static inline struct intel_gt *to_gt(struct drm_i915_private *i915)
+static inline struct intel_gt *to_gt(const struct drm_i915_private *i915)
{
- return &i915->gt0;
+ return i915->gt[0];
}
/* Simple iterator over all initialised engines */
#define INTEL_INFO(i915) ((i915)->__info)
#define RUNTIME_INFO(i915) (&(i915)->__runtime)
-#define DISPLAY_INFO(i915) ((i915)->display.info.__device_info)
-#define DISPLAY_RUNTIME_INFO(i915) (&(i915)->display.info.__runtime_info)
#define DRIVER_CAPS(i915) (&(i915)->caps)
#define INTEL_DEVID(i915) (RUNTIME_INFO(i915)->device_id)
#define IS_MEDIA_VER(i915, from, until) \
(MEDIA_VER(i915) >= (from) && MEDIA_VER(i915) <= (until))
-#define DISPLAY_VER(i915) (DISPLAY_RUNTIME_INFO(i915)->ip.ver)
-#define IS_DISPLAY_VER(i915, from, until) \
- (DISPLAY_VER(i915) >= (from) && DISPLAY_VER(i915) <= (until))
-
#define INTEL_REVID(i915) (to_pci_dev((i915)->drm.dev)->revision)
#define INTEL_DISPLAY_STEP(__i915) (RUNTIME_INFO(__i915)->step.display_step)
#define IS_PONTEVECCHIO(i915) IS_PLATFORM(i915, INTEL_PONTEVECCHIO)
#define IS_METEORLAKE(i915) IS_PLATFORM(i915, INTEL_METEORLAKE)
-#define IS_METEORLAKE_M(i915) \
- IS_SUBPLATFORM(i915, INTEL_METEORLAKE, INTEL_SUBPLATFORM_M)
-#define IS_METEORLAKE_P(i915) \
- IS_SUBPLATFORM(i915, INTEL_METEORLAKE, INTEL_SUBPLATFORM_P)
#define IS_DG2_G10(i915) \
IS_SUBPLATFORM(i915, INTEL_DG2, INTEL_SUBPLATFORM_G10)
#define IS_DG2_G11(i915) \
#define IS_TIGERLAKE_UY(i915) \
IS_SUBPLATFORM(i915, INTEL_TIGERLAKE, INTEL_SUBPLATFORM_UY)
-
-
-
-
-
-
-
#define IS_XEHPSDV_GRAPHICS_STEP(__i915, since, until) \
(IS_XEHPSDV(__i915) && IS_GRAPHICS_STEP(__i915, since, until))
-#define IS_MTL_GRAPHICS_STEP(__i915, variant, since, until) \
- (IS_SUBPLATFORM(__i915, INTEL_METEORLAKE, INTEL_SUBPLATFORM_##variant) && \
- IS_GRAPHICS_STEP(__i915, since, until))
-
-#define IS_MTL_DISPLAY_STEP(__i915, since, until) \
- (IS_METEORLAKE(__i915) && \
- IS_DISPLAY_STEP(__i915, since, until))
-
-#define IS_MTL_MEDIA_STEP(__i915, since, until) \
- (IS_METEORLAKE(__i915) && \
- IS_MEDIA_STEP(__i915, since, until))
-
-/*
- * DG2 hardware steppings are a bit unusual. The hardware design was forked to
- * create three variants (G10, G11, and G12) which each have distinct
- * workaround sets. The G11 and G12 forks of the DG2 design reset the GT
- * stepping back to "A0" for their first iterations, even though they're more
- * similar to a G10 B0 stepping and G10 C0 stepping respectively in terms of
- * functionality and workarounds. However the display stepping does not reset
- * in the same manner --- a specific stepping like "B0" has a consistent
- * meaning regardless of whether it belongs to a G10, G11, or G12 DG2.
- *
- * TLDR: All GT workarounds and stepping-specific logic must be applied in
- * relation to a specific subplatform (G10/G11/G12), whereas display workarounds
- * and stepping-specific logic will be applied with a general DG2-wide stepping
- * number.
- */
-#define IS_DG2_GRAPHICS_STEP(__i915, variant, since, until) \
- (IS_SUBPLATFORM(__i915, INTEL_DG2, INTEL_SUBPLATFORM_##variant) && \
- IS_GRAPHICS_STEP(__i915, since, until))
-
-#define IS_DG2_DISPLAY_STEP(__i915, since, until) \
- (IS_DG2(__i915) && \
- IS_DISPLAY_STEP(__i915, since, until))
-
#define IS_PVC_BD_STEP(__i915, since, until) \
(IS_PONTEVECCHIO(__i915) && \
IS_BASEDIE_STEP(__i915, since, until))
#define CMDPARSER_USES_GGTT(i915) (GRAPHICS_VER(i915) == 7)
#define HAS_LLC(i915) (INTEL_INFO(i915)->has_llc)
-#define HAS_4TILE(i915) (INTEL_INFO(i915)->has_4tile)
#define HAS_SNOOP(i915) (INTEL_INFO(i915)->has_snoop)
#define HAS_EDRAM(i915) ((i915)->edram_size_mb)
#define HAS_SECURE_BATCHES(i915) (GRAPHICS_VER(i915) < 6)
#define NUM_L3_SLICES(i915) (IS_HASWELL_GT3(i915) ? \
2 : HAS_L3_DPF(i915))
-/* Only valid when HAS_DISPLAY() is true */
-#define INTEL_DISPLAY_ENABLED(i915) \
- (drm_WARN_ON(&(i915)->drm, !HAS_DISPLAY(i915)), \
- !(i915)->params.disable_display && \
- !intel_opregion_headless_sku(i915))
-
#define HAS_GUC_DEPRIVILEGE(i915) \
(INTEL_INFO(i915)->has_guc_deprivilege)
+#define HAS_GUC_TLB_INVALIDATION(i915) (INTEL_INFO(i915)->has_guc_tlb_invalidation)
+
#define HAS_3D_PIPELINE(i915) (INTEL_INFO(i915)->has_3d_pipeline)
#define HAS_ONE_EU_PER_FUSE_BIT(i915) (INTEL_INFO(i915)->has_one_eu_per_fuse_bit)
#include <linux/dma-mapping.h>
#include <linux/fault-inject.h>
-#include <linux/kthread.h>
#include <linux/of_address.h>
-#include <linux/sched/mm.h>
#include <linux/uaccess.h>
-#include <uapi/linux/sched/types.h>
-#include <drm/drm_aperture.h>
-#include <drm/drm_bridge.h>
#include <drm/drm_drv.h>
#include <drm/drm_file.h>
#include <drm/drm_ioctl.h>
-#include <drm/drm_prime.h>
#include <drm/drm_of.h>
-#include <drm/drm_vblank.h>
-#include "disp/msm_disp_snapshot.h"
#include "msm_drv.h"
#include "msm_debugfs.h"
-#include "msm_fence.h"
-#include "msm_gem.h"
-#include "msm_gpu.h"
#include "msm_kms.h"
-#include "msm_mmu.h"
#include "adreno/adreno_gpu.h"
/*
static void msm_deinit_vram(struct drm_device *ddev);
-static const struct drm_mode_config_funcs mode_config_funcs = {
- .fb_create = msm_framebuffer_create,
- .atomic_check = msm_atomic_check,
- .atomic_commit = drm_atomic_helper_commit,
-};
-
-static const struct drm_mode_config_helper_funcs mode_config_helper_funcs = {
- .atomic_commit_tail = msm_atomic_commit_tail,
-};
-
static char *vram = "16m";
MODULE_PARM_DESC(vram, "Configure VRAM size (for devices without IOMMU/GPUMMU)");
module_param(vram, charp, 0);
DECLARE_FAULT_ATTR(fail_gem_iova);
#endif
-static irqreturn_t msm_irq(int irq, void *arg)
-{
- struct drm_device *dev = arg;
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
-
- BUG_ON(!kms);
-
- return kms->funcs->irq(kms);
-}
-
-static void msm_irq_preinstall(struct drm_device *dev)
-{
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
-
- BUG_ON(!kms);
-
- kms->funcs->irq_preinstall(kms);
-}
-
-static int msm_irq_postinstall(struct drm_device *dev)
-{
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
-
- BUG_ON(!kms);
-
- if (kms->funcs->irq_postinstall)
- return kms->funcs->irq_postinstall(kms);
-
- return 0;
-}
-
-static int msm_irq_install(struct drm_device *dev, unsigned int irq)
-{
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
- int ret;
-
- if (irq == IRQ_NOTCONNECTED)
- return -ENOTCONN;
-
- msm_irq_preinstall(dev);
-
- ret = request_irq(irq, msm_irq, 0, dev->driver->name, dev);
- if (ret)
- return ret;
-
- kms->irq_requested = true;
-
- ret = msm_irq_postinstall(dev);
- if (ret) {
- free_irq(irq, dev);
- return ret;
- }
-
- return 0;
-}
-
-static void msm_irq_uninstall(struct drm_device *dev)
-{
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
-
- kms->funcs->irq_uninstall(kms);
- if (kms->irq_requested)
- free_irq(kms->irq, dev);
-}
-
-struct msm_vblank_work {
- struct work_struct work;
- struct drm_crtc *crtc;
- bool enable;
- struct msm_drm_private *priv;
-};
-
-static void vblank_ctrl_worker(struct work_struct *work)
-{
- struct msm_vblank_work *vbl_work = container_of(work,
- struct msm_vblank_work, work);
- struct msm_drm_private *priv = vbl_work->priv;
- struct msm_kms *kms = priv->kms;
-
- if (vbl_work->enable)
- kms->funcs->enable_vblank(kms, vbl_work->crtc);
- else
- kms->funcs->disable_vblank(kms, vbl_work->crtc);
-
- kfree(vbl_work);
-}
-
-static int vblank_ctrl_queue_work(struct msm_drm_private *priv,
- struct drm_crtc *crtc, bool enable)
-{
- struct msm_vblank_work *vbl_work;
-
- vbl_work = kzalloc(sizeof(*vbl_work), GFP_ATOMIC);
- if (!vbl_work)
- return -ENOMEM;
-
- INIT_WORK(&vbl_work->work, vblank_ctrl_worker);
-
- vbl_work->crtc = crtc;
- vbl_work->enable = enable;
- vbl_work->priv = priv;
-
- queue_work(priv->wq, &vbl_work->work);
-
- return 0;
-}
-
static int msm_drm_uninit(struct device *dev)
{
struct platform_device *pdev = to_platform_device(dev);
struct msm_drm_private *priv = platform_get_drvdata(pdev);
struct drm_device *ddev = priv->dev;
- struct msm_kms *kms = priv->kms;
- int i;
/*
* Shutdown the hw if we're far enough along where things might be on.
*/
if (ddev->registered) {
drm_dev_unregister(ddev);
- drm_atomic_helper_shutdown(ddev);
+ if (priv->kms)
+ drm_atomic_helper_shutdown(ddev);
}
/* We must cancel and cleanup any pending vblank enable/disable
flush_workqueue(priv->wq);
- /* clean up event worker threads */
- for (i = 0; i < priv->num_crtcs; i++) {
- if (priv->event_thread[i].worker)
- kthread_destroy_worker(priv->event_thread[i].worker);
- }
-
msm_gem_shrinker_cleanup(ddev);
- drm_kms_helper_poll_fini(ddev);
-
msm_perf_debugfs_cleanup(priv);
msm_rd_debugfs_cleanup(priv);
- if (kms)
- msm_disp_snapshot_destroy(ddev);
-
- drm_mode_config_cleanup(ddev);
-
- for (i = 0; i < priv->num_bridges; i++)
- drm_bridge_remove(priv->bridges[i]);
- priv->num_bridges = 0;
-
- if (kms) {
- pm_runtime_get_sync(dev);
- msm_irq_uninstall(ddev);
- pm_runtime_put_sync(dev);
- }
-
- if (kms && kms->funcs)
- kms->funcs->destroy(kms);
+ if (priv->kms)
+ msm_drm_kms_uninit(dev);
msm_deinit_vram(ddev);
return 0;
}
-struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
-{
- struct msm_gem_address_space *aspace;
- struct msm_mmu *mmu;
- struct device *mdp_dev = dev->dev;
- struct device *mdss_dev = mdp_dev->parent;
- struct device *iommu_dev;
-
- /*
- * IOMMUs can be a part of MDSS device tree binding, or the
- * MDP/DPU device.
- */
- if (device_iommu_mapped(mdp_dev))
- iommu_dev = mdp_dev;
- else
- iommu_dev = mdss_dev;
-
- mmu = msm_iommu_new(iommu_dev, 0);
- if (IS_ERR(mmu))
- return ERR_CAST(mmu);
-
- if (!mmu) {
- drm_info(dev, "no IOMMU, fallback to phys contig buffers for scanout\n");
- return NULL;
- }
-
- aspace = msm_gem_address_space_create(mmu, "mdp_kms",
- 0x1000, 0x100000000 - 0x1000);
- if (IS_ERR(aspace)) {
- dev_err(mdp_dev, "aspace create, error %pe\n", aspace);
- mmu->funcs->destroy(mmu);
- }
-
- return aspace;
-}
-
bool msm_use_mmu(struct drm_device *dev)
{
struct msm_drm_private *priv = dev->dev_private;
{
struct msm_drm_private *priv = dev_get_drvdata(dev);
struct drm_device *ddev;
- struct msm_kms *kms;
- struct drm_crtc *crtc;
int ret;
if (drm_firmware_drivers_only())
might_lock(&priv->lru.lock);
fs_reclaim_release(GFP_KERNEL);
- drm_mode_config_init(ddev);
+ if (priv->kms_init) {
+ ret = drmm_mode_config_init(ddev);
+ if (ret)
+ goto err_destroy_wq;
+ }
ret = msm_init_vram(ddev);
if (ret)
- goto err_cleanup_mode_config;
+ goto err_destroy_wq;
dma_set_max_seg_size(dev, UINT_MAX);
if (ret)
goto err_deinit_vram;
- msm_gem_shrinker_init(ddev);
- /* the fw fb could be anywhere in memory */
- ret = drm_aperture_remove_framebuffers(drv);
- if (ret)
- goto err_msm_uninit;
-
+ ret = msm_gem_shrinker_init(ddev);
+ if (ret)
+ goto err_msm_uninit;
if (priv->kms_init) {
- ret = priv->kms_init(ddev);
- if (ret) {
- DRM_DEV_ERROR(dev, "failed to load kms\n");
- priv->kms = NULL;
+ ret = msm_drm_kms_init(dev, drv);
+ if (ret)
goto err_msm_uninit;
- }
- kms = priv->kms;
} else {
/* valid only for the dummy headless case, where of_node=NULL */
WARN_ON(dev->of_node);
- kms = NULL;
- }
-
- /* Enable normalization of plane zpos */
- ddev->mode_config.normalize_zpos = true;
-
- if (kms) {
- kms->dev = ddev;
- ret = kms->funcs->hw_init(kms);
- if (ret) {
- DRM_DEV_ERROR(dev, "kms hw init failed: %d\n", ret);
- goto err_msm_uninit;
- }
- }
-
- drm_helper_move_panel_connectors_to_head(ddev);
-
- ddev->mode_config.funcs = &mode_config_funcs;
- ddev->mode_config.helper_private = &mode_config_helper_funcs;
-
- drm_for_each_crtc(crtc, ddev) {
- struct msm_drm_thread *ev_thread;
-
- /* initialize event thread */
- ev_thread = &priv->event_thread[drm_crtc_index(crtc)];
- ev_thread->dev = ddev;
- ev_thread->worker = kthread_create_worker(0, "crtc_event:%d", crtc->base.id);
- if (IS_ERR(ev_thread->worker)) {
- ret = PTR_ERR(ev_thread->worker);
- DRM_DEV_ERROR(dev, "failed to create crtc_event kthread\n");
- ev_thread->worker = NULL;
- goto err_msm_uninit;
- }
-
- sched_set_fifo(ev_thread->worker->task);
- }
-
- ret = drm_vblank_init(ddev, priv->num_crtcs);
- if (ret < 0) {
- DRM_DEV_ERROR(dev, "failed to initialize vblank\n");
- goto err_msm_uninit;
- }
-
- if (kms) {
- pm_runtime_get_sync(dev);
- ret = msm_irq_install(ddev, kms->irq);
- pm_runtime_put_sync(dev);
- if (ret < 0) {
- DRM_DEV_ERROR(dev, "failed to install IRQ handler\n");
- goto err_msm_uninit;
- }
+ ddev->driver_features &= ~DRIVER_MODESET;
+ ddev->driver_features &= ~DRIVER_ATOMIC;
}
ret = drm_dev_register(ddev, 0);
if (ret)
goto err_msm_uninit;
- if (kms) {
- ret = msm_disp_snapshot_init(ddev);
- if (ret)
- DRM_DEV_ERROR(dev, "msm_disp_snapshot_init failed ret = %d\n", ret);
- }
- drm_mode_config_reset(ddev);
-
ret = msm_debugfs_late_init(ddev);
if (ret)
goto err_msm_uninit;
drm_kms_helper_poll_init(ddev);
- if (kms)
+ if (priv->kms_init) {
+ drm_kms_helper_poll_init(ddev);
msm_fbdev_setup(ddev);
+ }
return 0;
err_deinit_vram:
msm_deinit_vram(ddev);
-err_cleanup_mode_config:
- drm_mode_config_cleanup(ddev);
+err_destroy_wq:
destroy_workqueue(priv->wq);
err_put_dev:
drm_dev_put(ddev);
context_close(ctx);
}
-int msm_crtc_enable_vblank(struct drm_crtc *crtc)
-{
- struct drm_device *dev = crtc->dev;
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
- if (!kms)
- return -ENXIO;
- drm_dbg_vbl(dev, "crtc=%u", crtc->base.id);
- return vblank_ctrl_queue_work(priv, crtc, true);
-}
-
-void msm_crtc_disable_vblank(struct drm_crtc *crtc)
-{
- struct drm_device *dev = crtc->dev;
- struct msm_drm_private *priv = dev->dev_private;
- struct msm_kms *kms = priv->kms;
- if (!kms)
- return;
- drm_dbg_vbl(dev, "crtc=%u", crtc->base.id);
- vblank_ctrl_queue_work(priv, crtc, false);
-}
-
/*
* DRM ioctls:
*/
.patchlevel = MSM_VERSION_PATCHLEVEL,
};
-int msm_pm_prepare(struct device *dev)
-{
- struct msm_drm_private *priv = dev_get_drvdata(dev);
- struct drm_device *ddev = priv ? priv->dev : NULL;
-
- if (!priv || !priv->kms)
- return 0;
-
- return drm_mode_config_helper_suspend(ddev);
-}
-
-void msm_pm_complete(struct device *dev)
-{
- struct msm_drm_private *priv = dev_get_drvdata(dev);
- struct drm_device *ddev = priv ? priv->dev : NULL;
-
- if (!priv || !priv->kms)
- return;
-
- drm_mode_config_helper_resume(ddev);
-}
-
-static const struct dev_pm_ops msm_pm_ops = {
- .prepare = msm_pm_prepare,
- .complete = msm_pm_complete,
-};
-
/*
* Componentized driver support:
*/
};
int msm_drv_probe(struct device *master_dev,
- int (*kms_init)(struct drm_device *dev))
+ int (*kms_init)(struct drm_device *dev),
+ struct msm_kms *kms)
{
struct msm_drm_private *priv;
struct component_match *match = NULL;
if (!priv)
return -ENOMEM;
+ priv->kms = kms;
priv->kms_init = kms_init;
dev_set_drvdata(master_dev, priv);
static int msm_pdev_probe(struct platform_device *pdev)
{
- return msm_drv_probe(&pdev->dev, NULL);
+ return msm_drv_probe(&pdev->dev, NULL, NULL);
}
-static int msm_pdev_remove(struct platform_device *pdev)
+static void msm_pdev_remove(struct platform_device *pdev)
{
component_master_del(&pdev->dev, &msm_drm_ops);
-
- return 0;
-}
-
-void msm_drv_shutdown(struct platform_device *pdev)
-{
- struct msm_drm_private *priv = platform_get_drvdata(pdev);
- struct drm_device *drm = priv ? priv->dev : NULL;
-
- /*
- * Shutdown the hw if we're far enough along where things might be on.
- * If we run this too early, we'll end up panicking in any variety of
- * places. Since we don't register the drm device until late in
- * msm_drm_init, drm_dev->registered is used as an indicator that the
- * shutdown will be successful.
- */
- if (drm && drm->registered && priv->kms)
- drm_atomic_helper_shutdown(drm);
}
static struct platform_driver msm_platform_driver = {
.probe = msm_pdev_probe,
- .remove = msm_pdev_remove,
- .shutdown = msm_drv_shutdown,
+ .remove_new = msm_pdev_remove,
.driver = {
.name = "msm",
- .pm = &msm_pm_ops,
},
};
struct msm_drm_thread event_thread[MAX_CRTCS];
- unsigned int num_bridges;
- struct drm_bridge *bridges[MAX_BRIDGES];
-
/* VRAM carveout, used when no IOMMU: */
struct {
unsigned long size;
} vram;
struct notifier_block vmap_notifier;
- struct shrinker shrinker;
+ struct shrinker *shrinker;
struct drm_atomic_state *pm_state;
unsigned long msm_gem_shrinker_shrink(struct drm_device *dev, unsigned long nr_to_scan);
#endif
- void msm_gem_shrinker_init(struct drm_device *dev);
+ int msm_gem_shrinker_init(struct drm_device *dev);
void msm_gem_shrinker_cleanup(struct drm_device *dev);
struct sg_table *msm_gem_prime_get_sg_table(struct drm_gem_object *obj);
bool msm_dsi_is_cmd_mode(struct msm_dsi *msm_dsi);
bool msm_dsi_is_bonded_dsi(struct msm_dsi *msm_dsi);
bool msm_dsi_is_master_dsi(struct msm_dsi *msm_dsi);
+bool msm_dsi_wide_bus_enabled(struct msm_dsi *msm_dsi);
struct drm_dsc_config *msm_dsi_get_dsc_config(struct msm_dsi *msm_dsi);
#else
static inline void __init msm_dsi_register(void)
{
return false;
}
+static inline bool msm_dsi_wide_bus_enabled(struct msm_dsi *msm_dsi)
+{
+ return false;
+}
static inline struct drm_dsc_config *msm_dsi_get_dsc_config(struct msm_dsi *msm_dsi)
{
extern const struct component_master_ops msm_drm_ops;
-int msm_pm_prepare(struct device *dev);
-void msm_pm_complete(struct device *dev);
+int msm_kms_pm_prepare(struct device *dev);
+void msm_kms_pm_complete(struct device *dev);
int msm_drv_probe(struct device *dev,
- int (*kms_init)(struct drm_device *dev));
-void msm_drv_shutdown(struct platform_device *pdev);
+ int (*kms_init)(struct drm_device *dev),
+ struct msm_kms *kms);
+void msm_kms_shutdown(struct platform_device *pdev);
#endif /* __MSM_DRV_H__ */
struct list_head scheduled_jobs;
struct panfrost_perfcnt *perfcnt;
+ atomic_t profile_mode;
struct mutex sched_lock;
struct mutex shrinker_lock;
struct list_head shrinker_list;
- struct shrinker shrinker;
+ struct shrinker *shrinker;
struct panfrost_devfreq pfdevfreq;
+
+ struct {
+ atomic_t use_count;
+ spinlock_t lock;
+ } cycle_counter;
};
struct panfrost_mmu {
struct list_head list;
};
+struct panfrost_engine_usage {
+ unsigned long long elapsed_ns[NUM_JOB_SLOTS];
+ unsigned long long cycles[NUM_JOB_SLOTS];
+};
+
struct panfrost_file_priv {
struct panfrost_device *pfdev;
struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
struct panfrost_mmu *mmu;
+
+ struct panfrost_engine_usage engine_usage;
};
static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
#include "panfrost_job.h"
#include "panfrost_gpu.h"
#include "panfrost_perfcnt.h"
+#include "panfrost_debugfs.h"
static bool unstable_ioctls;
module_param_unsafe(unstable_ioctls, bool, 0600);
job->requirements = args->requirements;
job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
job->mmu = file_priv->mmu;
+ job->engine_usage = &file_priv->engine_usage;
slot = panfrost_job_get_slot(job);
PANFROST_IOCTL(MADVISE, madvise, DRM_RENDER_ALLOW),
};
-DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
+static void panfrost_gpu_show_fdinfo(struct panfrost_device *pfdev,
+ struct panfrost_file_priv *panfrost_priv,
+ struct drm_printer *p)
+{
+ int i;
+
+ /*
+ * IMPORTANT NOTE: drm-cycles and drm-engine measurements are not
+ * accurate, as they only provide a rough estimation of the number of
+ * GPU cycles and CPU time spent in a given context. This is due to two
+ * different factors:
+ * - Firstly, we must consider the time the CPU and then the kernel
+ * takes to process the GPU interrupt, which means additional time and
+ * GPU cycles will be added in excess to the real figure.
+ * - Secondly, the pipelining done by the Job Manager (2 job slots per
+ * engine) implies there is no way to know exactly how much time each
+ * job spent on the GPU.
+ */
+
+ static const char * const engine_names[] = {
+ "fragment", "vertex-tiler", "compute-only"
+ };
+
+ BUILD_BUG_ON(ARRAY_SIZE(engine_names) != NUM_JOB_SLOTS);
+
+ for (i = 0; i < NUM_JOB_SLOTS - 1; i++) {
+ drm_printf(p, "drm-engine-%s:\t%llu ns\n",
+ engine_names[i], panfrost_priv->engine_usage.elapsed_ns[i]);
+ drm_printf(p, "drm-cycles-%s:\t%llu\n",
+ engine_names[i], panfrost_priv->engine_usage.cycles[i]);
+ drm_printf(p, "drm-maxfreq-%s:\t%lu Hz\n",
+ engine_names[i], pfdev->pfdevfreq.fast_rate);
+ drm_printf(p, "drm-curfreq-%s:\t%lu Hz\n",
+ engine_names[i], pfdev->pfdevfreq.current_frequency);
+ }
+}
+
+static void panfrost_show_fdinfo(struct drm_printer *p, struct drm_file *file)
+{
+ struct drm_device *dev = file->minor->dev;
+ struct panfrost_device *pfdev = dev->dev_private;
+
+ panfrost_gpu_show_fdinfo(pfdev, file->driver_priv, p);
+
+ drm_show_memory_stats(p, file);
+}
+
+static const struct file_operations panfrost_drm_driver_fops = {
+ .owner = THIS_MODULE,
+ DRM_GEM_FOPS,
+ .show_fdinfo = drm_show_fdinfo,
+};
/*
* Panfrost driver version:
.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
.open = panfrost_open,
.postclose = panfrost_postclose,
+ .show_fdinfo = panfrost_show_fdinfo,
.ioctls = panfrost_drm_driver_ioctls,
.num_ioctls = ARRAY_SIZE(panfrost_drm_driver_ioctls),
.fops = &panfrost_drm_driver_fops,
.gem_create_object = panfrost_gem_create_object,
.gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table,
+
+#ifdef CONFIG_DEBUG_FS
+ .debugfs_init = panfrost_debugfs_init,
+#endif
};
static int panfrost_probe(struct platform_device *pdev)
if (err < 0)
goto err_out1;
- panfrost_gem_shrinker_init(ddev);
+ err = panfrost_gem_shrinker_init(ddev);
+ if (err)
+ goto err_out2;
return 0;
+ err_out2:
+ drm_dev_unregister(ddev);
err_out1:
pm_runtime_disable(pfdev->dev);
panfrost_device_fini(pfdev);
*/
atomic_t gpu_usecount;
+ /*
+ * Object chunk size currently mapped onto physical memory
+ */
+ size_t heap_rss_size;
+
bool noexec :1;
bool is_heap :1;
};
void panfrost_gem_mapping_put(struct panfrost_gem_mapping *mapping);
void panfrost_gem_teardown_mappings_locked(struct panfrost_gem_object *bo);
- void panfrost_gem_shrinker_init(struct drm_device *dev);
+ int panfrost_gem_shrinker_init(struct drm_device *dev);
void panfrost_gem_shrinker_cleanup(struct drm_device *dev);
#endif /* __PANFROST_GEM_H__ */
#define pr_fmt(fmt) "bcache: %s() " fmt, __func__
#include <linux/bio.h>
+#include <linux/closure.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include "bcache_ondisk.h"
#include "bset.h"
#include "util.h"
-#include "closure.h"
struct bucket {
atomic_t pin;
struct list_head list;
struct bcache_device disk;
struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
struct cache_sb sb;
struct cache_sb_disk *sb_disk;
struct kobject kobj;
struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
struct task_struct *alloc_thread;
struct bio_set bio_split;
/* For the btree cache */
- struct shrinker shrink;
+ struct shrinker *shrink;
/* For the btree cache and anything allocation related */
struct mutex bucket_lock;
cmd->discard_nr_blocks = to_dblock(le64_to_cpu(disk_super->discard_nr_blocks));
cmd->data_block_size = le32_to_cpu(disk_super->data_block_size);
cmd->cache_blocks = to_cblock(le32_to_cpu(disk_super->cache_blocks));
- strncpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name));
+ strscpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name));
cmd->policy_version[0] = le32_to_cpu(disk_super->policy_version[0]);
cmd->policy_version[1] = le32_to_cpu(disk_super->policy_version[1]);
cmd->policy_version[2] = le32_to_cpu(disk_super->policy_version[2]);
disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size);
disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks));
disk_super->cache_blocks = cpu_to_le32(from_cblock(cmd->cache_blocks));
- strncpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name));
+ strscpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name));
disk_super->policy_version[0] = cpu_to_le32(cmd->policy_version[0]);
disk_super->policy_version[1] = cpu_to_le32(cmd->policy_version[1]);
disk_super->policy_version[2] = cpu_to_le32(cmd->policy_version[2]);
(strlen(policy_name) > sizeof(cmd->policy_name) - 1))
return -EINVAL;
- strncpy(cmd->policy_name, policy_name, sizeof(cmd->policy_name));
+ strscpy(cmd->policy_name, policy_name, sizeof(cmd->policy_name));
memcpy(cmd->policy_version, policy_version, sizeof(cmd->policy_version));
hint_size = dm_cache_policy_get_hint_size(policy);
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
- * - must take shrinker_rwsem without holding cmd->root_lock
+ * - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);
"Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
static struct workqueue_struct *raid5_wq;
+static void raid5_quiesce(struct mddev *mddev, int quiesce);
+
static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
{
int hash = (sect >> RAID5_STRIPE_SHIFT(conf)) & HASH_MASK;
set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
r5l_wake_reclaim(conf->log, 0);
+
+ /* release batch_last before wait to avoid risk of deadlock */
+ if (ctx && ctx->batch_last) {
+ raid5_release_stripe(ctx->batch_last);
+ ctx->batch_last = NULL;
+ }
+
wait_event_lock_irq(conf->wait_for_stripe,
is_inactive_blocked(conf, hash),
*(conf->hash_locks + hash));
unsigned long cpu;
int err = 0;
- /*
- * Never shrink. And mddev_suspend() could deadlock if this is called
- * from raid5d. In that case, scribble_disks and scribble_sectors
- * should equal to new_disks and new_sectors
- */
+ /* Never shrink. */
if (conf->scribble_disks >= new_disks &&
conf->scribble_sectors >= new_sectors)
return 0;
- mddev_suspend(conf->mddev);
+
+ raid5_quiesce(conf->mddev, true);
cpus_read_lock();
for_each_present_cpu(cpu) {
}
cpus_read_unlock();
- mddev_resume(conf->mddev);
+ raid5_quiesce(conf->mddev, false);
+
if (!err) {
conf->scribble_disks = new_disks;
conf->scribble_sectors = new_sectors;
return ret;
}
-static bool reshape_inprogress(struct mddev *mddev)
-{
- return test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
- test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
- !test_bit(MD_RECOVERY_DONE, &mddev->recovery) &&
- !test_bit(MD_RECOVERY_INTR, &mddev->recovery);
-}
-
-static bool reshape_disabled(struct mddev *mddev)
-{
- return is_md_suspended(mddev) || !md_is_rdwr(mddev);
-}
-
static enum stripe_result make_stripe_request(struct mddev *mddev,
struct r5conf *conf, struct stripe_request_ctx *ctx,
sector_t logical_sector, struct bio *bi)
if (ahead_of_reshape(mddev, logical_sector,
conf->reshape_safe)) {
spin_unlock_irq(&conf->device_lock);
- ret = STRIPE_SCHEDULE_AND_RETRY;
- goto out;
+ return STRIPE_SCHEDULE_AND_RETRY;
}
}
spin_unlock_irq(&conf->device_lock);
out_release:
raid5_release_stripe(sh);
-out:
- if (ret == STRIPE_SCHEDULE_AND_RETRY && !reshape_inprogress(mddev) &&
- reshape_disabled(mddev)) {
- bi->bi_status = BLK_STS_IOERR;
- ret = STRIPE_FAIL;
- pr_err("md/raid456:%s: io failed across reshape position while reshape can't make progress.\n",
- mdname(mddev));
- }
-
return ret;
}
new != roundup_pow_of_two(new))
return -EINVAL;
- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
goto out_unlock;
}
- mddev_suspend(mddev);
mutex_lock(&conf->cache_size_mutex);
size = conf->max_nr_stripes;
err = -ENOMEM;
}
mutex_unlock(&conf->cache_size_mutex);
- mddev_resume(mddev);
out_unlock:
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err ?: len;
}
return -EINVAL;
new = !!new;
- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
conf = mddev->private;
else if (new != conf->skip_copy) {
struct request_queue *q = mddev->queue;
- mddev_suspend(mddev);
conf->skip_copy = new;
if (new)
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
else
blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
- mddev_resume(mddev);
}
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err ?: len;
}
if (new > 8192)
return -EINVAL;
- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
conf = mddev->private;
if (!conf)
err = -ENODEV;
else if (new != conf->worker_cnt_per_group) {
- mddev_suspend(mddev);
-
old_groups = conf->worker_groups;
if (old_groups)
flush_workqueue(raid5_wq);
kfree(old_groups[0].workers);
kfree(old_groups);
}
- mddev_resume(mddev);
}
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err ?: len;
}
log_exit(conf);
- unregister_shrinker(&conf->shrinker);
+ shrinker_free(conf->shrinker);
free_thread_groups(conf);
shrink_stripes(conf);
raid5_free_percpu(conf);
static unsigned long raid5_cache_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
+ struct r5conf *conf = shrink->private_data;
unsigned long ret = SHRINK_STOP;
if (mutex_trylock(&conf->cache_size_mutex)) {
static unsigned long raid5_cache_count(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
+ struct r5conf *conf = shrink->private_data;
if (conf->max_nr_stripes < conf->min_nr_stripes)
/* unlikely, but not impossible */
* it reduces the queue depth and so can hurt throughput.
* So set it rather large, scaled by number of devices.
*/
- conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
- conf->shrinker.scan_objects = raid5_cache_scan;
- conf->shrinker.count_objects = raid5_cache_count;
- conf->shrinker.batch = 128;
- conf->shrinker.flags = 0;
- ret = register_shrinker(&conf->shrinker, "md-raid5:%s", mdname(mddev));
- if (ret) {
- pr_warn("md/raid:%s: couldn't register shrinker.\n",
+ conf->shrinker = shrinker_alloc(0, "md-raid5:%s", mdname(mddev));
+ if (!conf->shrinker) {
+ ret = -ENOMEM;
+ pr_warn("md/raid:%s: couldn't allocate shrinker.\n",
mdname(mddev));
goto abort;
}
+ conf->shrinker->seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
+ conf->shrinker->scan_objects = raid5_cache_scan;
+ conf->shrinker->count_objects = raid5_cache_count;
+ conf->shrinker->batch = 128;
+ conf->shrinker->private_data = conf;
+
+ shrinker_register(conf->shrinker);
+
sprintf(pers_name, "raid%d", mddev->new_level);
rcu_assign_pointer(conf->thread,
md_register_thread(raid5d, mddev, pers_name));
long long min_offset_diff = 0;
int first = 1;
- if (mddev_init_writes_pending(mddev) < 0)
- return -ENOMEM;
-
if (mddev->recovery_cp != MaxSector)
pr_notice("md/raid:%s: not clean -- starting background reconstruction\n",
mdname(mddev));
* the reshape wasn't running - like Discard or Read - have
* completed.
*/
- mddev_suspend(mddev);
- mddev_resume(mddev);
+ raid5_quiesce(mddev, true);
+ raid5_quiesce(mddev, false);
/* Add some new drives, as many as will fit.
* We know there are enough to make the newly sized array work.
struct r5conf *conf;
int err;
- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
conf = mddev->private;
if (!conf) {
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return -ENODEV;
}
err = log_init(conf, NULL, true);
if (!err) {
err = resize_stripes(conf, conf->pool_size);
- if (err) {
- mddev_suspend(mddev);
+ if (err)
log_exit(conf);
- mddev_resume(mddev);
- }
}
} else
err = -EINVAL;
} else if (strncmp(buf, "resync", 6) == 0) {
if (raid5_has_ppl(conf)) {
- mddev_suspend(mddev);
log_exit(conf);
- mddev_resume(mddev);
err = resize_stripes(conf, conf->pool_size);
} else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
r5l_log_disk_error(conf)) {
break;
}
- if (!journal_dev_exists) {
- mddev_suspend(mddev);
+ if (!journal_dev_exists)
clear_bit(MD_HAS_JOURNAL, &mddev->flags);
- mddev_resume(mddev);
- } else /* need remove journal device first */
+ else /* need remove journal device first */
err = -EBUSY;
} else
err = -EINVAL;
if (!err)
md_update_sb(mddev, 1);
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err;
}
return r5l_start(conf->log);
}
-static void raid5_prepare_suspend(struct mddev *mddev)
-{
- struct r5conf *conf = mddev->private;
-
- wait_event(mddev->sb_wait, !reshape_inprogress(mddev) ||
- percpu_ref_is_zero(&mddev->active_io));
- if (percpu_ref_is_zero(&mddev->active_io))
- return;
-
- /*
- * Reshape is not in progress, and array is suspended, io that is
- * waiting for reshpape can never be done.
- */
- wake_up(&conf->wait_for_overlap);
-}
-
static struct md_personality raid6_personality =
{
.name = "raid6",
.check_reshape = raid6_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
- .prepare_suspend = raid5_prepare_suspend,
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
- .prepare_suspend = raid5_prepare_suspend,
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
- .prepare_suspend = raid5_prepare_suspend,
.quiesce = raid5_quiesce,
.takeover = raid4_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
/* Shrinker to return free pages - VIRTIO_BALLOON_F_FREE_PAGE_HINT */
- struct shrinker shrinker;
+ struct shrinker *shrinker;
/* OOM notifier to deflate on OOM - VIRTIO_BALLOON_F_DEFLATE_ON_OOM */
struct notifier_block oom_nb;
virtio_cread_le(vb->vdev, struct virtio_balloon_config, num_pages,
&num_pages);
- target = num_pages;
+ /*
+ * Aligned up to guest page size to avoid inflating and deflating
+ * balloon endlessly.
+ */
+ target = ALIGN(num_pages, VIRTIO_BALLOON_PAGES_PER_PAGE);
return target - vb->num_pages;
}
static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
struct shrink_control *sc)
{
- struct virtio_balloon *vb = container_of(shrinker,
- struct virtio_balloon, shrinker);
+ struct virtio_balloon *vb = shrinker->private_data;
return shrink_free_pages(vb, sc->nr_to_scan);
}
static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
struct shrink_control *sc)
{
- struct virtio_balloon *vb = container_of(shrinker,
- struct virtio_balloon, shrinker);
+ struct virtio_balloon *vb = shrinker->private_data;
return vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES;
}
static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
{
- unregister_shrinker(&vb->shrinker);
+ shrinker_free(vb->shrinker);
}
static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
{
- vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
- vb->shrinker.count_objects = virtio_balloon_shrinker_count;
- vb->shrinker.seeks = DEFAULT_SEEKS;
+ vb->shrinker = shrinker_alloc(0, "virtio-balloon");
+ if (!vb->shrinker)
+ return -ENOMEM;
- return register_shrinker(&vb->shrinker, "virtio-balloon");
+ vb->shrinker->scan_objects = virtio_balloon_shrinker_scan;
+ vb->shrinker->count_objects = virtio_balloon_shrinker_count;
+ vb->shrinker->private_data = vb;
+
+ shrinker_register(vb->shrinker);
+
+ return 0;
}
static int virtballoon_probe(struct virtio_device *vdev)
--- /dev/null
- struct bch_fs *c = container_of(shrink, struct bch_fs,
- btree_cache.shrink);
+// SPDX-License-Identifier: GPL-2.0
+
+#include "bcachefs.h"
+#include "bkey_buf.h"
+#include "btree_cache.h"
+#include "btree_io.h"
+#include "btree_iter.h"
+#include "btree_locking.h"
+#include "debug.h"
+#include "errcode.h"
+#include "error.h"
+#include "trace.h"
+
+#include <linux/prefetch.h>
+#include <linux/sched/mm.h>
+
+const char * const bch2_btree_node_flags[] = {
+#define x(f) #f,
+ BTREE_FLAGS()
+#undef x
+ NULL
+};
+
+void bch2_recalc_btree_reserve(struct bch_fs *c)
+{
+ unsigned i, reserve = 16;
+
+ if (!c->btree_roots_known[0].b)
+ reserve += 8;
+
+ for (i = 0; i < btree_id_nr_alive(c); i++) {
+ struct btree_root *r = bch2_btree_id_root(c, i);
+
+ if (r->b)
+ reserve += min_t(unsigned, 1, r->b->c.level) * 8;
+ }
+
+ c->btree_cache.reserve = reserve;
+}
+
+static inline unsigned btree_cache_can_free(struct btree_cache *bc)
+{
+ return max_t(int, 0, bc->used - bc->reserve);
+}
+
+static void btree_node_to_freedlist(struct btree_cache *bc, struct btree *b)
+{
+ if (b->c.lock.readers)
+ list_move(&b->list, &bc->freed_pcpu);
+ else
+ list_move(&b->list, &bc->freed_nonpcpu);
+}
+
+static void btree_node_data_free(struct bch_fs *c, struct btree *b)
+{
+ struct btree_cache *bc = &c->btree_cache;
+
+ EBUG_ON(btree_node_write_in_flight(b));
+
+ clear_btree_node_just_written(b);
+
+ kvpfree(b->data, btree_bytes(c));
+ b->data = NULL;
+#ifdef __KERNEL__
+ kvfree(b->aux_data);
+#else
+ munmap(b->aux_data, btree_aux_data_bytes(b));
+#endif
+ b->aux_data = NULL;
+
+ bc->used--;
+
+ btree_node_to_freedlist(bc, b);
+}
+
+static int bch2_btree_cache_cmp_fn(struct rhashtable_compare_arg *arg,
+ const void *obj)
+{
+ const struct btree *b = obj;
+ const u64 *v = arg->key;
+
+ return b->hash_val == *v ? 0 : 1;
+}
+
+static const struct rhashtable_params bch_btree_cache_params = {
+ .head_offset = offsetof(struct btree, hash),
+ .key_offset = offsetof(struct btree, hash_val),
+ .key_len = sizeof(u64),
+ .obj_cmpfn = bch2_btree_cache_cmp_fn,
+};
+
+static int btree_node_data_alloc(struct bch_fs *c, struct btree *b, gfp_t gfp)
+{
+ BUG_ON(b->data || b->aux_data);
+
+ b->data = kvpmalloc(btree_bytes(c), gfp);
+ if (!b->data)
+ return -BCH_ERR_ENOMEM_btree_node_mem_alloc;
+#ifdef __KERNEL__
+ b->aux_data = kvmalloc(btree_aux_data_bytes(b), gfp);
+#else
+ b->aux_data = mmap(NULL, btree_aux_data_bytes(b),
+ PROT_READ|PROT_WRITE|PROT_EXEC,
+ MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+ if (b->aux_data == MAP_FAILED)
+ b->aux_data = NULL;
+#endif
+ if (!b->aux_data) {
+ kvpfree(b->data, btree_bytes(c));
+ b->data = NULL;
+ return -BCH_ERR_ENOMEM_btree_node_mem_alloc;
+ }
+
+ return 0;
+}
+
+static struct btree *__btree_node_mem_alloc(struct bch_fs *c, gfp_t gfp)
+{
+ struct btree *b;
+
+ b = kzalloc(sizeof(struct btree), gfp);
+ if (!b)
+ return NULL;
+
+ bkey_btree_ptr_init(&b->key);
+ INIT_LIST_HEAD(&b->list);
+ INIT_LIST_HEAD(&b->write_blocked);
+ b->byte_order = ilog2(btree_bytes(c));
+ return b;
+}
+
+struct btree *__bch2_btree_node_mem_alloc(struct bch_fs *c)
+{
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+
+ b = __btree_node_mem_alloc(c, GFP_KERNEL);
+ if (!b)
+ return NULL;
+
+ if (btree_node_data_alloc(c, b, GFP_KERNEL)) {
+ kfree(b);
+ return NULL;
+ }
+
+ bch2_btree_lock_init(&b->c, 0);
+
+ bc->used++;
+ list_add(&b->list, &bc->freeable);
+ return b;
+}
+
+/* Btree in memory cache - hash table */
+
+void bch2_btree_node_hash_remove(struct btree_cache *bc, struct btree *b)
+{
+ int ret = rhashtable_remove_fast(&bc->table, &b->hash, bch_btree_cache_params);
+
+ BUG_ON(ret);
+
+ /* Cause future lookups for this node to fail: */
+ b->hash_val = 0;
+}
+
+int __bch2_btree_node_hash_insert(struct btree_cache *bc, struct btree *b)
+{
+ BUG_ON(b->hash_val);
+ b->hash_val = btree_ptr_hash_val(&b->key);
+
+ return rhashtable_lookup_insert_fast(&bc->table, &b->hash,
+ bch_btree_cache_params);
+}
+
+int bch2_btree_node_hash_insert(struct btree_cache *bc, struct btree *b,
+ unsigned level, enum btree_id id)
+{
+ int ret;
+
+ b->c.level = level;
+ b->c.btree_id = id;
+
+ mutex_lock(&bc->lock);
+ ret = __bch2_btree_node_hash_insert(bc, b);
+ if (!ret)
+ list_add_tail(&b->list, &bc->live);
+ mutex_unlock(&bc->lock);
+
+ return ret;
+}
+
+__flatten
+static inline struct btree *btree_cache_find(struct btree_cache *bc,
+ const struct bkey_i *k)
+{
+ u64 v = btree_ptr_hash_val(k);
+
+ return rhashtable_lookup_fast(&bc->table, &v, bch_btree_cache_params);
+}
+
+/*
+ * this version is for btree nodes that have already been freed (we're not
+ * reaping a real btree node)
+ */
+static int __btree_node_reclaim(struct bch_fs *c, struct btree *b, bool flush)
+{
+ struct btree_cache *bc = &c->btree_cache;
+ int ret = 0;
+
+ lockdep_assert_held(&bc->lock);
+wait_on_io:
+ if (b->flags & ((1U << BTREE_NODE_dirty)|
+ (1U << BTREE_NODE_read_in_flight)|
+ (1U << BTREE_NODE_write_in_flight))) {
+ if (!flush)
+ return -BCH_ERR_ENOMEM_btree_node_reclaim;
+
+ /* XXX: waiting on IO with btree cache lock held */
+ bch2_btree_node_wait_on_read(b);
+ bch2_btree_node_wait_on_write(b);
+ }
+
+ if (!six_trylock_intent(&b->c.lock))
+ return -BCH_ERR_ENOMEM_btree_node_reclaim;
+
+ if (!six_trylock_write(&b->c.lock))
+ goto out_unlock_intent;
+
+ /* recheck under lock */
+ if (b->flags & ((1U << BTREE_NODE_read_in_flight)|
+ (1U << BTREE_NODE_write_in_flight))) {
+ if (!flush)
+ goto out_unlock;
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+ goto wait_on_io;
+ }
+
+ if (btree_node_noevict(b) ||
+ btree_node_write_blocked(b) ||
+ btree_node_will_make_reachable(b))
+ goto out_unlock;
+
+ if (btree_node_dirty(b)) {
+ if (!flush)
+ goto out_unlock;
+ /*
+ * Using the underscore version because we don't want to compact
+ * bsets after the write, since this node is about to be evicted
+ * - unless btree verify mode is enabled, since it runs out of
+ * the post write cleanup:
+ */
+ if (bch2_verify_btree_ondisk)
+ bch2_btree_node_write(c, b, SIX_LOCK_intent,
+ BTREE_WRITE_cache_reclaim);
+ else
+ __bch2_btree_node_write(c, b,
+ BTREE_WRITE_cache_reclaim);
+
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+ goto wait_on_io;
+ }
+out:
+ if (b->hash_val && !ret)
+ trace_and_count(c, btree_cache_reap, c, b);
+ return ret;
+out_unlock:
+ six_unlock_write(&b->c.lock);
+out_unlock_intent:
+ six_unlock_intent(&b->c.lock);
+ ret = -BCH_ERR_ENOMEM_btree_node_reclaim;
+ goto out;
+}
+
+static int btree_node_reclaim(struct bch_fs *c, struct btree *b)
+{
+ return __btree_node_reclaim(c, b, false);
+}
+
+static int btree_node_write_and_reclaim(struct bch_fs *c, struct btree *b)
+{
+ return __btree_node_reclaim(c, b, true);
+}
+
+static unsigned long bch2_btree_cache_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
- struct bch_fs *c = container_of(shrink, struct bch_fs,
- btree_cache.shrink);
++ struct bch_fs *c = shrink->private_data;
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b, *t;
+ unsigned long nr = sc->nr_to_scan;
+ unsigned long can_free = 0;
+ unsigned long freed = 0;
+ unsigned long touched = 0;
+ unsigned i, flags;
+ unsigned long ret = SHRINK_STOP;
+ bool trigger_writes = atomic_read(&bc->dirty) + nr >=
+ bc->used * 3 / 4;
+
+ if (bch2_btree_shrinker_disabled)
+ return SHRINK_STOP;
+
+ mutex_lock(&bc->lock);
+ flags = memalloc_nofs_save();
+
+ /*
+ * It's _really_ critical that we don't free too many btree nodes - we
+ * have to always leave ourselves a reserve. The reserve is how we
+ * guarantee that allocating memory for a new btree node can always
+ * succeed, so that inserting keys into the btree can always succeed and
+ * IO can always make forward progress:
+ */
+ can_free = btree_cache_can_free(bc);
+ nr = min_t(unsigned long, nr, can_free);
+
+ i = 0;
+ list_for_each_entry_safe(b, t, &bc->freeable, list) {
+ /*
+ * Leave a few nodes on the freeable list, so that a btree split
+ * won't have to hit the system allocator:
+ */
+ if (++i <= 3)
+ continue;
+
+ touched++;
+
+ if (touched >= nr)
+ goto out;
+
+ if (!btree_node_reclaim(c, b)) {
+ btree_node_data_free(c, b);
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+ freed++;
+ }
+ }
+restart:
+ list_for_each_entry_safe(b, t, &bc->live, list) {
+ touched++;
+
+ if (btree_node_accessed(b)) {
+ clear_btree_node_accessed(b);
+ } else if (!btree_node_reclaim(c, b)) {
+ freed++;
+ btree_node_data_free(c, b);
+
+ bch2_btree_node_hash_remove(bc, b);
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+
+ if (freed == nr)
+ goto out_rotate;
+ } else if (trigger_writes &&
+ btree_node_dirty(b) &&
+ !btree_node_will_make_reachable(b) &&
+ !btree_node_write_blocked(b) &&
+ six_trylock_read(&b->c.lock)) {
+ list_move(&bc->live, &b->list);
+ mutex_unlock(&bc->lock);
+ __bch2_btree_node_write(c, b, BTREE_WRITE_cache_reclaim);
+ six_unlock_read(&b->c.lock);
+ if (touched >= nr)
+ goto out_nounlock;
+ mutex_lock(&bc->lock);
+ goto restart;
+ }
+
+ if (touched >= nr)
+ break;
+ }
+out_rotate:
+ if (&t->list != &bc->live)
+ list_move_tail(&bc->live, &t->list);
+out:
+ mutex_unlock(&bc->lock);
+out_nounlock:
+ ret = freed;
+ memalloc_nofs_restore(flags);
+ trace_and_count(c, btree_cache_scan, sc->nr_to_scan, can_free, ret);
+ return ret;
+}
+
+static unsigned long bch2_btree_cache_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
- unregister_shrinker(&bc->shrink);
++ struct bch_fs *c = shrink->private_data;
+ struct btree_cache *bc = &c->btree_cache;
+
+ if (bch2_btree_shrinker_disabled)
+ return 0;
+
+ return btree_cache_can_free(bc);
+}
+
+void bch2_fs_btree_cache_exit(struct bch_fs *c)
+{
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+ unsigned i, flags;
+
- bc->shrink.count_objects = bch2_btree_cache_count;
- bc->shrink.scan_objects = bch2_btree_cache_scan;
- bc->shrink.seeks = 4;
- ret = register_shrinker(&bc->shrink, "%s/btree_cache", c->name);
- if (ret)
++ shrinker_free(bc->shrink);
+
+ /* vfree() can allocate memory: */
+ flags = memalloc_nofs_save();
+ mutex_lock(&bc->lock);
+
+ if (c->verify_data)
+ list_move(&c->verify_data->list, &bc->live);
+
+ kvpfree(c->verify_ondisk, btree_bytes(c));
+
+ for (i = 0; i < btree_id_nr_alive(c); i++) {
+ struct btree_root *r = bch2_btree_id_root(c, i);
+
+ if (r->b)
+ list_add(&r->b->list, &bc->live);
+ }
+
+ list_splice(&bc->freeable, &bc->live);
+
+ while (!list_empty(&bc->live)) {
+ b = list_first_entry(&bc->live, struct btree, list);
+
+ BUG_ON(btree_node_read_in_flight(b) ||
+ btree_node_write_in_flight(b));
+
+ if (btree_node_dirty(b))
+ bch2_btree_complete_write(c, b, btree_current_write(b));
+ clear_btree_node_dirty_acct(c, b);
+
+ btree_node_data_free(c, b);
+ }
+
+ BUG_ON(atomic_read(&c->btree_cache.dirty));
+
+ list_splice(&bc->freed_pcpu, &bc->freed_nonpcpu);
+
+ while (!list_empty(&bc->freed_nonpcpu)) {
+ b = list_first_entry(&bc->freed_nonpcpu, struct btree, list);
+ list_del(&b->list);
+ six_lock_exit(&b->c.lock);
+ kfree(b);
+ }
+
+ mutex_unlock(&bc->lock);
+ memalloc_nofs_restore(flags);
+
+ if (bc->table_init_done)
+ rhashtable_destroy(&bc->table);
+}
+
+int bch2_fs_btree_cache_init(struct bch_fs *c)
+{
+ struct btree_cache *bc = &c->btree_cache;
++ struct shrinker *shrink;
+ unsigned i;
+ int ret = 0;
+
+ ret = rhashtable_init(&bc->table, &bch_btree_cache_params);
+ if (ret)
+ goto err;
+
+ bc->table_init_done = true;
+
+ bch2_recalc_btree_reserve(c);
+
+ for (i = 0; i < bc->reserve; i++)
+ if (!__bch2_btree_node_mem_alloc(c))
+ goto err;
+
+ list_splice_init(&bc->live, &bc->freeable);
+
+ mutex_init(&c->verify_lock);
+
++ shrink = shrinker_alloc(0, "%s/btree_cache", c->name);
++ if (!shrink)
+ goto err;
++ bc->shrink = shrink;
++ shrink->count_objects = bch2_btree_cache_count;
++ shrink->scan_objects = bch2_btree_cache_scan;
++ shrink->seeks = 4;
++ shrink->private_data = c;
++ shrinker_register(shrink);
+
+ return 0;
+err:
+ return -BCH_ERR_ENOMEM_fs_btree_cache_init;
+}
+
+void bch2_fs_btree_cache_init_early(struct btree_cache *bc)
+{
+ mutex_init(&bc->lock);
+ INIT_LIST_HEAD(&bc->live);
+ INIT_LIST_HEAD(&bc->freeable);
+ INIT_LIST_HEAD(&bc->freed_pcpu);
+ INIT_LIST_HEAD(&bc->freed_nonpcpu);
+}
+
+/*
+ * We can only have one thread cannibalizing other cached btree nodes at a time,
+ * or we'll deadlock. We use an open coded mutex to ensure that, which a
+ * cannibalize_bucket() will take. This means every time we unlock the root of
+ * the btree, we need to release this lock if we have it held.
+ */
+void bch2_btree_cache_cannibalize_unlock(struct bch_fs *c)
+{
+ struct btree_cache *bc = &c->btree_cache;
+
+ if (bc->alloc_lock == current) {
+ trace_and_count(c, btree_cache_cannibalize_unlock, c);
+ bc->alloc_lock = NULL;
+ closure_wake_up(&bc->alloc_wait);
+ }
+}
+
+int bch2_btree_cache_cannibalize_lock(struct bch_fs *c, struct closure *cl)
+{
+ struct btree_cache *bc = &c->btree_cache;
+ struct task_struct *old;
+
+ old = cmpxchg(&bc->alloc_lock, NULL, current);
+ if (old == NULL || old == current)
+ goto success;
+
+ if (!cl) {
+ trace_and_count(c, btree_cache_cannibalize_lock_fail, c);
+ return -BCH_ERR_ENOMEM_btree_cache_cannibalize_lock;
+ }
+
+ closure_wait(&bc->alloc_wait, cl);
+
+ /* Try again, after adding ourselves to waitlist */
+ old = cmpxchg(&bc->alloc_lock, NULL, current);
+ if (old == NULL || old == current) {
+ /* We raced */
+ closure_wake_up(&bc->alloc_wait);
+ goto success;
+ }
+
+ trace_and_count(c, btree_cache_cannibalize_lock_fail, c);
+ return -BCH_ERR_btree_cache_cannibalize_lock_blocked;
+
+success:
+ trace_and_count(c, btree_cache_cannibalize_lock, c);
+ return 0;
+}
+
+static struct btree *btree_node_cannibalize(struct bch_fs *c)
+{
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+
+ list_for_each_entry_reverse(b, &bc->live, list)
+ if (!btree_node_reclaim(c, b))
+ return b;
+
+ while (1) {
+ list_for_each_entry_reverse(b, &bc->live, list)
+ if (!btree_node_write_and_reclaim(c, b))
+ return b;
+
+ /*
+ * Rare case: all nodes were intent-locked.
+ * Just busy-wait.
+ */
+ WARN_ONCE(1, "btree cache cannibalize failed\n");
+ cond_resched();
+ }
+}
+
+struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_read_locks)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_cache *bc = &c->btree_cache;
+ struct list_head *freed = pcpu_read_locks
+ ? &bc->freed_pcpu
+ : &bc->freed_nonpcpu;
+ struct btree *b, *b2;
+ u64 start_time = local_clock();
+ unsigned flags;
+
+ flags = memalloc_nofs_save();
+ mutex_lock(&bc->lock);
+
+ /*
+ * We never free struct btree itself, just the memory that holds the on
+ * disk node. Check the freed list before allocating a new one:
+ */
+ list_for_each_entry(b, freed, list)
+ if (!btree_node_reclaim(c, b)) {
+ list_del_init(&b->list);
+ goto got_node;
+ }
+
+ b = __btree_node_mem_alloc(c, GFP_NOWAIT|__GFP_NOWARN);
+ if (!b) {
+ mutex_unlock(&bc->lock);
+ bch2_trans_unlock(trans);
+ b = __btree_node_mem_alloc(c, GFP_KERNEL);
+ if (!b)
+ goto err;
+ mutex_lock(&bc->lock);
+ }
+
+ bch2_btree_lock_init(&b->c, pcpu_read_locks ? SIX_LOCK_INIT_PCPU : 0);
+
+ BUG_ON(!six_trylock_intent(&b->c.lock));
+ BUG_ON(!six_trylock_write(&b->c.lock));
+got_node:
+
+ /*
+ * btree_free() doesn't free memory; it sticks the node on the end of
+ * the list. Check if there's any freed nodes there:
+ */
+ list_for_each_entry(b2, &bc->freeable, list)
+ if (!btree_node_reclaim(c, b2)) {
+ swap(b->data, b2->data);
+ swap(b->aux_data, b2->aux_data);
+ btree_node_to_freedlist(bc, b2);
+ six_unlock_write(&b2->c.lock);
+ six_unlock_intent(&b2->c.lock);
+ goto got_mem;
+ }
+
+ mutex_unlock(&bc->lock);
+
+ if (btree_node_data_alloc(c, b, GFP_NOWAIT|__GFP_NOWARN)) {
+ bch2_trans_unlock(trans);
+ if (btree_node_data_alloc(c, b, GFP_KERNEL|__GFP_NOWARN))
+ goto err;
+ }
+
+ mutex_lock(&bc->lock);
+ bc->used++;
+got_mem:
+ mutex_unlock(&bc->lock);
+
+ BUG_ON(btree_node_hashed(b));
+ BUG_ON(btree_node_dirty(b));
+ BUG_ON(btree_node_write_in_flight(b));
+out:
+ b->flags = 0;
+ b->written = 0;
+ b->nsets = 0;
+ b->sib_u64s[0] = 0;
+ b->sib_u64s[1] = 0;
+ b->whiteout_u64s = 0;
+ bch2_btree_keys_init(b);
+ set_btree_node_accessed(b);
+
+ bch2_time_stats_update(&c->times[BCH_TIME_btree_node_mem_alloc],
+ start_time);
+
+ memalloc_nofs_restore(flags);
+ return b;
+err:
+ mutex_lock(&bc->lock);
+
+ /* Try to cannibalize another cached btree node: */
+ if (bc->alloc_lock == current) {
+ b2 = btree_node_cannibalize(c);
+ clear_btree_node_just_written(b2);
+ bch2_btree_node_hash_remove(bc, b2);
+
+ if (b) {
+ swap(b->data, b2->data);
+ swap(b->aux_data, b2->aux_data);
+ btree_node_to_freedlist(bc, b2);
+ six_unlock_write(&b2->c.lock);
+ six_unlock_intent(&b2->c.lock);
+ } else {
+ b = b2;
+ list_del_init(&b->list);
+ }
+
+ mutex_unlock(&bc->lock);
+
+ trace_and_count(c, btree_cache_cannibalize, c);
+ goto out;
+ }
+
+ mutex_unlock(&bc->lock);
+ memalloc_nofs_restore(flags);
+ return ERR_PTR(-BCH_ERR_ENOMEM_btree_node_mem_alloc);
+}
+
+/* Slowpath, don't want it inlined into btree_iter_traverse() */
+static noinline struct btree *bch2_btree_node_fill(struct btree_trans *trans,
+ struct btree_path *path,
+ const struct bkey_i *k,
+ enum btree_id btree_id,
+ unsigned level,
+ enum six_lock_type lock_type,
+ bool sync)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+ u32 seq;
+
+ BUG_ON(level + 1 >= BTREE_MAX_DEPTH);
+ /*
+ * Parent node must be locked, else we could read in a btree node that's
+ * been freed:
+ */
+ if (path && !bch2_btree_node_relock(trans, path, level + 1)) {
+ trace_and_count(c, trans_restart_relock_parent_for_fill, trans, _THIS_IP_, path);
+ return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_fill_relock));
+ }
+
+ b = bch2_btree_node_mem_alloc(trans, level != 0);
+
+ if (bch2_err_matches(PTR_ERR_OR_ZERO(b), ENOMEM)) {
+ trans->memory_allocation_failure = true;
+ trace_and_count(c, trans_restart_memory_allocation_failure, trans, _THIS_IP_, path);
+ return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_fill_mem_alloc_fail));
+ }
+
+ if (IS_ERR(b))
+ return b;
+
+ /*
+ * Btree nodes read in from disk should not have the accessed bit set
+ * initially, so that linear scans don't thrash the cache:
+ */
+ clear_btree_node_accessed(b);
+
+ bkey_copy(&b->key, k);
+ if (bch2_btree_node_hash_insert(bc, b, level, btree_id)) {
+ /* raced with another fill: */
+
+ /* mark as unhashed... */
+ b->hash_val = 0;
+
+ mutex_lock(&bc->lock);
+ list_add(&b->list, &bc->freeable);
+ mutex_unlock(&bc->lock);
+
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+ return NULL;
+ }
+
+ set_btree_node_read_in_flight(b);
+
+ six_unlock_write(&b->c.lock);
+ seq = six_lock_seq(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+
+ /* Unlock before doing IO: */
+ if (path && sync)
+ bch2_trans_unlock_noassert(trans);
+
+ bch2_btree_node_read(c, b, sync);
+
+ if (!sync)
+ return NULL;
+
+ if (path) {
+ int ret = bch2_trans_relock(trans) ?:
+ bch2_btree_path_relock_intent(trans, path);
+ if (ret) {
+ BUG_ON(!trans->restarted);
+ return ERR_PTR(ret);
+ }
+ }
+
+ if (!six_relock_type(&b->c.lock, lock_type, seq)) {
+ if (path)
+ trace_and_count(c, trans_restart_relock_after_fill, trans, _THIS_IP_, path);
+ return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_after_fill));
+ }
+
+ return b;
+}
+
+static noinline void btree_bad_header(struct bch_fs *c, struct btree *b)
+{
+ struct printbuf buf = PRINTBUF;
+
+ if (c->curr_recovery_pass <= BCH_RECOVERY_PASS_check_allocations)
+ return;
+
+ prt_printf(&buf,
+ "btree node header doesn't match ptr\n"
+ "btree %s level %u\n"
+ "ptr: ",
+ bch2_btree_ids[b->c.btree_id], b->c.level);
+ bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key));
+
+ prt_printf(&buf, "\nheader: btree %s level %llu\n"
+ "min ",
+ bch2_btree_ids[BTREE_NODE_ID(b->data)],
+ BTREE_NODE_LEVEL(b->data));
+ bch2_bpos_to_text(&buf, b->data->min_key);
+
+ prt_printf(&buf, "\nmax ");
+ bch2_bpos_to_text(&buf, b->data->max_key);
+
+ bch2_fs_inconsistent(c, "%s", buf.buf);
+ printbuf_exit(&buf);
+}
+
+static inline void btree_check_header(struct bch_fs *c, struct btree *b)
+{
+ if (b->c.btree_id != BTREE_NODE_ID(b->data) ||
+ b->c.level != BTREE_NODE_LEVEL(b->data) ||
+ !bpos_eq(b->data->max_key, b->key.k.p) ||
+ (b->key.k.type == KEY_TYPE_btree_ptr_v2 &&
+ !bpos_eq(b->data->min_key,
+ bkey_i_to_btree_ptr_v2(&b->key)->v.min_key)))
+ btree_bad_header(c, b);
+}
+
+static struct btree *__bch2_btree_node_get(struct btree_trans *trans, struct btree_path *path,
+ const struct bkey_i *k, unsigned level,
+ enum six_lock_type lock_type,
+ unsigned long trace_ip)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+ struct bset_tree *t;
+ bool need_relock = false;
+ int ret;
+
+ EBUG_ON(level >= BTREE_MAX_DEPTH);
+retry:
+ b = btree_cache_find(bc, k);
+ if (unlikely(!b)) {
+ /*
+ * We must have the parent locked to call bch2_btree_node_fill(),
+ * else we could read in a btree node from disk that's been
+ * freed:
+ */
+ b = bch2_btree_node_fill(trans, path, k, path->btree_id,
+ level, lock_type, true);
+ need_relock = true;
+
+ /* We raced and found the btree node in the cache */
+ if (!b)
+ goto retry;
+
+ if (IS_ERR(b))
+ return b;
+ } else {
+ if (btree_node_read_locked(path, level + 1))
+ btree_node_unlock(trans, path, level + 1);
+
+ ret = btree_node_lock(trans, path, &b->c, level, lock_type, trace_ip);
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ return ERR_PTR(ret);
+
+ BUG_ON(ret);
+
+ if (unlikely(b->hash_val != btree_ptr_hash_val(k) ||
+ b->c.level != level ||
+ race_fault())) {
+ six_unlock_type(&b->c.lock, lock_type);
+ if (bch2_btree_node_relock(trans, path, level + 1))
+ goto retry;
+
+ trace_and_count(c, trans_restart_btree_node_reused, trans, trace_ip, path);
+ return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_lock_node_reused));
+ }
+
+ /* avoid atomic set bit if it's not needed: */
+ if (!btree_node_accessed(b))
+ set_btree_node_accessed(b);
+ }
+
+ if (unlikely(btree_node_read_in_flight(b))) {
+ u32 seq = six_lock_seq(&b->c.lock);
+
+ six_unlock_type(&b->c.lock, lock_type);
+ bch2_trans_unlock(trans);
+ need_relock = true;
+
+ bch2_btree_node_wait_on_read(b);
+
+ /*
+ * should_be_locked is not set on this path yet, so we need to
+ * relock it specifically:
+ */
+ if (!six_relock_type(&b->c.lock, lock_type, seq))
+ goto retry;
+ }
+
+ if (unlikely(need_relock)) {
+ ret = bch2_trans_relock(trans) ?:
+ bch2_btree_path_relock_intent(trans, path);
+ if (ret) {
+ six_unlock_type(&b->c.lock, lock_type);
+ return ERR_PTR(ret);
+ }
+ }
+
+ prefetch(b->aux_data);
+
+ for_each_bset(b, t) {
+ void *p = (u64 *) b->aux_data + t->aux_data_offset;
+
+ prefetch(p + L1_CACHE_BYTES * 0);
+ prefetch(p + L1_CACHE_BYTES * 1);
+ prefetch(p + L1_CACHE_BYTES * 2);
+ }
+
+ if (unlikely(btree_node_read_error(b))) {
+ six_unlock_type(&b->c.lock, lock_type);
+ return ERR_PTR(-EIO);
+ }
+
+ EBUG_ON(b->c.btree_id != path->btree_id);
+ EBUG_ON(BTREE_NODE_LEVEL(b->data) != level);
+ btree_check_header(c, b);
+
+ return b;
+}
+
+/**
+ * bch2_btree_node_get - find a btree node in the cache and lock it, reading it
+ * in from disk if necessary.
+ *
+ * @trans: btree transaction object
+ * @path: btree_path being traversed
+ * @k: pointer to btree node (generally KEY_TYPE_btree_ptr_v2)
+ * @level: level of btree node being looked up (0 == leaf node)
+ * @lock_type: SIX_LOCK_read or SIX_LOCK_intent
+ * @trace_ip: ip of caller of btree iterator code (i.e. caller of bch2_btree_iter_peek())
+ *
+ * The btree node will have either a read or a write lock held, depending on
+ * the @write parameter.
+ *
+ * Returns: btree node or ERR_PTR()
+ */
+struct btree *bch2_btree_node_get(struct btree_trans *trans, struct btree_path *path,
+ const struct bkey_i *k, unsigned level,
+ enum six_lock_type lock_type,
+ unsigned long trace_ip)
+{
+ struct bch_fs *c = trans->c;
+ struct btree *b;
+ struct bset_tree *t;
+ int ret;
+
+ EBUG_ON(level >= BTREE_MAX_DEPTH);
+
+ b = btree_node_mem_ptr(k);
+
+ /*
+ * Check b->hash_val _before_ calling btree_node_lock() - this might not
+ * be the node we want anymore, and trying to lock the wrong node could
+ * cause an unneccessary transaction restart:
+ */
+ if (unlikely(!c->opts.btree_node_mem_ptr_optimization ||
+ !b ||
+ b->hash_val != btree_ptr_hash_val(k)))
+ return __bch2_btree_node_get(trans, path, k, level, lock_type, trace_ip);
+
+ if (btree_node_read_locked(path, level + 1))
+ btree_node_unlock(trans, path, level + 1);
+
+ ret = btree_node_lock(trans, path, &b->c, level, lock_type, trace_ip);
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ return ERR_PTR(ret);
+
+ BUG_ON(ret);
+
+ if (unlikely(b->hash_val != btree_ptr_hash_val(k) ||
+ b->c.level != level ||
+ race_fault())) {
+ six_unlock_type(&b->c.lock, lock_type);
+ if (bch2_btree_node_relock(trans, path, level + 1))
+ return __bch2_btree_node_get(trans, path, k, level, lock_type, trace_ip);
+
+ trace_and_count(c, trans_restart_btree_node_reused, trans, trace_ip, path);
+ return ERR_PTR(btree_trans_restart(trans, BCH_ERR_transaction_restart_lock_node_reused));
+ }
+
+ if (unlikely(btree_node_read_in_flight(b))) {
+ six_unlock_type(&b->c.lock, lock_type);
+ return __bch2_btree_node_get(trans, path, k, level, lock_type, trace_ip);
+ }
+
+ prefetch(b->aux_data);
+
+ for_each_bset(b, t) {
+ void *p = (u64 *) b->aux_data + t->aux_data_offset;
+
+ prefetch(p + L1_CACHE_BYTES * 0);
+ prefetch(p + L1_CACHE_BYTES * 1);
+ prefetch(p + L1_CACHE_BYTES * 2);
+ }
+
+ /* avoid atomic set bit if it's not needed: */
+ if (!btree_node_accessed(b))
+ set_btree_node_accessed(b);
+
+ if (unlikely(btree_node_read_error(b))) {
+ six_unlock_type(&b->c.lock, lock_type);
+ return ERR_PTR(-EIO);
+ }
+
+ EBUG_ON(b->c.btree_id != path->btree_id);
+ EBUG_ON(BTREE_NODE_LEVEL(b->data) != level);
+ btree_check_header(c, b);
+
+ return b;
+}
+
+struct btree *bch2_btree_node_get_noiter(struct btree_trans *trans,
+ const struct bkey_i *k,
+ enum btree_id btree_id,
+ unsigned level,
+ bool nofill)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+ struct bset_tree *t;
+ int ret;
+
+ EBUG_ON(level >= BTREE_MAX_DEPTH);
+
+ if (c->opts.btree_node_mem_ptr_optimization) {
+ b = btree_node_mem_ptr(k);
+ if (b)
+ goto lock_node;
+ }
+retry:
+ b = btree_cache_find(bc, k);
+ if (unlikely(!b)) {
+ if (nofill)
+ goto out;
+
+ b = bch2_btree_node_fill(trans, NULL, k, btree_id,
+ level, SIX_LOCK_read, true);
+
+ /* We raced and found the btree node in the cache */
+ if (!b)
+ goto retry;
+
+ if (IS_ERR(b) &&
+ !bch2_btree_cache_cannibalize_lock(c, NULL))
+ goto retry;
+
+ if (IS_ERR(b))
+ goto out;
+ } else {
+lock_node:
+ ret = btree_node_lock_nopath(trans, &b->c, SIX_LOCK_read, _THIS_IP_);
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ return ERR_PTR(ret);
+
+ BUG_ON(ret);
+
+ if (unlikely(b->hash_val != btree_ptr_hash_val(k) ||
+ b->c.btree_id != btree_id ||
+ b->c.level != level)) {
+ six_unlock_read(&b->c.lock);
+ goto retry;
+ }
+ }
+
+ /* XXX: waiting on IO with btree locks held: */
+ __bch2_btree_node_wait_on_read(b);
+
+ prefetch(b->aux_data);
+
+ for_each_bset(b, t) {
+ void *p = (u64 *) b->aux_data + t->aux_data_offset;
+
+ prefetch(p + L1_CACHE_BYTES * 0);
+ prefetch(p + L1_CACHE_BYTES * 1);
+ prefetch(p + L1_CACHE_BYTES * 2);
+ }
+
+ /* avoid atomic set bit if it's not needed: */
+ if (!btree_node_accessed(b))
+ set_btree_node_accessed(b);
+
+ if (unlikely(btree_node_read_error(b))) {
+ six_unlock_read(&b->c.lock);
+ b = ERR_PTR(-EIO);
+ goto out;
+ }
+
+ EBUG_ON(b->c.btree_id != btree_id);
+ EBUG_ON(BTREE_NODE_LEVEL(b->data) != level);
+ btree_check_header(c, b);
+out:
+ bch2_btree_cache_cannibalize_unlock(c);
+ return b;
+}
+
+int bch2_btree_node_prefetch(struct btree_trans *trans,
+ struct btree_path *path,
+ const struct bkey_i *k,
+ enum btree_id btree_id, unsigned level)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+
+ BUG_ON(trans && !btree_node_locked(path, level + 1));
+ BUG_ON(level >= BTREE_MAX_DEPTH);
+
+ b = btree_cache_find(bc, k);
+ if (b)
+ return 0;
+
+ b = bch2_btree_node_fill(trans, path, k, btree_id,
+ level, SIX_LOCK_read, false);
+ return PTR_ERR_OR_ZERO(b);
+}
+
+void bch2_btree_node_evict(struct btree_trans *trans, const struct bkey_i *k)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_cache *bc = &c->btree_cache;
+ struct btree *b;
+
+ b = btree_cache_find(bc, k);
+ if (!b)
+ return;
+wait_on_io:
+ /* not allowed to wait on io with btree locks held: */
+
+ /* XXX we're called from btree_gc which will be holding other btree
+ * nodes locked
+ */
+ __bch2_btree_node_wait_on_read(b);
+ __bch2_btree_node_wait_on_write(b);
+
+ btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_intent);
+ btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_write);
+
+ if (btree_node_dirty(b)) {
+ __bch2_btree_node_write(c, b, BTREE_WRITE_cache_reclaim);
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+ goto wait_on_io;
+ }
+
+ BUG_ON(btree_node_dirty(b));
+
+ mutex_lock(&bc->lock);
+ btree_node_data_free(c, b);
+ bch2_btree_node_hash_remove(bc, b);
+ mutex_unlock(&bc->lock);
+
+ six_unlock_write(&b->c.lock);
+ six_unlock_intent(&b->c.lock);
+}
+
+void bch2_btree_node_to_text(struct printbuf *out, struct bch_fs *c,
+ const struct btree *b)
+{
+ struct bset_stats stats;
+
+ memset(&stats, 0, sizeof(stats));
+
+ bch2_btree_keys_stats(b, &stats);
+
+ prt_printf(out, "l %u ", b->c.level);
+ bch2_bpos_to_text(out, b->data->min_key);
+ prt_printf(out, " - ");
+ bch2_bpos_to_text(out, b->data->max_key);
+ prt_printf(out, ":\n"
+ " ptrs: ");
+ bch2_val_to_text(out, c, bkey_i_to_s_c(&b->key));
+ prt_newline(out);
+
+ prt_printf(out,
+ " format: ");
+ bch2_bkey_format_to_text(out, &b->format);
+
+ prt_printf(out,
+ " unpack fn len: %u\n"
+ " bytes used %zu/%zu (%zu%% full)\n"
+ " sib u64s: %u, %u (merge threshold %u)\n"
+ " nr packed keys %u\n"
+ " nr unpacked keys %u\n"
+ " floats %zu\n"
+ " failed unpacked %zu\n",
+ b->unpack_fn_len,
+ b->nr.live_u64s * sizeof(u64),
+ btree_bytes(c) - sizeof(struct btree_node),
+ b->nr.live_u64s * 100 / btree_max_u64s(c),
+ b->sib_u64s[0],
+ b->sib_u64s[1],
+ c->btree_foreground_merge_threshold,
+ b->nr.packed_keys,
+ b->nr.unpacked_keys,
+ stats.floats,
+ stats.failed);
+}
+
+void bch2_btree_cache_to_text(struct printbuf *out, const struct bch_fs *c)
+{
+ prt_printf(out, "nr nodes:\t\t%u\n", c->btree_cache.used);
+ prt_printf(out, "nr dirty:\t\t%u\n", atomic_read(&c->btree_cache.dirty));
+ prt_printf(out, "cannibalize lock:\t%p\n", c->btree_cache.alloc_lock);
+}
--- /dev/null
- struct bch_fs *c = container_of(shrink, struct bch_fs,
- btree_key_cache.shrink);
+// SPDX-License-Identifier: GPL-2.0
+
+#include "bcachefs.h"
+#include "btree_cache.h"
+#include "btree_iter.h"
+#include "btree_key_cache.h"
+#include "btree_locking.h"
+#include "btree_update.h"
+#include "errcode.h"
+#include "error.h"
+#include "journal.h"
+#include "journal_reclaim.h"
+#include "trace.h"
+
+#include <linux/sched/mm.h>
+
+static inline bool btree_uses_pcpu_readers(enum btree_id id)
+{
+ return id == BTREE_ID_subvolumes;
+}
+
+static struct kmem_cache *bch2_key_cache;
+
+static int bch2_btree_key_cache_cmp_fn(struct rhashtable_compare_arg *arg,
+ const void *obj)
+{
+ const struct bkey_cached *ck = obj;
+ const struct bkey_cached_key *key = arg->key;
+
+ return ck->key.btree_id != key->btree_id ||
+ !bpos_eq(ck->key.pos, key->pos);
+}
+
+static const struct rhashtable_params bch2_btree_key_cache_params = {
+ .head_offset = offsetof(struct bkey_cached, hash),
+ .key_offset = offsetof(struct bkey_cached, key),
+ .key_len = sizeof(struct bkey_cached_key),
+ .obj_cmpfn = bch2_btree_key_cache_cmp_fn,
+};
+
+__flatten
+inline struct bkey_cached *
+bch2_btree_key_cache_find(struct bch_fs *c, enum btree_id btree_id, struct bpos pos)
+{
+ struct bkey_cached_key key = {
+ .btree_id = btree_id,
+ .pos = pos,
+ };
+
+ return rhashtable_lookup_fast(&c->btree_key_cache.table, &key,
+ bch2_btree_key_cache_params);
+}
+
+static bool bkey_cached_lock_for_evict(struct bkey_cached *ck)
+{
+ if (!six_trylock_intent(&ck->c.lock))
+ return false;
+
+ if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ six_unlock_intent(&ck->c.lock);
+ return false;
+ }
+
+ if (!six_trylock_write(&ck->c.lock)) {
+ six_unlock_intent(&ck->c.lock);
+ return false;
+ }
+
+ return true;
+}
+
+static void bkey_cached_evict(struct btree_key_cache *c,
+ struct bkey_cached *ck)
+{
+ BUG_ON(rhashtable_remove_fast(&c->table, &ck->hash,
+ bch2_btree_key_cache_params));
+ memset(&ck->key, ~0, sizeof(ck->key));
+
+ atomic_long_dec(&c->nr_keys);
+}
+
+static void bkey_cached_free(struct btree_key_cache *bc,
+ struct bkey_cached *ck)
+{
+ struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
+
+ BUG_ON(test_bit(BKEY_CACHED_DIRTY, &ck->flags));
+
+ ck->btree_trans_barrier_seq =
+ start_poll_synchronize_srcu(&c->btree_trans_barrier);
+
+ if (ck->c.lock.readers)
+ list_move_tail(&ck->list, &bc->freed_pcpu);
+ else
+ list_move_tail(&ck->list, &bc->freed_nonpcpu);
+ atomic_long_inc(&bc->nr_freed);
+
+ kfree(ck->k);
+ ck->k = NULL;
+ ck->u64s = 0;
+
+ six_unlock_write(&ck->c.lock);
+ six_unlock_intent(&ck->c.lock);
+}
+
+#ifdef __KERNEL__
+static void __bkey_cached_move_to_freelist_ordered(struct btree_key_cache *bc,
+ struct bkey_cached *ck)
+{
+ struct bkey_cached *pos;
+
+ list_for_each_entry_reverse(pos, &bc->freed_nonpcpu, list) {
+ if (ULONG_CMP_GE(ck->btree_trans_barrier_seq,
+ pos->btree_trans_barrier_seq)) {
+ list_move(&ck->list, &pos->list);
+ return;
+ }
+ }
+
+ list_move(&ck->list, &bc->freed_nonpcpu);
+}
+#endif
+
+static void bkey_cached_move_to_freelist(struct btree_key_cache *bc,
+ struct bkey_cached *ck)
+{
+ BUG_ON(test_bit(BKEY_CACHED_DIRTY, &ck->flags));
+
+ if (!ck->c.lock.readers) {
+#ifdef __KERNEL__
+ struct btree_key_cache_freelist *f;
+ bool freed = false;
+
+ preempt_disable();
+ f = this_cpu_ptr(bc->pcpu_freed);
+
+ if (f->nr < ARRAY_SIZE(f->objs)) {
+ f->objs[f->nr++] = ck;
+ freed = true;
+ }
+ preempt_enable();
+
+ if (!freed) {
+ mutex_lock(&bc->lock);
+ preempt_disable();
+ f = this_cpu_ptr(bc->pcpu_freed);
+
+ while (f->nr > ARRAY_SIZE(f->objs) / 2) {
+ struct bkey_cached *ck2 = f->objs[--f->nr];
+
+ __bkey_cached_move_to_freelist_ordered(bc, ck2);
+ }
+ preempt_enable();
+
+ __bkey_cached_move_to_freelist_ordered(bc, ck);
+ mutex_unlock(&bc->lock);
+ }
+#else
+ mutex_lock(&bc->lock);
+ list_move_tail(&ck->list, &bc->freed_nonpcpu);
+ mutex_unlock(&bc->lock);
+#endif
+ } else {
+ mutex_lock(&bc->lock);
+ list_move_tail(&ck->list, &bc->freed_pcpu);
+ mutex_unlock(&bc->lock);
+ }
+}
+
+static void bkey_cached_free_fast(struct btree_key_cache *bc,
+ struct bkey_cached *ck)
+{
+ struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
+
+ ck->btree_trans_barrier_seq =
+ start_poll_synchronize_srcu(&c->btree_trans_barrier);
+
+ list_del_init(&ck->list);
+ atomic_long_inc(&bc->nr_freed);
+
+ kfree(ck->k);
+ ck->k = NULL;
+ ck->u64s = 0;
+
+ bkey_cached_move_to_freelist(bc, ck);
+
+ six_unlock_write(&ck->c.lock);
+ six_unlock_intent(&ck->c.lock);
+}
+
+static struct bkey_cached *
+bkey_cached_alloc(struct btree_trans *trans, struct btree_path *path,
+ bool *was_new)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_key_cache *bc = &c->btree_key_cache;
+ struct bkey_cached *ck = NULL;
+ bool pcpu_readers = btree_uses_pcpu_readers(path->btree_id);
+ int ret;
+
+ if (!pcpu_readers) {
+#ifdef __KERNEL__
+ struct btree_key_cache_freelist *f;
+
+ preempt_disable();
+ f = this_cpu_ptr(bc->pcpu_freed);
+ if (f->nr)
+ ck = f->objs[--f->nr];
+ preempt_enable();
+
+ if (!ck) {
+ mutex_lock(&bc->lock);
+ preempt_disable();
+ f = this_cpu_ptr(bc->pcpu_freed);
+
+ while (!list_empty(&bc->freed_nonpcpu) &&
+ f->nr < ARRAY_SIZE(f->objs) / 2) {
+ ck = list_last_entry(&bc->freed_nonpcpu, struct bkey_cached, list);
+ list_del_init(&ck->list);
+ f->objs[f->nr++] = ck;
+ }
+
+ ck = f->nr ? f->objs[--f->nr] : NULL;
+ preempt_enable();
+ mutex_unlock(&bc->lock);
+ }
+#else
+ mutex_lock(&bc->lock);
+ if (!list_empty(&bc->freed_nonpcpu)) {
+ ck = list_last_entry(&bc->freed_nonpcpu, struct bkey_cached, list);
+ list_del_init(&ck->list);
+ }
+ mutex_unlock(&bc->lock);
+#endif
+ } else {
+ mutex_lock(&bc->lock);
+ if (!list_empty(&bc->freed_pcpu)) {
+ ck = list_last_entry(&bc->freed_pcpu, struct bkey_cached, list);
+ list_del_init(&ck->list);
+ }
+ mutex_unlock(&bc->lock);
+ }
+
+ if (ck) {
+ ret = btree_node_lock_nopath(trans, &ck->c, SIX_LOCK_intent, _THIS_IP_);
+ if (unlikely(ret)) {
+ bkey_cached_move_to_freelist(bc, ck);
+ return ERR_PTR(ret);
+ }
+
+ path->l[0].b = (void *) ck;
+ path->l[0].lock_seq = six_lock_seq(&ck->c.lock);
+ mark_btree_node_locked(trans, path, 0, BTREE_NODE_INTENT_LOCKED);
+
+ ret = bch2_btree_node_lock_write(trans, path, &ck->c);
+ if (unlikely(ret)) {
+ btree_node_unlock(trans, path, 0);
+ bkey_cached_move_to_freelist(bc, ck);
+ return ERR_PTR(ret);
+ }
+
+ return ck;
+ }
+
+ ck = allocate_dropping_locks(trans, ret,
+ kmem_cache_zalloc(bch2_key_cache, _gfp));
+ if (ret) {
+ kmem_cache_free(bch2_key_cache, ck);
+ return ERR_PTR(ret);
+ }
+
+ if (!ck)
+ return NULL;
+
+ INIT_LIST_HEAD(&ck->list);
+ bch2_btree_lock_init(&ck->c, pcpu_readers ? SIX_LOCK_INIT_PCPU : 0);
+
+ ck->c.cached = true;
+ BUG_ON(!six_trylock_intent(&ck->c.lock));
+ BUG_ON(!six_trylock_write(&ck->c.lock));
+ *was_new = true;
+ return ck;
+}
+
+static struct bkey_cached *
+bkey_cached_reuse(struct btree_key_cache *c)
+{
+ struct bucket_table *tbl;
+ struct rhash_head *pos;
+ struct bkey_cached *ck;
+ unsigned i;
+
+ mutex_lock(&c->lock);
+ rcu_read_lock();
+ tbl = rht_dereference_rcu(c->table.tbl, &c->table);
+ for (i = 0; i < tbl->size; i++)
+ rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {
+ if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags) &&
+ bkey_cached_lock_for_evict(ck)) {
+ bkey_cached_evict(c, ck);
+ goto out;
+ }
+ }
+ ck = NULL;
+out:
+ rcu_read_unlock();
+ mutex_unlock(&c->lock);
+ return ck;
+}
+
+static struct bkey_cached *
+btree_key_cache_create(struct btree_trans *trans, struct btree_path *path)
+{
+ struct bch_fs *c = trans->c;
+ struct btree_key_cache *bc = &c->btree_key_cache;
+ struct bkey_cached *ck;
+ bool was_new = false;
+
+ ck = bkey_cached_alloc(trans, path, &was_new);
+ if (IS_ERR(ck))
+ return ck;
+
+ if (unlikely(!ck)) {
+ ck = bkey_cached_reuse(bc);
+ if (unlikely(!ck)) {
+ bch_err(c, "error allocating memory for key cache item, btree %s",
+ bch2_btree_ids[path->btree_id]);
+ return ERR_PTR(-BCH_ERR_ENOMEM_btree_key_cache_create);
+ }
+
+ mark_btree_node_locked(trans, path, 0, BTREE_NODE_INTENT_LOCKED);
+ }
+
+ ck->c.level = 0;
+ ck->c.btree_id = path->btree_id;
+ ck->key.btree_id = path->btree_id;
+ ck->key.pos = path->pos;
+ ck->valid = false;
+ ck->flags = 1U << BKEY_CACHED_ACCESSED;
+
+ if (unlikely(rhashtable_lookup_insert_fast(&bc->table,
+ &ck->hash,
+ bch2_btree_key_cache_params))) {
+ /* We raced with another fill: */
+
+ if (likely(was_new)) {
+ six_unlock_write(&ck->c.lock);
+ six_unlock_intent(&ck->c.lock);
+ kfree(ck);
+ } else {
+ bkey_cached_free_fast(bc, ck);
+ }
+
+ mark_btree_node_locked(trans, path, 0, BTREE_NODE_UNLOCKED);
+ return NULL;
+ }
+
+ atomic_long_inc(&bc->nr_keys);
+
+ six_unlock_write(&ck->c.lock);
+
+ return ck;
+}
+
+static int btree_key_cache_fill(struct btree_trans *trans,
+ struct btree_path *ck_path,
+ struct bkey_cached *ck)
+{
+ struct btree_iter iter;
+ struct bkey_s_c k;
+ unsigned new_u64s = 0;
+ struct bkey_i *new_k = NULL;
+ int ret;
+
+ k = bch2_bkey_get_iter(trans, &iter, ck->key.btree_id, ck->key.pos,
+ BTREE_ITER_KEY_CACHE_FILL|
+ BTREE_ITER_CACHED_NOFILL);
+ ret = bkey_err(k);
+ if (ret)
+ goto err;
+
+ if (!bch2_btree_node_relock(trans, ck_path, 0)) {
+ trace_and_count(trans->c, trans_restart_relock_key_cache_fill, trans, _THIS_IP_, ck_path);
+ ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_fill);
+ goto err;
+ }
+
+ /*
+ * bch2_varint_decode can read past the end of the buffer by at
+ * most 7 bytes (it won't be used):
+ */
+ new_u64s = k.k->u64s + 1;
+
+ /*
+ * Allocate some extra space so that the transaction commit path is less
+ * likely to have to reallocate, since that requires a transaction
+ * restart:
+ */
+ new_u64s = min(256U, (new_u64s * 3) / 2);
+
+ if (new_u64s > ck->u64s) {
+ new_u64s = roundup_pow_of_two(new_u64s);
+ new_k = kmalloc(new_u64s * sizeof(u64), GFP_NOWAIT|__GFP_NOWARN);
+ if (!new_k) {
+ bch2_trans_unlock(trans);
+
+ new_k = kmalloc(new_u64s * sizeof(u64), GFP_KERNEL);
+ if (!new_k) {
+ bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u",
+ bch2_btree_ids[ck->key.btree_id], new_u64s);
+ ret = -BCH_ERR_ENOMEM_btree_key_cache_fill;
+ goto err;
+ }
+
+ if (!bch2_btree_node_relock(trans, ck_path, 0)) {
+ kfree(new_k);
+ trace_and_count(trans->c, trans_restart_relock_key_cache_fill, trans, _THIS_IP_, ck_path);
+ ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_fill);
+ goto err;
+ }
+
+ ret = bch2_trans_relock(trans);
+ if (ret) {
+ kfree(new_k);
+ goto err;
+ }
+ }
+ }
+
+ ret = bch2_btree_node_lock_write(trans, ck_path, &ck_path->l[0].b->c);
+ if (ret) {
+ kfree(new_k);
+ goto err;
+ }
+
+ if (new_k) {
+ kfree(ck->k);
+ ck->u64s = new_u64s;
+ ck->k = new_k;
+ }
+
+ bkey_reassemble(ck->k, k);
+ ck->valid = true;
+ bch2_btree_node_unlock_write(trans, ck_path, ck_path->l[0].b);
+
+ /* We're not likely to need this iterator again: */
+ set_btree_iter_dontneed(&iter);
+err:
+ bch2_trans_iter_exit(trans, &iter);
+ return ret;
+}
+
+static noinline int
+bch2_btree_path_traverse_cached_slowpath(struct btree_trans *trans, struct btree_path *path,
+ unsigned flags)
+{
+ struct bch_fs *c = trans->c;
+ struct bkey_cached *ck;
+ int ret = 0;
+
+ BUG_ON(path->level);
+
+ path->l[1].b = NULL;
+
+ if (bch2_btree_node_relock_notrace(trans, path, 0)) {
+ ck = (void *) path->l[0].b;
+ goto fill;
+ }
+retry:
+ ck = bch2_btree_key_cache_find(c, path->btree_id, path->pos);
+ if (!ck) {
+ ck = btree_key_cache_create(trans, path);
+ ret = PTR_ERR_OR_ZERO(ck);
+ if (ret)
+ goto err;
+ if (!ck)
+ goto retry;
+
+ mark_btree_node_locked(trans, path, 0, BTREE_NODE_INTENT_LOCKED);
+ path->locks_want = 1;
+ } else {
+ enum six_lock_type lock_want = __btree_lock_want(path, 0);
+
+ ret = btree_node_lock(trans, path, (void *) ck, 0,
+ lock_want, _THIS_IP_);
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ goto err;
+
+ BUG_ON(ret);
+
+ if (ck->key.btree_id != path->btree_id ||
+ !bpos_eq(ck->key.pos, path->pos)) {
+ six_unlock_type(&ck->c.lock, lock_want);
+ goto retry;
+ }
+
+ mark_btree_node_locked(trans, path, 0,
+ (enum btree_node_locked_type) lock_want);
+ }
+
+ path->l[0].lock_seq = six_lock_seq(&ck->c.lock);
+ path->l[0].b = (void *) ck;
+fill:
+ path->uptodate = BTREE_ITER_UPTODATE;
+
+ if (!ck->valid && !(flags & BTREE_ITER_CACHED_NOFILL)) {
+ /*
+ * Using the underscore version because we haven't set
+ * path->uptodate yet:
+ */
+ if (!path->locks_want &&
+ !__bch2_btree_path_upgrade(trans, path, 1)) {
+ trace_and_count(trans->c, trans_restart_key_cache_upgrade, trans, _THIS_IP_);
+ ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_upgrade);
+ goto err;
+ }
+
+ ret = btree_key_cache_fill(trans, path, ck);
+ if (ret)
+ goto err;
+
+ ret = bch2_btree_path_relock(trans, path, _THIS_IP_);
+ if (ret)
+ goto err;
+
+ path->uptodate = BTREE_ITER_UPTODATE;
+ }
+
+ if (!test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
+ set_bit(BKEY_CACHED_ACCESSED, &ck->flags);
+
+ BUG_ON(btree_node_locked_type(path, 0) != btree_lock_want(path, 0));
+ BUG_ON(path->uptodate);
+
+ return ret;
+err:
+ path->uptodate = BTREE_ITER_NEED_TRAVERSE;
+ if (!bch2_err_matches(ret, BCH_ERR_transaction_restart)) {
+ btree_node_unlock(trans, path, 0);
+ path->l[0].b = ERR_PTR(ret);
+ }
+ return ret;
+}
+
+int bch2_btree_path_traverse_cached(struct btree_trans *trans, struct btree_path *path,
+ unsigned flags)
+{
+ struct bch_fs *c = trans->c;
+ struct bkey_cached *ck;
+ int ret = 0;
+
+ EBUG_ON(path->level);
+
+ path->l[1].b = NULL;
+
+ if (bch2_btree_node_relock_notrace(trans, path, 0)) {
+ ck = (void *) path->l[0].b;
+ goto fill;
+ }
+retry:
+ ck = bch2_btree_key_cache_find(c, path->btree_id, path->pos);
+ if (!ck) {
+ return bch2_btree_path_traverse_cached_slowpath(trans, path, flags);
+ } else {
+ enum six_lock_type lock_want = __btree_lock_want(path, 0);
+
+ ret = btree_node_lock(trans, path, (void *) ck, 0,
+ lock_want, _THIS_IP_);
+ EBUG_ON(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart));
+
+ if (ret)
+ return ret;
+
+ if (ck->key.btree_id != path->btree_id ||
+ !bpos_eq(ck->key.pos, path->pos)) {
+ six_unlock_type(&ck->c.lock, lock_want);
+ goto retry;
+ }
+
+ mark_btree_node_locked(trans, path, 0,
+ (enum btree_node_locked_type) lock_want);
+ }
+
+ path->l[0].lock_seq = six_lock_seq(&ck->c.lock);
+ path->l[0].b = (void *) ck;
+fill:
+ if (!ck->valid)
+ return bch2_btree_path_traverse_cached_slowpath(trans, path, flags);
+
+ if (!test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
+ set_bit(BKEY_CACHED_ACCESSED, &ck->flags);
+
+ path->uptodate = BTREE_ITER_UPTODATE;
+ EBUG_ON(!ck->valid);
+ EBUG_ON(btree_node_locked_type(path, 0) != btree_lock_want(path, 0));
+
+ return ret;
+}
+
+static int btree_key_cache_flush_pos(struct btree_trans *trans,
+ struct bkey_cached_key key,
+ u64 journal_seq,
+ unsigned commit_flags,
+ bool evict)
+{
+ struct bch_fs *c = trans->c;
+ struct journal *j = &c->journal;
+ struct btree_iter c_iter, b_iter;
+ struct bkey_cached *ck = NULL;
+ int ret;
+
+ bch2_trans_iter_init(trans, &b_iter, key.btree_id, key.pos,
+ BTREE_ITER_SLOTS|
+ BTREE_ITER_INTENT|
+ BTREE_ITER_ALL_SNAPSHOTS);
+ bch2_trans_iter_init(trans, &c_iter, key.btree_id, key.pos,
+ BTREE_ITER_CACHED|
+ BTREE_ITER_INTENT);
+ b_iter.flags &= ~BTREE_ITER_WITH_KEY_CACHE;
+
+ ret = bch2_btree_iter_traverse(&c_iter);
+ if (ret)
+ goto out;
+
+ ck = (void *) c_iter.path->l[0].b;
+ if (!ck)
+ goto out;
+
+ if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ if (evict)
+ goto evict;
+ goto out;
+ }
+
+ BUG_ON(!ck->valid);
+
+ if (journal_seq && ck->journal.seq != journal_seq)
+ goto out;
+
+ /*
+ * Since journal reclaim depends on us making progress here, and the
+ * allocator/copygc depend on journal reclaim making progress, we need
+ * to be using alloc reserves:
+ */
+ ret = bch2_btree_iter_traverse(&b_iter) ?:
+ bch2_trans_update(trans, &b_iter, ck->k,
+ BTREE_UPDATE_KEY_CACHE_RECLAIM|
+ BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE|
+ BTREE_TRIGGER_NORUN) ?:
+ bch2_trans_commit(trans, NULL, NULL,
+ BTREE_INSERT_NOCHECK_RW|
+ BTREE_INSERT_NOFAIL|
+ (ck->journal.seq == journal_last_seq(j)
+ ? BCH_WATERMARK_reclaim
+ : 0)|
+ commit_flags);
+
+ bch2_fs_fatal_err_on(ret &&
+ !bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
+ !bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
+ !bch2_journal_error(j), c,
+ "error flushing key cache: %s", bch2_err_str(ret));
+ if (ret)
+ goto out;
+
+ bch2_journal_pin_drop(j, &ck->journal);
+ bch2_journal_preres_put(j, &ck->res);
+
+ BUG_ON(!btree_node_locked(c_iter.path, 0));
+
+ if (!evict) {
+ if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
+ atomic_long_dec(&c->btree_key_cache.nr_dirty);
+ }
+ } else {
+ struct btree_path *path2;
+evict:
+ trans_for_each_path(trans, path2)
+ if (path2 != c_iter.path)
+ __bch2_btree_path_unlock(trans, path2);
+
+ bch2_btree_node_lock_write_nofail(trans, c_iter.path, &ck->c);
+
+ if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
+ atomic_long_dec(&c->btree_key_cache.nr_dirty);
+ }
+
+ mark_btree_node_locked_noreset(c_iter.path, 0, BTREE_NODE_UNLOCKED);
+ bkey_cached_evict(&c->btree_key_cache, ck);
+ bkey_cached_free_fast(&c->btree_key_cache, ck);
+ }
+out:
+ bch2_trans_iter_exit(trans, &b_iter);
+ bch2_trans_iter_exit(trans, &c_iter);
+ return ret;
+}
+
+int bch2_btree_key_cache_journal_flush(struct journal *j,
+ struct journal_entry_pin *pin, u64 seq)
+{
+ struct bch_fs *c = container_of(j, struct bch_fs, journal);
+ struct bkey_cached *ck =
+ container_of(pin, struct bkey_cached, journal);
+ struct bkey_cached_key key;
+ struct btree_trans *trans = bch2_trans_get(c);
+ int srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
+ int ret = 0;
+
+ btree_node_lock_nopath_nofail(trans, &ck->c, SIX_LOCK_read);
+ key = ck->key;
+
+ if (ck->journal.seq != seq ||
+ !test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ six_unlock_read(&ck->c.lock);
+ goto unlock;
+ }
+
+ if (ck->seq != seq) {
+ bch2_journal_pin_update(&c->journal, ck->seq, &ck->journal,
+ bch2_btree_key_cache_journal_flush);
+ six_unlock_read(&ck->c.lock);
+ goto unlock;
+ }
+ six_unlock_read(&ck->c.lock);
+
+ ret = commit_do(trans, NULL, NULL, 0,
+ btree_key_cache_flush_pos(trans, key, seq,
+ BTREE_INSERT_JOURNAL_RECLAIM, false));
+unlock:
+ srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
+
+ bch2_trans_put(trans);
+ return ret;
+}
+
+/*
+ * Flush and evict a key from the key cache:
+ */
+int bch2_btree_key_cache_flush(struct btree_trans *trans,
+ enum btree_id id, struct bpos pos)
+{
+ struct bch_fs *c = trans->c;
+ struct bkey_cached_key key = { id, pos };
+
+ /* Fastpath - assume it won't be found: */
+ if (!bch2_btree_key_cache_find(c, id, pos))
+ return 0;
+
+ return btree_key_cache_flush_pos(trans, key, 0, 0, true);
+}
+
+bool bch2_btree_insert_key_cached(struct btree_trans *trans,
+ unsigned flags,
+ struct btree_insert_entry *insert_entry)
+{
+ struct bch_fs *c = trans->c;
+ struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
+ struct bkey_i *insert = insert_entry->k;
+ bool kick_reclaim = false;
+
+ BUG_ON(insert->k.u64s > ck->u64s);
+
+ if (likely(!(flags & BTREE_INSERT_JOURNAL_REPLAY))) {
+ int difference;
+
+ BUG_ON(jset_u64s(insert->k.u64s) > trans->journal_preres.u64s);
+
+ difference = jset_u64s(insert->k.u64s) - ck->res.u64s;
+ if (difference > 0) {
+ trans->journal_preres.u64s -= difference;
+ ck->res.u64s += difference;
+ }
+ }
+
+ bkey_copy(ck->k, insert);
+ ck->valid = true;
+
+ if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ EBUG_ON(test_bit(BCH_FS_CLEAN_SHUTDOWN, &c->flags));
+ set_bit(BKEY_CACHED_DIRTY, &ck->flags);
+ atomic_long_inc(&c->btree_key_cache.nr_dirty);
+
+ if (bch2_nr_btree_keys_need_flush(c))
+ kick_reclaim = true;
+ }
+
+ /*
+ * To minimize lock contention, we only add the journal pin here and
+ * defer pin updates to the flush callback via ->seq. Be careful not to
+ * update ->seq on nojournal commits because we don't want to update the
+ * pin to a seq that doesn't include journal updates on disk. Otherwise
+ * we risk losing the update after a crash.
+ *
+ * The only exception is if the pin is not active in the first place. We
+ * have to add the pin because journal reclaim drives key cache
+ * flushing. The flush callback will not proceed unless ->seq matches
+ * the latest pin, so make sure it starts with a consistent value.
+ */
+ if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
+ !journal_pin_active(&ck->journal)) {
+ ck->seq = trans->journal_res.seq;
+ }
+ bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
+ &ck->journal, bch2_btree_key_cache_journal_flush);
+
+ if (kick_reclaim)
+ journal_reclaim_kick(&c->journal);
+ return true;
+}
+
+void bch2_btree_key_cache_drop(struct btree_trans *trans,
+ struct btree_path *path)
+{
+ struct bch_fs *c = trans->c;
+ struct bkey_cached *ck = (void *) path->l[0].b;
+
+ BUG_ON(!ck->valid);
+
+ /*
+ * We just did an update to the btree, bypassing the key cache: the key
+ * cache key is now stale and must be dropped, even if dirty:
+ */
+ if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
+ clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
+ atomic_long_dec(&c->btree_key_cache.nr_dirty);
+ bch2_journal_pin_drop(&c->journal, &ck->journal);
+ }
+
+ ck->valid = false;
+}
+
+static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
- struct bch_fs *c = container_of(shrink, struct bch_fs,
- btree_key_cache.shrink);
++ struct bch_fs *c = shrink->private_data;
+ struct btree_key_cache *bc = &c->btree_key_cache;
+ struct bucket_table *tbl;
+ struct bkey_cached *ck, *t;
+ size_t scanned = 0, freed = 0, nr = sc->nr_to_scan;
+ unsigned start, flags;
+ int srcu_idx;
+
+ mutex_lock(&bc->lock);
+ srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
+ flags = memalloc_nofs_save();
+
+ /*
+ * Newest freed entries are at the end of the list - once we hit one
+ * that's too new to be freed, we can bail out:
+ */
+ list_for_each_entry_safe(ck, t, &bc->freed_nonpcpu, list) {
+ if (!poll_state_synchronize_srcu(&c->btree_trans_barrier,
+ ck->btree_trans_barrier_seq))
+ break;
+
+ list_del(&ck->list);
+ six_lock_exit(&ck->c.lock);
+ kmem_cache_free(bch2_key_cache, ck);
+ atomic_long_dec(&bc->nr_freed);
+ scanned++;
+ freed++;
+ }
+
+ if (scanned >= nr)
+ goto out;
+
+ list_for_each_entry_safe(ck, t, &bc->freed_pcpu, list) {
+ if (!poll_state_synchronize_srcu(&c->btree_trans_barrier,
+ ck->btree_trans_barrier_seq))
+ break;
+
+ list_del(&ck->list);
+ six_lock_exit(&ck->c.lock);
+ kmem_cache_free(bch2_key_cache, ck);
+ atomic_long_dec(&bc->nr_freed);
+ scanned++;
+ freed++;
+ }
+
+ if (scanned >= nr)
+ goto out;
+
+ rcu_read_lock();
+ tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
+ if (bc->shrink_iter >= tbl->size)
+ bc->shrink_iter = 0;
+ start = bc->shrink_iter;
+
+ do {
+ struct rhash_head *pos, *next;
+
+ pos = rht_ptr_rcu(rht_bucket(tbl, bc->shrink_iter));
+
+ while (!rht_is_a_nulls(pos)) {
+ next = rht_dereference_bucket_rcu(pos->next, tbl, bc->shrink_iter);
+ ck = container_of(pos, struct bkey_cached, hash);
+
+ if (test_bit(BKEY_CACHED_DIRTY, &ck->flags))
+ goto next;
+
+ if (test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
+ clear_bit(BKEY_CACHED_ACCESSED, &ck->flags);
+ else if (bkey_cached_lock_for_evict(ck)) {
+ bkey_cached_evict(bc, ck);
+ bkey_cached_free(bc, ck);
+ }
+
+ scanned++;
+ if (scanned >= nr)
+ break;
+next:
+ pos = next;
+ }
+
+ bc->shrink_iter++;
+ if (bc->shrink_iter >= tbl->size)
+ bc->shrink_iter = 0;
+ } while (scanned < nr && bc->shrink_iter != start);
+
+ rcu_read_unlock();
+out:
+ memalloc_nofs_restore(flags);
+ srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
+ mutex_unlock(&bc->lock);
+
+ return freed;
+}
+
+static unsigned long bch2_btree_key_cache_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
- unregister_shrinker(&bc->shrink);
++ struct bch_fs *c = shrink->private_data;
+ struct btree_key_cache *bc = &c->btree_key_cache;
+ long nr = atomic_long_read(&bc->nr_keys) -
+ atomic_long_read(&bc->nr_dirty);
+
+ return max(0L, nr);
+}
+
+void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
+{
+ struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
+ struct bucket_table *tbl;
+ struct bkey_cached *ck, *n;
+ struct rhash_head *pos;
+ LIST_HEAD(items);
+ unsigned i;
+#ifdef __KERNEL__
+ int cpu;
+#endif
+
- bc->shrink.seeks = 0;
- bc->shrink.count_objects = bch2_btree_key_cache_count;
- bc->shrink.scan_objects = bch2_btree_key_cache_scan;
- if (register_shrinker(&bc->shrink, "%s/btree_key_cache", c->name))
++ shrinker_free(bc->shrink);
+
+ mutex_lock(&bc->lock);
+
+ /*
+ * The loop is needed to guard against racing with rehash:
+ */
+ while (atomic_long_read(&bc->nr_keys)) {
+ rcu_read_lock();
+ tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
+ if (tbl)
+ for (i = 0; i < tbl->size; i++)
+ rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {
+ bkey_cached_evict(bc, ck);
+ list_add(&ck->list, &items);
+ }
+ rcu_read_unlock();
+ }
+
+#ifdef __KERNEL__
+ for_each_possible_cpu(cpu) {
+ struct btree_key_cache_freelist *f =
+ per_cpu_ptr(bc->pcpu_freed, cpu);
+
+ for (i = 0; i < f->nr; i++) {
+ ck = f->objs[i];
+ list_add(&ck->list, &items);
+ }
+ }
+#endif
+
+ list_splice(&bc->freed_pcpu, &items);
+ list_splice(&bc->freed_nonpcpu, &items);
+
+ mutex_unlock(&bc->lock);
+
+ list_for_each_entry_safe(ck, n, &items, list) {
+ cond_resched();
+
+ bch2_journal_pin_drop(&c->journal, &ck->journal);
+ bch2_journal_preres_put(&c->journal, &ck->res);
+
+ list_del(&ck->list);
+ kfree(ck->k);
+ six_lock_exit(&ck->c.lock);
+ kmem_cache_free(bch2_key_cache, ck);
+ }
+
+ if (atomic_long_read(&bc->nr_dirty) &&
+ !bch2_journal_error(&c->journal) &&
+ test_bit(BCH_FS_WAS_RW, &c->flags))
+ panic("btree key cache shutdown error: nr_dirty nonzero (%li)\n",
+ atomic_long_read(&bc->nr_dirty));
+
+ if (atomic_long_read(&bc->nr_keys))
+ panic("btree key cache shutdown error: nr_keys nonzero (%li)\n",
+ atomic_long_read(&bc->nr_keys));
+
+ if (bc->table_init_done)
+ rhashtable_destroy(&bc->table);
+
+ free_percpu(bc->pcpu_freed);
+}
+
+void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
+{
+ mutex_init(&c->lock);
+ INIT_LIST_HEAD(&c->freed_pcpu);
+ INIT_LIST_HEAD(&c->freed_nonpcpu);
+}
+
+int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
+{
+ struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
++ struct shrinker *shrink;
+
+#ifdef __KERNEL__
+ bc->pcpu_freed = alloc_percpu(struct btree_key_cache_freelist);
+ if (!bc->pcpu_freed)
+ return -BCH_ERR_ENOMEM_fs_btree_cache_init;
+#endif
+
+ if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
+ return -BCH_ERR_ENOMEM_fs_btree_cache_init;
+
+ bc->table_init_done = true;
+
++ shrink = shrinker_alloc(0, "%s/btree_key_cache", c->name);
++ if (!shrink)
+ return -BCH_ERR_ENOMEM_fs_btree_cache_init;
++ bc->shrink = shrink;
++ shrink->seeks = 0;
++ shrink->count_objects = bch2_btree_key_cache_count;
++ shrink->scan_objects = bch2_btree_key_cache_scan;
++ shrink->private_data = c;
++ shrinker_register(shrink);
+ return 0;
+}
+
+void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *c)
+{
+ prt_printf(out, "nr_freed:\t%lu", atomic_long_read(&c->nr_freed));
+ prt_newline(out);
+ prt_printf(out, "nr_keys:\t%lu", atomic_long_read(&c->nr_keys));
+ prt_newline(out);
+ prt_printf(out, "nr_dirty:\t%lu", atomic_long_read(&c->nr_dirty));
+ prt_newline(out);
+}
+
+void bch2_btree_key_cache_exit(void)
+{
+ kmem_cache_destroy(bch2_key_cache);
+}
+
+int __init bch2_btree_key_cache_init(void)
+{
+ bch2_key_cache = KMEM_CACHE(bkey_cached, SLAB_RECLAIM_ACCOUNT);
+ if (!bch2_key_cache)
+ return -ENOMEM;
+
+ return 0;
+}
--- /dev/null
- struct shrinker shrink;
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_BTREE_TYPES_H
+#define _BCACHEFS_BTREE_TYPES_H
+
+#include <linux/list.h>
+#include <linux/rhashtable.h>
+
+//#include "bkey_methods.h"
+#include "buckets_types.h"
+#include "darray.h"
+#include "errcode.h"
+#include "journal_types.h"
+#include "replicas_types.h"
+#include "six.h"
+
+struct open_bucket;
+struct btree_update;
+struct btree_trans;
+
+#define MAX_BSETS 3U
+
+struct btree_nr_keys {
+
+ /*
+ * Amount of live metadata (i.e. size of node after a compaction) in
+ * units of u64s
+ */
+ u16 live_u64s;
+ u16 bset_u64s[MAX_BSETS];
+
+ /* live keys only: */
+ u16 packed_keys;
+ u16 unpacked_keys;
+};
+
+struct bset_tree {
+ /*
+ * We construct a binary tree in an array as if the array
+ * started at 1, so that things line up on the same cachelines
+ * better: see comments in bset.c at cacheline_to_bkey() for
+ * details
+ */
+
+ /* size of the binary tree and prev array */
+ u16 size;
+
+ /* function of size - precalculated for to_inorder() */
+ u16 extra;
+
+ u16 data_offset;
+ u16 aux_data_offset;
+ u16 end_offset;
+};
+
+struct btree_write {
+ struct journal_entry_pin journal;
+};
+
+struct btree_alloc {
+ struct open_buckets ob;
+ __BKEY_PADDED(k, BKEY_BTREE_PTR_VAL_U64s_MAX);
+};
+
+struct btree_bkey_cached_common {
+ struct six_lock lock;
+ u8 level;
+ u8 btree_id;
+ bool cached;
+};
+
+struct btree {
+ struct btree_bkey_cached_common c;
+
+ struct rhash_head hash;
+ u64 hash_val;
+
+ unsigned long flags;
+ u16 written;
+ u8 nsets;
+ u8 nr_key_bits;
+ u16 version_ondisk;
+
+ struct bkey_format format;
+
+ struct btree_node *data;
+ void *aux_data;
+
+ /*
+ * Sets of sorted keys - the real btree node - plus a binary search tree
+ *
+ * set[0] is special; set[0]->tree, set[0]->prev and set[0]->data point
+ * to the memory we have allocated for this btree node. Additionally,
+ * set[0]->data points to the entire btree node as it exists on disk.
+ */
+ struct bset_tree set[MAX_BSETS];
+
+ struct btree_nr_keys nr;
+ u16 sib_u64s[2];
+ u16 whiteout_u64s;
+ u8 byte_order;
+ u8 unpack_fn_len;
+
+ struct btree_write writes[2];
+
+ /* Key/pointer for this btree node */
+ __BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
+
+ /*
+ * XXX: add a delete sequence number, so when bch2_btree_node_relock()
+ * fails because the lock sequence number has changed - i.e. the
+ * contents were modified - we can still relock the node if it's still
+ * the one we want, without redoing the traversal
+ */
+
+ /*
+ * For asynchronous splits/interior node updates:
+ * When we do a split, we allocate new child nodes and update the parent
+ * node to point to them: we update the parent in memory immediately,
+ * but then we must wait until the children have been written out before
+ * the update to the parent can be written - this is a list of the
+ * btree_updates that are blocking this node from being
+ * written:
+ */
+ struct list_head write_blocked;
+
+ /*
+ * Also for asynchronous splits/interior node updates:
+ * If a btree node isn't reachable yet, we don't want to kick off
+ * another write - because that write also won't yet be reachable and
+ * marking it as completed before it's reachable would be incorrect:
+ */
+ unsigned long will_make_reachable;
+
+ struct open_buckets ob;
+
+ /* lru list */
+ struct list_head list;
+};
+
+struct btree_cache {
+ struct rhashtable table;
+ bool table_init_done;
+ /*
+ * We never free a struct btree, except on shutdown - we just put it on
+ * the btree_cache_freed list and reuse it later. This simplifies the
+ * code, and it doesn't cost us much memory as the memory usage is
+ * dominated by buffers that hold the actual btree node data and those
+ * can be freed - and the number of struct btrees allocated is
+ * effectively bounded.
+ *
+ * btree_cache_freeable effectively is a small cache - we use it because
+ * high order page allocations can be rather expensive, and it's quite
+ * common to delete and allocate btree nodes in quick succession. It
+ * should never grow past ~2-3 nodes in practice.
+ */
+ struct mutex lock;
+ struct list_head live;
+ struct list_head freeable;
+ struct list_head freed_pcpu;
+ struct list_head freed_nonpcpu;
+
+ /* Number of elements in live + freeable lists */
+ unsigned used;
+ unsigned reserve;
+ atomic_t dirty;
- struct shrinker shrink;
++ struct shrinker *shrink;
+
+ /*
+ * If we need to allocate memory for a new btree node and that
+ * allocation fails, we can cannibalize another node in the btree cache
+ * to satisfy the allocation - lock to guarantee only one thread does
+ * this at a time:
+ */
+ struct task_struct *alloc_lock;
+ struct closure_waitlist alloc_wait;
+};
+
+struct btree_node_iter {
+ struct btree_node_iter_set {
+ u16 k, end;
+ } data[MAX_BSETS];
+};
+
+/*
+ * Iterate over all possible positions, synthesizing deleted keys for holes:
+ */
+static const __maybe_unused u16 BTREE_ITER_SLOTS = 1 << 0;
+static const __maybe_unused u16 BTREE_ITER_ALL_LEVELS = 1 << 1;
+/*
+ * Indicates that intent locks should be taken on leaf nodes, because we expect
+ * to be doing updates:
+ */
+static const __maybe_unused u16 BTREE_ITER_INTENT = 1 << 2;
+/*
+ * Causes the btree iterator code to prefetch additional btree nodes from disk:
+ */
+static const __maybe_unused u16 BTREE_ITER_PREFETCH = 1 << 3;
+/*
+ * Used in bch2_btree_iter_traverse(), to indicate whether we're searching for
+ * @pos or the first key strictly greater than @pos
+ */
+static const __maybe_unused u16 BTREE_ITER_IS_EXTENTS = 1 << 4;
+static const __maybe_unused u16 BTREE_ITER_NOT_EXTENTS = 1 << 5;
+static const __maybe_unused u16 BTREE_ITER_CACHED = 1 << 6;
+static const __maybe_unused u16 BTREE_ITER_WITH_KEY_CACHE = 1 << 7;
+static const __maybe_unused u16 BTREE_ITER_WITH_UPDATES = 1 << 8;
+static const __maybe_unused u16 BTREE_ITER_WITH_JOURNAL = 1 << 9;
+static const __maybe_unused u16 __BTREE_ITER_ALL_SNAPSHOTS = 1 << 10;
+static const __maybe_unused u16 BTREE_ITER_ALL_SNAPSHOTS = 1 << 11;
+static const __maybe_unused u16 BTREE_ITER_FILTER_SNAPSHOTS = 1 << 12;
+static const __maybe_unused u16 BTREE_ITER_NOPRESERVE = 1 << 13;
+static const __maybe_unused u16 BTREE_ITER_CACHED_NOFILL = 1 << 14;
+static const __maybe_unused u16 BTREE_ITER_KEY_CACHE_FILL = 1 << 15;
+#define __BTREE_ITER_FLAGS_END 16
+
+enum btree_path_uptodate {
+ BTREE_ITER_UPTODATE = 0,
+ BTREE_ITER_NEED_RELOCK = 1,
+ BTREE_ITER_NEED_TRAVERSE = 2,
+};
+
+#if defined(CONFIG_BCACHEFS_LOCK_TIME_STATS) || defined(CONFIG_BCACHEFS_DEBUG)
+#define TRACK_PATH_ALLOCATED
+#endif
+
+struct btree_path {
+ u8 idx;
+ u8 sorted_idx;
+ u8 ref;
+ u8 intent_ref;
+
+ /* btree_iter_copy starts here: */
+ struct bpos pos;
+
+ enum btree_id btree_id:5;
+ bool cached:1;
+ bool preserve:1;
+ enum btree_path_uptodate uptodate:2;
+ /*
+ * When true, failing to relock this path will cause the transaction to
+ * restart:
+ */
+ bool should_be_locked:1;
+ unsigned level:3,
+ locks_want:3;
+ u8 nodes_locked;
+
+ struct btree_path_level {
+ struct btree *b;
+ struct btree_node_iter iter;
+ u32 lock_seq;
+#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
+ u64 lock_taken_time;
+#endif
+ } l[BTREE_MAX_DEPTH];
+#ifdef TRACK_PATH_ALLOCATED
+ unsigned long ip_allocated;
+#endif
+};
+
+static inline struct btree_path_level *path_l(struct btree_path *path)
+{
+ return path->l + path->level;
+}
+
+static inline unsigned long btree_path_ip_allocated(struct btree_path *path)
+{
+#ifdef TRACK_PATH_ALLOCATED
+ return path->ip_allocated;
+#else
+ return _THIS_IP_;
+#endif
+}
+
+/*
+ * @pos - iterator's current position
+ * @level - current btree depth
+ * @locks_want - btree level below which we start taking intent locks
+ * @nodes_locked - bitmask indicating which nodes in @nodes are locked
+ * @nodes_intent_locked - bitmask indicating which locks are intent locks
+ */
+struct btree_iter {
+ struct btree_trans *trans;
+ struct btree_path *path;
+ struct btree_path *update_path;
+ struct btree_path *key_cache_path;
+
+ enum btree_id btree_id:8;
+ unsigned min_depth:3;
+ unsigned advanced:1;
+
+ /* btree_iter_copy starts here: */
+ u16 flags;
+
+ /* When we're filtering by snapshot, the snapshot ID we're looking for: */
+ unsigned snapshot;
+
+ struct bpos pos;
+ /*
+ * Current unpacked key - so that bch2_btree_iter_next()/
+ * bch2_btree_iter_next_slot() can correctly advance pos.
+ */
+ struct bkey k;
+
+ /* BTREE_ITER_WITH_JOURNAL: */
+ size_t journal_idx;
+ struct bpos journal_pos;
+#ifdef TRACK_PATH_ALLOCATED
+ unsigned long ip_allocated;
+#endif
+};
+
+struct btree_key_cache_freelist {
+ struct bkey_cached *objs[16];
+ unsigned nr;
+};
+
+struct btree_key_cache {
+ struct mutex lock;
+ struct rhashtable table;
+ bool table_init_done;
+ struct list_head freed_pcpu;
+ struct list_head freed_nonpcpu;
++ struct shrinker *shrink;
+ unsigned shrink_iter;
+ struct btree_key_cache_freelist __percpu *pcpu_freed;
+
+ atomic_long_t nr_freed;
+ atomic_long_t nr_keys;
+ atomic_long_t nr_dirty;
+};
+
+struct bkey_cached_key {
+ u32 btree_id;
+ struct bpos pos;
+} __packed __aligned(4);
+
+#define BKEY_CACHED_ACCESSED 0
+#define BKEY_CACHED_DIRTY 1
+
+struct bkey_cached {
+ struct btree_bkey_cached_common c;
+
+ unsigned long flags;
+ u16 u64s;
+ bool valid;
+ u32 btree_trans_barrier_seq;
+ struct bkey_cached_key key;
+
+ struct rhash_head hash;
+ struct list_head list;
+
+ struct journal_preres res;
+ struct journal_entry_pin journal;
+ u64 seq;
+
+ struct bkey_i *k;
+};
+
+static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
+{
+ return !b->cached
+ ? container_of(b, struct btree, c)->key.k.p
+ : container_of(b, struct bkey_cached, c)->key.pos;
+}
+
+struct btree_insert_entry {
+ unsigned flags;
+ u8 bkey_type;
+ enum btree_id btree_id:8;
+ u8 level:4;
+ bool cached:1;
+ bool insert_trigger_run:1;
+ bool overwrite_trigger_run:1;
+ bool key_cache_already_flushed:1;
+ /*
+ * @old_k may be a key from the journal; @old_btree_u64s always refers
+ * to the size of the key being overwritten in the btree:
+ */
+ u8 old_btree_u64s;
+ struct bkey_i *k;
+ struct btree_path *path;
+ u64 seq;
+ /* key being overwritten: */
+ struct bkey old_k;
+ const struct bch_val *old_v;
+ unsigned long ip_allocated;
+};
+
+#ifndef CONFIG_LOCKDEP
+#define BTREE_ITER_MAX 64
+#else
+#define BTREE_ITER_MAX 32
+#endif
+
+struct btree_trans_commit_hook;
+typedef int (btree_trans_commit_hook_fn)(struct btree_trans *, struct btree_trans_commit_hook *);
+
+struct btree_trans_commit_hook {
+ btree_trans_commit_hook_fn *fn;
+ struct btree_trans_commit_hook *next;
+};
+
+#define BTREE_TRANS_MEM_MAX (1U << 16)
+
+#define BTREE_TRANS_MAX_LOCK_HOLD_TIME_NS 10000
+
+struct btree_trans {
+ struct bch_fs *c;
+ const char *fn;
+ struct closure ref;
+ struct list_head list;
+ u64 last_begin_time;
+
+ u8 lock_may_not_fail;
+ u8 lock_must_abort;
+ struct btree_bkey_cached_common *locking;
+ struct six_lock_waiter locking_wait;
+
+ int srcu_idx;
+
+ u8 fn_idx;
+ u8 nr_sorted;
+ u8 nr_updates;
+ u8 nr_wb_updates;
+ u8 wb_updates_size;
+ bool used_mempool:1;
+ bool in_traverse_all:1;
+ bool paths_sorted:1;
+ bool memory_allocation_failure:1;
+ bool journal_transaction_names:1;
+ bool journal_replay_not_finished:1;
+ bool notrace_relock_fail:1;
+ enum bch_errcode restarted:16;
+ u32 restart_count;
+ unsigned long last_begin_ip;
+ unsigned long last_restarted_ip;
+ unsigned long srcu_lock_time;
+
+ /*
+ * For when bch2_trans_update notices we'll be splitting a compressed
+ * extent:
+ */
+ unsigned extra_journal_res;
+ unsigned nr_max_paths;
+
+ u64 paths_allocated;
+
+ unsigned mem_top;
+ unsigned mem_max;
+ unsigned mem_bytes;
+ void *mem;
+
+ u8 sorted[BTREE_ITER_MAX + 8];
+ struct btree_path paths[BTREE_ITER_MAX];
+ struct btree_insert_entry updates[BTREE_ITER_MAX];
+ struct btree_write_buffered_key *wb_updates;
+
+ /* update path: */
+ struct btree_trans_commit_hook *hooks;
+ darray_u64 extra_journal_entries;
+ struct journal_entry_pin *journal_pin;
+
+ struct journal_res journal_res;
+ struct journal_preres journal_preres;
+ u64 *journal_seq;
+ struct disk_reservation *disk_res;
+ unsigned journal_u64s;
+ unsigned journal_preres_u64s;
+ struct replicas_delta_list *fs_usage_deltas;
+};
+
+#define BCH_BTREE_WRITE_TYPES() \
+ x(initial, 0) \
+ x(init_next_bset, 1) \
+ x(cache_reclaim, 2) \
+ x(journal_reclaim, 3) \
+ x(interior, 4)
+
+enum btree_write_type {
+#define x(t, n) BTREE_WRITE_##t,
+ BCH_BTREE_WRITE_TYPES()
+#undef x
+ BTREE_WRITE_TYPE_NR,
+};
+
+#define BTREE_WRITE_TYPE_MASK (roundup_pow_of_two(BTREE_WRITE_TYPE_NR) - 1)
+#define BTREE_WRITE_TYPE_BITS ilog2(roundup_pow_of_two(BTREE_WRITE_TYPE_NR))
+
+#define BTREE_FLAGS() \
+ x(read_in_flight) \
+ x(read_error) \
+ x(dirty) \
+ x(need_write) \
+ x(write_blocked) \
+ x(will_make_reachable) \
+ x(noevict) \
+ x(write_idx) \
+ x(accessed) \
+ x(write_in_flight) \
+ x(write_in_flight_inner) \
+ x(just_written) \
+ x(dying) \
+ x(fake) \
+ x(need_rewrite) \
+ x(never_write)
+
+enum btree_flags {
+ /* First bits for btree node write type */
+ BTREE_NODE_FLAGS_START = BTREE_WRITE_TYPE_BITS - 1,
+#define x(flag) BTREE_NODE_##flag,
+ BTREE_FLAGS()
+#undef x
+};
+
+#define x(flag) \
+static inline bool btree_node_ ## flag(struct btree *b) \
+{ return test_bit(BTREE_NODE_ ## flag, &b->flags); } \
+ \
+static inline void set_btree_node_ ## flag(struct btree *b) \
+{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \
+ \
+static inline void clear_btree_node_ ## flag(struct btree *b) \
+{ clear_bit(BTREE_NODE_ ## flag, &b->flags); }
+
+BTREE_FLAGS()
+#undef x
+
+static inline struct btree_write *btree_current_write(struct btree *b)
+{
+ return b->writes + btree_node_write_idx(b);
+}
+
+static inline struct btree_write *btree_prev_write(struct btree *b)
+{
+ return b->writes + (btree_node_write_idx(b) ^ 1);
+}
+
+static inline struct bset_tree *bset_tree_last(struct btree *b)
+{
+ EBUG_ON(!b->nsets);
+ return b->set + b->nsets - 1;
+}
+
+static inline void *
+__btree_node_offset_to_ptr(const struct btree *b, u16 offset)
+{
+ return (void *) ((u64 *) b->data + 1 + offset);
+}
+
+static inline u16
+__btree_node_ptr_to_offset(const struct btree *b, const void *p)
+{
+ u16 ret = (u64 *) p - 1 - (u64 *) b->data;
+
+ EBUG_ON(__btree_node_offset_to_ptr(b, ret) != p);
+ return ret;
+}
+
+static inline struct bset *bset(const struct btree *b,
+ const struct bset_tree *t)
+{
+ return __btree_node_offset_to_ptr(b, t->data_offset);
+}
+
+static inline void set_btree_bset_end(struct btree *b, struct bset_tree *t)
+{
+ t->end_offset =
+ __btree_node_ptr_to_offset(b, vstruct_last(bset(b, t)));
+}
+
+static inline void set_btree_bset(struct btree *b, struct bset_tree *t,
+ const struct bset *i)
+{
+ t->data_offset = __btree_node_ptr_to_offset(b, i);
+ set_btree_bset_end(b, t);
+}
+
+static inline struct bset *btree_bset_first(struct btree *b)
+{
+ return bset(b, b->set);
+}
+
+static inline struct bset *btree_bset_last(struct btree *b)
+{
+ return bset(b, bset_tree_last(b));
+}
+
+static inline u16
+__btree_node_key_to_offset(const struct btree *b, const struct bkey_packed *k)
+{
+ return __btree_node_ptr_to_offset(b, k);
+}
+
+static inline struct bkey_packed *
+__btree_node_offset_to_key(const struct btree *b, u16 k)
+{
+ return __btree_node_offset_to_ptr(b, k);
+}
+
+static inline unsigned btree_bkey_first_offset(const struct bset_tree *t)
+{
+ return t->data_offset + offsetof(struct bset, _data) / sizeof(u64);
+}
+
+#define btree_bkey_first(_b, _t) \
+({ \
+ EBUG_ON(bset(_b, _t)->start != \
+ __btree_node_offset_to_key(_b, btree_bkey_first_offset(_t)));\
+ \
+ bset(_b, _t)->start; \
+})
+
+#define btree_bkey_last(_b, _t) \
+({ \
+ EBUG_ON(__btree_node_offset_to_key(_b, (_t)->end_offset) != \
+ vstruct_last(bset(_b, _t))); \
+ \
+ __btree_node_offset_to_key(_b, (_t)->end_offset); \
+})
+
+static inline unsigned bset_u64s(struct bset_tree *t)
+{
+ return t->end_offset - t->data_offset -
+ sizeof(struct bset) / sizeof(u64);
+}
+
+static inline unsigned bset_dead_u64s(struct btree *b, struct bset_tree *t)
+{
+ return bset_u64s(t) - b->nr.bset_u64s[t - b->set];
+}
+
+static inline unsigned bset_byte_offset(struct btree *b, void *i)
+{
+ return i - (void *) b->data;
+}
+
+enum btree_node_type {
+#define x(kwd, val, ...) BKEY_TYPE_##kwd = val,
+ BCH_BTREE_IDS()
+#undef x
+ BKEY_TYPE_btree,
+};
+
+/* Type of a key in btree @id at level @level: */
+static inline enum btree_node_type __btree_node_type(unsigned level, enum btree_id id)
+{
+ return level ? BKEY_TYPE_btree : (enum btree_node_type) id;
+}
+
+/* Type of keys @b contains: */
+static inline enum btree_node_type btree_node_type(struct btree *b)
+{
+ return __btree_node_type(b->c.level, b->c.btree_id);
+}
+
+#define BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS \
+ (BIT(BKEY_TYPE_extents)| \
+ BIT(BKEY_TYPE_alloc)| \
+ BIT(BKEY_TYPE_inodes)| \
+ BIT(BKEY_TYPE_stripes)| \
+ BIT(BKEY_TYPE_reflink)| \
+ BIT(BKEY_TYPE_btree))
+
+#define BTREE_NODE_TYPE_HAS_MEM_TRIGGERS \
+ (BIT(BKEY_TYPE_alloc)| \
+ BIT(BKEY_TYPE_inodes)| \
+ BIT(BKEY_TYPE_stripes)| \
+ BIT(BKEY_TYPE_snapshots))
+
+#define BTREE_NODE_TYPE_HAS_TRIGGERS \
+ (BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS| \
+ BTREE_NODE_TYPE_HAS_MEM_TRIGGERS)
+
+static inline bool btree_node_type_needs_gc(enum btree_node_type type)
+{
+ return BTREE_NODE_TYPE_HAS_TRIGGERS & (1U << type);
+}
+
+static inline bool btree_node_type_is_extents(enum btree_node_type type)
+{
+ const unsigned mask = 0
+#define x(name, nr, flags, ...) |((!!((flags) & BTREE_ID_EXTENTS)) << nr)
+ BCH_BTREE_IDS()
+#undef x
+ ;
+
+ return (1U << type) & mask;
+}
+
+static inline bool btree_id_is_extents(enum btree_id btree)
+{
+ return btree_node_type_is_extents((enum btree_node_type) btree);
+}
+
+static inline bool btree_type_has_snapshots(enum btree_id id)
+{
+ const unsigned mask = 0
+#define x(name, nr, flags, ...) |((!!((flags) & BTREE_ID_SNAPSHOTS)) << nr)
+ BCH_BTREE_IDS()
+#undef x
+ ;
+
+ return (1U << id) & mask;
+}
+
+static inline bool btree_type_has_ptrs(enum btree_id id)
+{
+ const unsigned mask = 0
+#define x(name, nr, flags, ...) |((!!((flags) & BTREE_ID_DATA)) << nr)
+ BCH_BTREE_IDS()
+#undef x
+ ;
+
+ return (1U << id) & mask;
+}
+
+struct btree_root {
+ struct btree *b;
+
+ /* On disk root - see async splits: */
+ __BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
+ u8 level;
+ u8 alive;
+ s8 error;
+};
+
+enum btree_gc_coalesce_fail_reason {
+ BTREE_GC_COALESCE_FAIL_RESERVE_GET,
+ BTREE_GC_COALESCE_FAIL_KEYLIST_REALLOC,
+ BTREE_GC_COALESCE_FAIL_FORMAT_FITS,
+};
+
+enum btree_node_sibling {
+ btree_prev_sib,
+ btree_next_sib,
+};
+
+#endif /* _BCACHEFS_BTREE_TYPES_H */
--- /dev/null
- sb->s_shrink.seeks = 0;
+// SPDX-License-Identifier: GPL-2.0
+#ifndef NO_BCACHEFS_FS
+
+#include "bcachefs.h"
+#include "acl.h"
+#include "bkey_buf.h"
+#include "btree_update.h"
+#include "buckets.h"
+#include "chardev.h"
+#include "dirent.h"
+#include "errcode.h"
+#include "extents.h"
+#include "fs.h"
+#include "fs-common.h"
+#include "fs-io.h"
+#include "fs-ioctl.h"
+#include "fs-io-buffered.h"
+#include "fs-io-direct.h"
+#include "fs-io-pagecache.h"
+#include "fsck.h"
+#include "inode.h"
+#include "io_read.h"
+#include "journal.h"
+#include "keylist.h"
+#include "quota.h"
+#include "snapshot.h"
+#include "super.h"
+#include "xattr.h"
+
+#include <linux/aio.h>
+#include <linux/backing-dev.h>
+#include <linux/exportfs.h>
+#include <linux/fiemap.h>
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/posix_acl.h>
+#include <linux/random.h>
+#include <linux/seq_file.h>
+#include <linux/statfs.h>
+#include <linux/string.h>
+#include <linux/xattr.h>
+
+static struct kmem_cache *bch2_inode_cache;
+
+static void bch2_vfs_inode_init(struct btree_trans *, subvol_inum,
+ struct bch_inode_info *,
+ struct bch_inode_unpacked *,
+ struct bch_subvolume *);
+
+void bch2_inode_update_after_write(struct btree_trans *trans,
+ struct bch_inode_info *inode,
+ struct bch_inode_unpacked *bi,
+ unsigned fields)
+{
+ struct bch_fs *c = trans->c;
+
+ BUG_ON(bi->bi_inum != inode->v.i_ino);
+
+ bch2_assert_pos_locked(trans, BTREE_ID_inodes,
+ POS(0, bi->bi_inum),
+ c->opts.inodes_use_key_cache);
+
+ set_nlink(&inode->v, bch2_inode_nlink_get(bi));
+ i_uid_write(&inode->v, bi->bi_uid);
+ i_gid_write(&inode->v, bi->bi_gid);
+ inode->v.i_mode = bi->bi_mode;
+
+ if (fields & ATTR_ATIME)
+ inode_set_atime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_atime));
+ if (fields & ATTR_MTIME)
+ inode_set_mtime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_mtime));
+ if (fields & ATTR_CTIME)
+ inode_set_ctime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_ctime));
+
+ inode->ei_inode = *bi;
+
+ bch2_inode_flags_to_vfs(inode);
+}
+
+int __must_check bch2_write_inode(struct bch_fs *c,
+ struct bch_inode_info *inode,
+ inode_set_fn set,
+ void *p, unsigned fields)
+{
+ struct btree_trans *trans = bch2_trans_get(c);
+ struct btree_iter iter = { NULL };
+ struct bch_inode_unpacked inode_u;
+ int ret;
+retry:
+ bch2_trans_begin(trans);
+
+ ret = bch2_inode_peek(trans, &iter, &inode_u, inode_inum(inode),
+ BTREE_ITER_INTENT) ?:
+ (set ? set(trans, inode, &inode_u, p) : 0) ?:
+ bch2_inode_write(trans, &iter, &inode_u) ?:
+ bch2_trans_commit(trans, NULL, NULL, BTREE_INSERT_NOFAIL);
+
+ /*
+ * the btree node lock protects inode->ei_inode, not ei_update_lock;
+ * this is important for inode updates via bchfs_write_index_update
+ */
+ if (!ret)
+ bch2_inode_update_after_write(trans, inode, &inode_u, fields);
+
+ bch2_trans_iter_exit(trans, &iter);
+
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ goto retry;
+
+ bch2_fs_fatal_err_on(bch2_err_matches(ret, ENOENT), c,
+ "inode %u:%llu not found when updating",
+ inode_inum(inode).subvol,
+ inode_inum(inode).inum);
+
+ bch2_trans_put(trans);
+ return ret < 0 ? ret : 0;
+}
+
+int bch2_fs_quota_transfer(struct bch_fs *c,
+ struct bch_inode_info *inode,
+ struct bch_qid new_qid,
+ unsigned qtypes,
+ enum quota_acct_mode mode)
+{
+ unsigned i;
+ int ret;
+
+ qtypes &= enabled_qtypes(c);
+
+ for (i = 0; i < QTYP_NR; i++)
+ if (new_qid.q[i] == inode->ei_qid.q[i])
+ qtypes &= ~(1U << i);
+
+ if (!qtypes)
+ return 0;
+
+ mutex_lock(&inode->ei_quota_lock);
+
+ ret = bch2_quota_transfer(c, qtypes, new_qid,
+ inode->ei_qid,
+ inode->v.i_blocks +
+ inode->ei_quota_reserved,
+ mode);
+ if (!ret)
+ for (i = 0; i < QTYP_NR; i++)
+ if (qtypes & (1 << i))
+ inode->ei_qid.q[i] = new_qid.q[i];
+
+ mutex_unlock(&inode->ei_quota_lock);
+
+ return ret;
+}
+
+static int bch2_iget5_test(struct inode *vinode, void *p)
+{
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+ subvol_inum *inum = p;
+
+ return inode->ei_subvol == inum->subvol &&
+ inode->ei_inode.bi_inum == inum->inum;
+}
+
+static int bch2_iget5_set(struct inode *vinode, void *p)
+{
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+ subvol_inum *inum = p;
+
+ inode->v.i_ino = inum->inum;
+ inode->ei_subvol = inum->subvol;
+ inode->ei_inode.bi_inum = inum->inum;
+ return 0;
+}
+
+static unsigned bch2_inode_hash(subvol_inum inum)
+{
+ return jhash_3words(inum.subvol, inum.inum >> 32, inum.inum, JHASH_INITVAL);
+}
+
+struct inode *bch2_vfs_inode_get(struct bch_fs *c, subvol_inum inum)
+{
+ struct bch_inode_unpacked inode_u;
+ struct bch_inode_info *inode;
+ struct btree_trans *trans;
+ struct bch_subvolume subvol;
+ int ret;
+
+ inode = to_bch_ei(iget5_locked(c->vfs_sb,
+ bch2_inode_hash(inum),
+ bch2_iget5_test,
+ bch2_iget5_set,
+ &inum));
+ if (unlikely(!inode))
+ return ERR_PTR(-ENOMEM);
+ if (!(inode->v.i_state & I_NEW))
+ return &inode->v;
+
+ trans = bch2_trans_get(c);
+ ret = lockrestart_do(trans,
+ bch2_subvolume_get(trans, inum.subvol, true, 0, &subvol) ?:
+ bch2_inode_find_by_inum_trans(trans, inum, &inode_u));
+
+ if (!ret)
+ bch2_vfs_inode_init(trans, inum, inode, &inode_u, &subvol);
+ bch2_trans_put(trans);
+
+ if (ret) {
+ iget_failed(&inode->v);
+ return ERR_PTR(bch2_err_class(ret));
+ }
+
+ mutex_lock(&c->vfs_inodes_lock);
+ list_add(&inode->ei_vfs_inode_list, &c->vfs_inodes_list);
+ mutex_unlock(&c->vfs_inodes_lock);
+
+ unlock_new_inode(&inode->v);
+
+ return &inode->v;
+}
+
+struct bch_inode_info *
+__bch2_create(struct mnt_idmap *idmap,
+ struct bch_inode_info *dir, struct dentry *dentry,
+ umode_t mode, dev_t rdev, subvol_inum snapshot_src,
+ unsigned flags)
+{
+ struct bch_fs *c = dir->v.i_sb->s_fs_info;
+ struct btree_trans *trans;
+ struct bch_inode_unpacked dir_u;
+ struct bch_inode_info *inode, *old;
+ struct bch_inode_unpacked inode_u;
+ struct posix_acl *default_acl = NULL, *acl = NULL;
+ subvol_inum inum;
+ struct bch_subvolume subvol;
+ u64 journal_seq = 0;
+ int ret;
+
+ /*
+ * preallocate acls + vfs inode before btree transaction, so that
+ * nothing can fail after the transaction succeeds:
+ */
+#ifdef CONFIG_BCACHEFS_POSIX_ACL
+ ret = posix_acl_create(&dir->v, &mode, &default_acl, &acl);
+ if (ret)
+ return ERR_PTR(ret);
+#endif
+ inode = to_bch_ei(new_inode(c->vfs_sb));
+ if (unlikely(!inode)) {
+ inode = ERR_PTR(-ENOMEM);
+ goto err;
+ }
+
+ bch2_inode_init_early(c, &inode_u);
+
+ if (!(flags & BCH_CREATE_TMPFILE))
+ mutex_lock(&dir->ei_update_lock);
+
+ trans = bch2_trans_get(c);
+retry:
+ bch2_trans_begin(trans);
+
+ ret = bch2_create_trans(trans,
+ inode_inum(dir), &dir_u, &inode_u,
+ !(flags & BCH_CREATE_TMPFILE)
+ ? &dentry->d_name : NULL,
+ from_kuid(i_user_ns(&dir->v), current_fsuid()),
+ from_kgid(i_user_ns(&dir->v), current_fsgid()),
+ mode, rdev,
+ default_acl, acl, snapshot_src, flags) ?:
+ bch2_quota_acct(c, bch_qid(&inode_u), Q_INO, 1,
+ KEY_TYPE_QUOTA_PREALLOC);
+ if (unlikely(ret))
+ goto err_before_quota;
+
+ inum.subvol = inode_u.bi_subvol ?: dir->ei_subvol;
+ inum.inum = inode_u.bi_inum;
+
+ ret = bch2_subvolume_get(trans, inum.subvol, true,
+ BTREE_ITER_WITH_UPDATES, &subvol) ?:
+ bch2_trans_commit(trans, NULL, &journal_seq, 0);
+ if (unlikely(ret)) {
+ bch2_quota_acct(c, bch_qid(&inode_u), Q_INO, -1,
+ KEY_TYPE_QUOTA_WARN);
+err_before_quota:
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ goto retry;
+ goto err_trans;
+ }
+
+ if (!(flags & BCH_CREATE_TMPFILE)) {
+ bch2_inode_update_after_write(trans, dir, &dir_u,
+ ATTR_MTIME|ATTR_CTIME);
+ mutex_unlock(&dir->ei_update_lock);
+ }
+
+ bch2_iget5_set(&inode->v, &inum);
+ bch2_vfs_inode_init(trans, inum, inode, &inode_u, &subvol);
+
+ set_cached_acl(&inode->v, ACL_TYPE_ACCESS, acl);
+ set_cached_acl(&inode->v, ACL_TYPE_DEFAULT, default_acl);
+
+ /*
+ * we must insert the new inode into the inode cache before calling
+ * bch2_trans_exit() and dropping locks, else we could race with another
+ * thread pulling the inode in and modifying it:
+ */
+
+ inode->v.i_state |= I_CREATING;
+
+ old = to_bch_ei(inode_insert5(&inode->v,
+ bch2_inode_hash(inum),
+ bch2_iget5_test,
+ bch2_iget5_set,
+ &inum));
+ BUG_ON(!old);
+
+ if (unlikely(old != inode)) {
+ /*
+ * We raced, another process pulled the new inode into cache
+ * before us:
+ */
+ make_bad_inode(&inode->v);
+ iput(&inode->v);
+
+ inode = old;
+ } else {
+ mutex_lock(&c->vfs_inodes_lock);
+ list_add(&inode->ei_vfs_inode_list, &c->vfs_inodes_list);
+ mutex_unlock(&c->vfs_inodes_lock);
+ /*
+ * we really don't want insert_inode_locked2() to be setting
+ * I_NEW...
+ */
+ unlock_new_inode(&inode->v);
+ }
+
+ bch2_trans_put(trans);
+err:
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
+ return inode;
+err_trans:
+ if (!(flags & BCH_CREATE_TMPFILE))
+ mutex_unlock(&dir->ei_update_lock);
+
+ bch2_trans_put(trans);
+ make_bad_inode(&inode->v);
+ iput(&inode->v);
+ inode = ERR_PTR(ret);
+ goto err;
+}
+
+/* methods */
+
+static struct dentry *bch2_lookup(struct inode *vdir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct bch_fs *c = vdir->i_sb->s_fs_info;
+ struct bch_inode_info *dir = to_bch_ei(vdir);
+ struct bch_hash_info hash = bch2_hash_info_init(c, &dir->ei_inode);
+ struct inode *vinode = NULL;
+ subvol_inum inum = { .subvol = 1 };
+ int ret;
+
+ ret = bch2_dirent_lookup(c, inode_inum(dir), &hash,
+ &dentry->d_name, &inum);
+
+ if (!ret)
+ vinode = bch2_vfs_inode_get(c, inum);
+
+ return d_splice_alias(vinode, dentry);
+}
+
+static int bch2_mknod(struct mnt_idmap *idmap,
+ struct inode *vdir, struct dentry *dentry,
+ umode_t mode, dev_t rdev)
+{
+ struct bch_inode_info *inode =
+ __bch2_create(idmap, to_bch_ei(vdir), dentry, mode, rdev,
+ (subvol_inum) { 0 }, 0);
+
+ if (IS_ERR(inode))
+ return bch2_err_class(PTR_ERR(inode));
+
+ d_instantiate(dentry, &inode->v);
+ return 0;
+}
+
+static int bch2_create(struct mnt_idmap *idmap,
+ struct inode *vdir, struct dentry *dentry,
+ umode_t mode, bool excl)
+{
+ return bch2_mknod(idmap, vdir, dentry, mode|S_IFREG, 0);
+}
+
+static int __bch2_link(struct bch_fs *c,
+ struct bch_inode_info *inode,
+ struct bch_inode_info *dir,
+ struct dentry *dentry)
+{
+ struct btree_trans *trans = bch2_trans_get(c);
+ struct bch_inode_unpacked dir_u, inode_u;
+ int ret;
+
+ mutex_lock(&inode->ei_update_lock);
+
+ ret = commit_do(trans, NULL, NULL, 0,
+ bch2_link_trans(trans,
+ inode_inum(dir), &dir_u,
+ inode_inum(inode), &inode_u,
+ &dentry->d_name));
+
+ if (likely(!ret)) {
+ bch2_inode_update_after_write(trans, dir, &dir_u,
+ ATTR_MTIME|ATTR_CTIME);
+ bch2_inode_update_after_write(trans, inode, &inode_u, ATTR_CTIME);
+ }
+
+ bch2_trans_put(trans);
+ mutex_unlock(&inode->ei_update_lock);
+ return ret;
+}
+
+static int bch2_link(struct dentry *old_dentry, struct inode *vdir,
+ struct dentry *dentry)
+{
+ struct bch_fs *c = vdir->i_sb->s_fs_info;
+ struct bch_inode_info *dir = to_bch_ei(vdir);
+ struct bch_inode_info *inode = to_bch_ei(old_dentry->d_inode);
+ int ret;
+
+ lockdep_assert_held(&inode->v.i_rwsem);
+
+ ret = __bch2_link(c, inode, dir, dentry);
+ if (unlikely(ret))
+ return ret;
+
+ ihold(&inode->v);
+ d_instantiate(dentry, &inode->v);
+ return 0;
+}
+
+int __bch2_unlink(struct inode *vdir, struct dentry *dentry,
+ bool deleting_snapshot)
+{
+ struct bch_fs *c = vdir->i_sb->s_fs_info;
+ struct bch_inode_info *dir = to_bch_ei(vdir);
+ struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
+ struct bch_inode_unpacked dir_u, inode_u;
+ struct btree_trans *trans = bch2_trans_get(c);
+ int ret;
+
+ bch2_lock_inodes(INODE_UPDATE_LOCK, dir, inode);
+
+ ret = commit_do(trans, NULL, NULL,
+ BTREE_INSERT_NOFAIL,
+ bch2_unlink_trans(trans,
+ inode_inum(dir), &dir_u,
+ &inode_u, &dentry->d_name,
+ deleting_snapshot));
+ if (unlikely(ret))
+ goto err;
+
+ bch2_inode_update_after_write(trans, dir, &dir_u,
+ ATTR_MTIME|ATTR_CTIME);
+ bch2_inode_update_after_write(trans, inode, &inode_u,
+ ATTR_MTIME);
+
+ if (inode_u.bi_subvol) {
+ /*
+ * Subvolume deletion is asynchronous, but we still want to tell
+ * the VFS that it's been deleted here:
+ */
+ set_nlink(&inode->v, 0);
+ }
+err:
+ bch2_unlock_inodes(INODE_UPDATE_LOCK, dir, inode);
+ bch2_trans_put(trans);
+
+ return ret;
+}
+
+static int bch2_unlink(struct inode *vdir, struct dentry *dentry)
+{
+ return __bch2_unlink(vdir, dentry, false);
+}
+
+static int bch2_symlink(struct mnt_idmap *idmap,
+ struct inode *vdir, struct dentry *dentry,
+ const char *symname)
+{
+ struct bch_fs *c = vdir->i_sb->s_fs_info;
+ struct bch_inode_info *dir = to_bch_ei(vdir), *inode;
+ int ret;
+
+ inode = __bch2_create(idmap, dir, dentry, S_IFLNK|S_IRWXUGO, 0,
+ (subvol_inum) { 0 }, BCH_CREATE_TMPFILE);
+ if (IS_ERR(inode))
+ return bch2_err_class(PTR_ERR(inode));
+
+ inode_lock(&inode->v);
+ ret = page_symlink(&inode->v, symname, strlen(symname) + 1);
+ inode_unlock(&inode->v);
+
+ if (unlikely(ret))
+ goto err;
+
+ ret = filemap_write_and_wait_range(inode->v.i_mapping, 0, LLONG_MAX);
+ if (unlikely(ret))
+ goto err;
+
+ ret = __bch2_link(c, inode, dir, dentry);
+ if (unlikely(ret))
+ goto err;
+
+ d_instantiate(dentry, &inode->v);
+ return 0;
+err:
+ iput(&inode->v);
+ return ret;
+}
+
+static int bch2_mkdir(struct mnt_idmap *idmap,
+ struct inode *vdir, struct dentry *dentry, umode_t mode)
+{
+ return bch2_mknod(idmap, vdir, dentry, mode|S_IFDIR, 0);
+}
+
+static int bch2_rename2(struct mnt_idmap *idmap,
+ struct inode *src_vdir, struct dentry *src_dentry,
+ struct inode *dst_vdir, struct dentry *dst_dentry,
+ unsigned flags)
+{
+ struct bch_fs *c = src_vdir->i_sb->s_fs_info;
+ struct bch_inode_info *src_dir = to_bch_ei(src_vdir);
+ struct bch_inode_info *dst_dir = to_bch_ei(dst_vdir);
+ struct bch_inode_info *src_inode = to_bch_ei(src_dentry->d_inode);
+ struct bch_inode_info *dst_inode = to_bch_ei(dst_dentry->d_inode);
+ struct bch_inode_unpacked dst_dir_u, src_dir_u;
+ struct bch_inode_unpacked src_inode_u, dst_inode_u;
+ struct btree_trans *trans;
+ enum bch_rename_mode mode = flags & RENAME_EXCHANGE
+ ? BCH_RENAME_EXCHANGE
+ : dst_dentry->d_inode
+ ? BCH_RENAME_OVERWRITE : BCH_RENAME;
+ int ret;
+
+ if (flags & ~(RENAME_NOREPLACE|RENAME_EXCHANGE))
+ return -EINVAL;
+
+ if (mode == BCH_RENAME_OVERWRITE) {
+ ret = filemap_write_and_wait_range(src_inode->v.i_mapping,
+ 0, LLONG_MAX);
+ if (ret)
+ return ret;
+ }
+
+ trans = bch2_trans_get(c);
+
+ bch2_lock_inodes(INODE_UPDATE_LOCK,
+ src_dir,
+ dst_dir,
+ src_inode,
+ dst_inode);
+
+ if (inode_attr_changing(dst_dir, src_inode, Inode_opt_project)) {
+ ret = bch2_fs_quota_transfer(c, src_inode,
+ dst_dir->ei_qid,
+ 1 << QTYP_PRJ,
+ KEY_TYPE_QUOTA_PREALLOC);
+ if (ret)
+ goto err;
+ }
+
+ if (mode == BCH_RENAME_EXCHANGE &&
+ inode_attr_changing(src_dir, dst_inode, Inode_opt_project)) {
+ ret = bch2_fs_quota_transfer(c, dst_inode,
+ src_dir->ei_qid,
+ 1 << QTYP_PRJ,
+ KEY_TYPE_QUOTA_PREALLOC);
+ if (ret)
+ goto err;
+ }
+
+ ret = commit_do(trans, NULL, NULL, 0,
+ bch2_rename_trans(trans,
+ inode_inum(src_dir), &src_dir_u,
+ inode_inum(dst_dir), &dst_dir_u,
+ &src_inode_u,
+ &dst_inode_u,
+ &src_dentry->d_name,
+ &dst_dentry->d_name,
+ mode));
+ if (unlikely(ret))
+ goto err;
+
+ BUG_ON(src_inode->v.i_ino != src_inode_u.bi_inum);
+ BUG_ON(dst_inode &&
+ dst_inode->v.i_ino != dst_inode_u.bi_inum);
+
+ bch2_inode_update_after_write(trans, src_dir, &src_dir_u,
+ ATTR_MTIME|ATTR_CTIME);
+
+ if (src_dir != dst_dir)
+ bch2_inode_update_after_write(trans, dst_dir, &dst_dir_u,
+ ATTR_MTIME|ATTR_CTIME);
+
+ bch2_inode_update_after_write(trans, src_inode, &src_inode_u,
+ ATTR_CTIME);
+
+ if (dst_inode)
+ bch2_inode_update_after_write(trans, dst_inode, &dst_inode_u,
+ ATTR_CTIME);
+err:
+ bch2_trans_put(trans);
+
+ bch2_fs_quota_transfer(c, src_inode,
+ bch_qid(&src_inode->ei_inode),
+ 1 << QTYP_PRJ,
+ KEY_TYPE_QUOTA_NOCHECK);
+ if (dst_inode)
+ bch2_fs_quota_transfer(c, dst_inode,
+ bch_qid(&dst_inode->ei_inode),
+ 1 << QTYP_PRJ,
+ KEY_TYPE_QUOTA_NOCHECK);
+
+ bch2_unlock_inodes(INODE_UPDATE_LOCK,
+ src_dir,
+ dst_dir,
+ src_inode,
+ dst_inode);
+
+ return ret;
+}
+
+static void bch2_setattr_copy(struct mnt_idmap *idmap,
+ struct bch_inode_info *inode,
+ struct bch_inode_unpacked *bi,
+ struct iattr *attr)
+{
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+ unsigned int ia_valid = attr->ia_valid;
+
+ if (ia_valid & ATTR_UID)
+ bi->bi_uid = from_kuid(i_user_ns(&inode->v), attr->ia_uid);
+ if (ia_valid & ATTR_GID)
+ bi->bi_gid = from_kgid(i_user_ns(&inode->v), attr->ia_gid);
+
+ if (ia_valid & ATTR_SIZE)
+ bi->bi_size = attr->ia_size;
+
+ if (ia_valid & ATTR_ATIME)
+ bi->bi_atime = timespec_to_bch2_time(c, attr->ia_atime);
+ if (ia_valid & ATTR_MTIME)
+ bi->bi_mtime = timespec_to_bch2_time(c, attr->ia_mtime);
+ if (ia_valid & ATTR_CTIME)
+ bi->bi_ctime = timespec_to_bch2_time(c, attr->ia_ctime);
+
+ if (ia_valid & ATTR_MODE) {
+ umode_t mode = attr->ia_mode;
+ kgid_t gid = ia_valid & ATTR_GID
+ ? attr->ia_gid
+ : inode->v.i_gid;
+
+ if (!in_group_p(gid) &&
+ !capable_wrt_inode_uidgid(idmap, &inode->v, CAP_FSETID))
+ mode &= ~S_ISGID;
+ bi->bi_mode = mode;
+ }
+}
+
+int bch2_setattr_nonsize(struct mnt_idmap *idmap,
+ struct bch_inode_info *inode,
+ struct iattr *attr)
+{
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+ struct bch_qid qid;
+ struct btree_trans *trans;
+ struct btree_iter inode_iter = { NULL };
+ struct bch_inode_unpacked inode_u;
+ struct posix_acl *acl = NULL;
+ int ret;
+
+ mutex_lock(&inode->ei_update_lock);
+
+ qid = inode->ei_qid;
+
+ if (attr->ia_valid & ATTR_UID)
+ qid.q[QTYP_USR] = from_kuid(i_user_ns(&inode->v), attr->ia_uid);
+
+ if (attr->ia_valid & ATTR_GID)
+ qid.q[QTYP_GRP] = from_kgid(i_user_ns(&inode->v), attr->ia_gid);
+
+ ret = bch2_fs_quota_transfer(c, inode, qid, ~0,
+ KEY_TYPE_QUOTA_PREALLOC);
+ if (ret)
+ goto err;
+
+ trans = bch2_trans_get(c);
+retry:
+ bch2_trans_begin(trans);
+ kfree(acl);
+ acl = NULL;
+
+ ret = bch2_inode_peek(trans, &inode_iter, &inode_u, inode_inum(inode),
+ BTREE_ITER_INTENT);
+ if (ret)
+ goto btree_err;
+
+ bch2_setattr_copy(idmap, inode, &inode_u, attr);
+
+ if (attr->ia_valid & ATTR_MODE) {
+ ret = bch2_acl_chmod(trans, inode_inum(inode), &inode_u,
+ inode_u.bi_mode, &acl);
+ if (ret)
+ goto btree_err;
+ }
+
+ ret = bch2_inode_write(trans, &inode_iter, &inode_u) ?:
+ bch2_trans_commit(trans, NULL, NULL,
+ BTREE_INSERT_NOFAIL);
+btree_err:
+ bch2_trans_iter_exit(trans, &inode_iter);
+
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ goto retry;
+ if (unlikely(ret))
+ goto err_trans;
+
+ bch2_inode_update_after_write(trans, inode, &inode_u, attr->ia_valid);
+
+ if (acl)
+ set_cached_acl(&inode->v, ACL_TYPE_ACCESS, acl);
+err_trans:
+ bch2_trans_put(trans);
+err:
+ mutex_unlock(&inode->ei_update_lock);
+
+ return bch2_err_class(ret);
+}
+
+static int bch2_getattr(struct mnt_idmap *idmap,
+ const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned query_flags)
+{
+ struct bch_inode_info *inode = to_bch_ei(d_inode(path->dentry));
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+
+ stat->dev = inode->v.i_sb->s_dev;
+ stat->ino = inode->v.i_ino;
+ stat->mode = inode->v.i_mode;
+ stat->nlink = inode->v.i_nlink;
+ stat->uid = inode->v.i_uid;
+ stat->gid = inode->v.i_gid;
+ stat->rdev = inode->v.i_rdev;
+ stat->size = i_size_read(&inode->v);
+ stat->atime = inode_get_atime(&inode->v);
+ stat->mtime = inode_get_mtime(&inode->v);
+ stat->ctime = inode_get_ctime(&inode->v);
+ stat->blksize = block_bytes(c);
+ stat->blocks = inode->v.i_blocks;
+
+ if (request_mask & STATX_BTIME) {
+ stat->result_mask |= STATX_BTIME;
+ stat->btime = bch2_time_to_timespec(c, inode->ei_inode.bi_otime);
+ }
+
+ if (inode->ei_inode.bi_flags & BCH_INODE_IMMUTABLE)
+ stat->attributes |= STATX_ATTR_IMMUTABLE;
+ stat->attributes_mask |= STATX_ATTR_IMMUTABLE;
+
+ if (inode->ei_inode.bi_flags & BCH_INODE_APPEND)
+ stat->attributes |= STATX_ATTR_APPEND;
+ stat->attributes_mask |= STATX_ATTR_APPEND;
+
+ if (inode->ei_inode.bi_flags & BCH_INODE_NODUMP)
+ stat->attributes |= STATX_ATTR_NODUMP;
+ stat->attributes_mask |= STATX_ATTR_NODUMP;
+
+ return 0;
+}
+
+static int bch2_setattr(struct mnt_idmap *idmap,
+ struct dentry *dentry, struct iattr *iattr)
+{
+ struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
+ int ret;
+
+ lockdep_assert_held(&inode->v.i_rwsem);
+
+ ret = setattr_prepare(idmap, dentry, iattr);
+ if (ret)
+ return ret;
+
+ return iattr->ia_valid & ATTR_SIZE
+ ? bchfs_truncate(idmap, inode, iattr)
+ : bch2_setattr_nonsize(idmap, inode, iattr);
+}
+
+static int bch2_tmpfile(struct mnt_idmap *idmap,
+ struct inode *vdir, struct file *file, umode_t mode)
+{
+ struct bch_inode_info *inode =
+ __bch2_create(idmap, to_bch_ei(vdir),
+ file->f_path.dentry, mode, 0,
+ (subvol_inum) { 0 }, BCH_CREATE_TMPFILE);
+
+ if (IS_ERR(inode))
+ return bch2_err_class(PTR_ERR(inode));
+
+ d_mark_tmpfile(file, &inode->v);
+ d_instantiate(file->f_path.dentry, &inode->v);
+ return finish_open_simple(file, 0);
+}
+
+static int bch2_fill_extent(struct bch_fs *c,
+ struct fiemap_extent_info *info,
+ struct bkey_s_c k, unsigned flags)
+{
+ if (bkey_extent_is_direct_data(k.k)) {
+ struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
+ const union bch_extent_entry *entry;
+ struct extent_ptr_decoded p;
+ int ret;
+
+ if (k.k->type == KEY_TYPE_reflink_v)
+ flags |= FIEMAP_EXTENT_SHARED;
+
+ bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
+ int flags2 = 0;
+ u64 offset = p.ptr.offset;
+
+ if (p.ptr.unwritten)
+ flags2 |= FIEMAP_EXTENT_UNWRITTEN;
+
+ if (p.crc.compression_type)
+ flags2 |= FIEMAP_EXTENT_ENCODED;
+ else
+ offset += p.crc.offset;
+
+ if ((offset & (block_sectors(c) - 1)) ||
+ (k.k->size & (block_sectors(c) - 1)))
+ flags2 |= FIEMAP_EXTENT_NOT_ALIGNED;
+
+ ret = fiemap_fill_next_extent(info,
+ bkey_start_offset(k.k) << 9,
+ offset << 9,
+ k.k->size << 9, flags|flags2);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+ } else if (bkey_extent_is_inline_data(k.k)) {
+ return fiemap_fill_next_extent(info,
+ bkey_start_offset(k.k) << 9,
+ 0, k.k->size << 9,
+ flags|
+ FIEMAP_EXTENT_DATA_INLINE);
+ } else if (k.k->type == KEY_TYPE_reservation) {
+ return fiemap_fill_next_extent(info,
+ bkey_start_offset(k.k) << 9,
+ 0, k.k->size << 9,
+ flags|
+ FIEMAP_EXTENT_DELALLOC|
+ FIEMAP_EXTENT_UNWRITTEN);
+ } else {
+ BUG();
+ }
+}
+
+static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
+ u64 start, u64 len)
+{
+ struct bch_fs *c = vinode->i_sb->s_fs_info;
+ struct bch_inode_info *ei = to_bch_ei(vinode);
+ struct btree_trans *trans;
+ struct btree_iter iter;
+ struct bkey_s_c k;
+ struct bkey_buf cur, prev;
+ struct bpos end = POS(ei->v.i_ino, (start + len) >> 9);
+ unsigned offset_into_extent, sectors;
+ bool have_extent = false;
+ u32 snapshot;
+ int ret = 0;
+
+ ret = fiemap_prep(&ei->v, info, start, &len, FIEMAP_FLAG_SYNC);
+ if (ret)
+ return ret;
+
+ if (start + len < start)
+ return -EINVAL;
+
+ start >>= 9;
+
+ bch2_bkey_buf_init(&cur);
+ bch2_bkey_buf_init(&prev);
+ trans = bch2_trans_get(c);
+retry:
+ bch2_trans_begin(trans);
+
+ ret = bch2_subvolume_get_snapshot(trans, ei->ei_subvol, &snapshot);
+ if (ret)
+ goto err;
+
+ bch2_trans_iter_init(trans, &iter, BTREE_ID_extents,
+ SPOS(ei->v.i_ino, start, snapshot), 0);
+
+ while (!(ret = btree_trans_too_many_iters(trans)) &&
+ (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
+ !(ret = bkey_err(k))) {
+ enum btree_id data_btree = BTREE_ID_extents;
+
+ if (!bkey_extent_is_data(k.k) &&
+ k.k->type != KEY_TYPE_reservation) {
+ bch2_btree_iter_advance(&iter);
+ continue;
+ }
+
+ offset_into_extent = iter.pos.offset -
+ bkey_start_offset(k.k);
+ sectors = k.k->size - offset_into_extent;
+
+ bch2_bkey_buf_reassemble(&cur, c, k);
+
+ ret = bch2_read_indirect_extent(trans, &data_btree,
+ &offset_into_extent, &cur);
+ if (ret)
+ break;
+
+ k = bkey_i_to_s_c(cur.k);
+ bch2_bkey_buf_realloc(&prev, c, k.k->u64s);
+
+ sectors = min(sectors, k.k->size - offset_into_extent);
+
+ bch2_cut_front(POS(k.k->p.inode,
+ bkey_start_offset(k.k) +
+ offset_into_extent),
+ cur.k);
+ bch2_key_resize(&cur.k->k, sectors);
+ cur.k->k.p = iter.pos;
+ cur.k->k.p.offset += cur.k->k.size;
+
+ if (have_extent) {
+ bch2_trans_unlock(trans);
+ ret = bch2_fill_extent(c, info,
+ bkey_i_to_s_c(prev.k), 0);
+ if (ret)
+ break;
+ }
+
+ bkey_copy(prev.k, cur.k);
+ have_extent = true;
+
+ bch2_btree_iter_set_pos(&iter,
+ POS(iter.pos.inode, iter.pos.offset + sectors));
+ }
+ start = iter.pos.offset;
+ bch2_trans_iter_exit(trans, &iter);
+err:
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ goto retry;
+
+ if (!ret && have_extent) {
+ bch2_trans_unlock(trans);
+ ret = bch2_fill_extent(c, info, bkey_i_to_s_c(prev.k),
+ FIEMAP_EXTENT_LAST);
+ }
+
+ bch2_trans_put(trans);
+ bch2_bkey_buf_exit(&cur, c);
+ bch2_bkey_buf_exit(&prev, c);
+ return ret < 0 ? ret : 0;
+}
+
+static const struct vm_operations_struct bch_vm_ops = {
+ .fault = bch2_page_fault,
+ .map_pages = filemap_map_pages,
+ .page_mkwrite = bch2_page_mkwrite,
+};
+
+static int bch2_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ file_accessed(file);
+
+ vma->vm_ops = &bch_vm_ops;
+ return 0;
+}
+
+/* Directories: */
+
+static loff_t bch2_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+ return generic_file_llseek_size(file, offset, whence,
+ S64_MAX, S64_MAX);
+}
+
+static int bch2_vfs_readdir(struct file *file, struct dir_context *ctx)
+{
+ struct bch_inode_info *inode = file_bch_inode(file);
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+ int ret;
+
+ if (!dir_emit_dots(file, ctx))
+ return 0;
+
+ ret = bch2_readdir(c, inode_inum(inode), ctx);
+ if (ret)
+ bch_err_fn(c, ret);
+
+ return bch2_err_class(ret);
+}
+
+static const struct file_operations bch_file_operations = {
+ .llseek = bch2_llseek,
+ .read_iter = bch2_read_iter,
+ .write_iter = bch2_write_iter,
+ .mmap = bch2_mmap,
+ .open = generic_file_open,
+ .fsync = bch2_fsync,
+ .splice_read = filemap_splice_read,
+ .splice_write = iter_file_splice_write,
+ .fallocate = bch2_fallocate_dispatch,
+ .unlocked_ioctl = bch2_fs_file_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = bch2_compat_fs_ioctl,
+#endif
+ .remap_file_range = bch2_remap_file_range,
+};
+
+static const struct inode_operations bch_file_inode_operations = {
+ .getattr = bch2_getattr,
+ .setattr = bch2_setattr,
+ .fiemap = bch2_fiemap,
+ .listxattr = bch2_xattr_list,
+#ifdef CONFIG_BCACHEFS_POSIX_ACL
+ .get_acl = bch2_get_acl,
+ .set_acl = bch2_set_acl,
+#endif
+};
+
+static const struct inode_operations bch_dir_inode_operations = {
+ .lookup = bch2_lookup,
+ .create = bch2_create,
+ .link = bch2_link,
+ .unlink = bch2_unlink,
+ .symlink = bch2_symlink,
+ .mkdir = bch2_mkdir,
+ .rmdir = bch2_unlink,
+ .mknod = bch2_mknod,
+ .rename = bch2_rename2,
+ .getattr = bch2_getattr,
+ .setattr = bch2_setattr,
+ .tmpfile = bch2_tmpfile,
+ .listxattr = bch2_xattr_list,
+#ifdef CONFIG_BCACHEFS_POSIX_ACL
+ .get_acl = bch2_get_acl,
+ .set_acl = bch2_set_acl,
+#endif
+};
+
+static const struct file_operations bch_dir_file_operations = {
+ .llseek = bch2_dir_llseek,
+ .read = generic_read_dir,
+ .iterate_shared = bch2_vfs_readdir,
+ .fsync = bch2_fsync,
+ .unlocked_ioctl = bch2_fs_file_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = bch2_compat_fs_ioctl,
+#endif
+};
+
+static const struct inode_operations bch_symlink_inode_operations = {
+ .get_link = page_get_link,
+ .getattr = bch2_getattr,
+ .setattr = bch2_setattr,
+ .listxattr = bch2_xattr_list,
+#ifdef CONFIG_BCACHEFS_POSIX_ACL
+ .get_acl = bch2_get_acl,
+ .set_acl = bch2_set_acl,
+#endif
+};
+
+static const struct inode_operations bch_special_inode_operations = {
+ .getattr = bch2_getattr,
+ .setattr = bch2_setattr,
+ .listxattr = bch2_xattr_list,
+#ifdef CONFIG_BCACHEFS_POSIX_ACL
+ .get_acl = bch2_get_acl,
+ .set_acl = bch2_set_acl,
+#endif
+};
+
+static const struct address_space_operations bch_address_space_operations = {
+ .read_folio = bch2_read_folio,
+ .writepages = bch2_writepages,
+ .readahead = bch2_readahead,
+ .dirty_folio = filemap_dirty_folio,
+ .write_begin = bch2_write_begin,
+ .write_end = bch2_write_end,
+ .invalidate_folio = bch2_invalidate_folio,
+ .release_folio = bch2_release_folio,
+ .direct_IO = noop_direct_IO,
+#ifdef CONFIG_MIGRATION
+ .migrate_folio = filemap_migrate_folio,
+#endif
+ .error_remove_page = generic_error_remove_page,
+};
+
+struct bcachefs_fid {
+ u64 inum;
+ u32 subvol;
+ u32 gen;
+} __packed;
+
+struct bcachefs_fid_with_parent {
+ struct bcachefs_fid fid;
+ struct bcachefs_fid dir;
+} __packed;
+
+static int bcachefs_fid_valid(int fh_len, int fh_type)
+{
+ switch (fh_type) {
+ case FILEID_BCACHEFS_WITHOUT_PARENT:
+ return fh_len == sizeof(struct bcachefs_fid) / sizeof(u32);
+ case FILEID_BCACHEFS_WITH_PARENT:
+ return fh_len == sizeof(struct bcachefs_fid_with_parent) / sizeof(u32);
+ default:
+ return false;
+ }
+}
+
+static struct bcachefs_fid bch2_inode_to_fid(struct bch_inode_info *inode)
+{
+ return (struct bcachefs_fid) {
+ .inum = inode->ei_inode.bi_inum,
+ .subvol = inode->ei_subvol,
+ .gen = inode->ei_inode.bi_generation,
+ };
+}
+
+static int bch2_encode_fh(struct inode *vinode, u32 *fh, int *len,
+ struct inode *vdir)
+{
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+ struct bch_inode_info *dir = to_bch_ei(vdir);
+
+ if (*len < sizeof(struct bcachefs_fid_with_parent) / sizeof(u32))
+ return FILEID_INVALID;
+
+ if (!S_ISDIR(inode->v.i_mode) && dir) {
+ struct bcachefs_fid_with_parent *fid = (void *) fh;
+
+ fid->fid = bch2_inode_to_fid(inode);
+ fid->dir = bch2_inode_to_fid(dir);
+
+ *len = sizeof(*fid) / sizeof(u32);
+ return FILEID_BCACHEFS_WITH_PARENT;
+ } else {
+ struct bcachefs_fid *fid = (void *) fh;
+
+ *fid = bch2_inode_to_fid(inode);
+
+ *len = sizeof(*fid) / sizeof(u32);
+ return FILEID_BCACHEFS_WITHOUT_PARENT;
+ }
+}
+
+static struct inode *bch2_nfs_get_inode(struct super_block *sb,
+ struct bcachefs_fid fid)
+{
+ struct bch_fs *c = sb->s_fs_info;
+ struct inode *vinode = bch2_vfs_inode_get(c, (subvol_inum) {
+ .subvol = fid.subvol,
+ .inum = fid.inum,
+ });
+ if (!IS_ERR(vinode) && vinode->i_generation != fid.gen) {
+ iput(vinode);
+ vinode = ERR_PTR(-ESTALE);
+ }
+ return vinode;
+}
+
+static struct dentry *bch2_fh_to_dentry(struct super_block *sb, struct fid *_fid,
+ int fh_len, int fh_type)
+{
+ struct bcachefs_fid *fid = (void *) _fid;
+
+ if (!bcachefs_fid_valid(fh_len, fh_type))
+ return NULL;
+
+ return d_obtain_alias(bch2_nfs_get_inode(sb, *fid));
+}
+
+static struct dentry *bch2_fh_to_parent(struct super_block *sb, struct fid *_fid,
+ int fh_len, int fh_type)
+{
+ struct bcachefs_fid_with_parent *fid = (void *) _fid;
+
+ if (!bcachefs_fid_valid(fh_len, fh_type) ||
+ fh_type != FILEID_BCACHEFS_WITH_PARENT)
+ return NULL;
+
+ return d_obtain_alias(bch2_nfs_get_inode(sb, fid->dir));
+}
+
+static struct dentry *bch2_get_parent(struct dentry *child)
+{
+ struct bch_inode_info *inode = to_bch_ei(child->d_inode);
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+ subvol_inum parent_inum = {
+ .subvol = inode->ei_inode.bi_parent_subvol ?:
+ inode->ei_subvol,
+ .inum = inode->ei_inode.bi_dir,
+ };
+
+ if (!parent_inum.inum)
+ return NULL;
+
+ return d_obtain_alias(bch2_vfs_inode_get(c, parent_inum));
+}
+
+static int bch2_get_name(struct dentry *parent, char *name, struct dentry *child)
+{
+ struct bch_inode_info *inode = to_bch_ei(child->d_inode);
+ struct bch_inode_info *dir = to_bch_ei(parent->d_inode);
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+ struct btree_trans *trans;
+ struct btree_iter iter1;
+ struct btree_iter iter2;
+ struct bkey_s_c k;
+ struct bkey_s_c_dirent d;
+ struct bch_inode_unpacked inode_u;
+ subvol_inum target;
+ u32 snapshot;
+ struct qstr dirent_name;
+ unsigned name_len = 0;
+ int ret;
+
+ if (!S_ISDIR(dir->v.i_mode))
+ return -EINVAL;
+
+ trans = bch2_trans_get(c);
+
+ bch2_trans_iter_init(trans, &iter1, BTREE_ID_dirents,
+ POS(dir->ei_inode.bi_inum, 0), 0);
+ bch2_trans_iter_init(trans, &iter2, BTREE_ID_dirents,
+ POS(dir->ei_inode.bi_inum, 0), 0);
+retry:
+ bch2_trans_begin(trans);
+
+ ret = bch2_subvolume_get_snapshot(trans, dir->ei_subvol, &snapshot);
+ if (ret)
+ goto err;
+
+ bch2_btree_iter_set_snapshot(&iter1, snapshot);
+ bch2_btree_iter_set_snapshot(&iter2, snapshot);
+
+ ret = bch2_inode_find_by_inum_trans(trans, inode_inum(inode), &inode_u);
+ if (ret)
+ goto err;
+
+ if (inode_u.bi_dir == dir->ei_inode.bi_inum) {
+ bch2_btree_iter_set_pos(&iter1, POS(inode_u.bi_dir, inode_u.bi_dir_offset));
+
+ k = bch2_btree_iter_peek_slot(&iter1);
+ ret = bkey_err(k);
+ if (ret)
+ goto err;
+
+ if (k.k->type != KEY_TYPE_dirent) {
+ ret = -BCH_ERR_ENOENT_dirent_doesnt_match_inode;
+ goto err;
+ }
+
+ d = bkey_s_c_to_dirent(k);
+ ret = bch2_dirent_read_target(trans, inode_inum(dir), d, &target);
+ if (ret > 0)
+ ret = -BCH_ERR_ENOENT_dirent_doesnt_match_inode;
+ if (ret)
+ goto err;
+
+ if (target.subvol == inode->ei_subvol &&
+ target.inum == inode->ei_inode.bi_inum)
+ goto found;
+ } else {
+ /*
+ * File with multiple hardlinks and our backref is to the wrong
+ * directory - linear search:
+ */
+ for_each_btree_key_continue_norestart(iter2, 0, k, ret) {
+ if (k.k->p.inode > dir->ei_inode.bi_inum)
+ break;
+
+ if (k.k->type != KEY_TYPE_dirent)
+ continue;
+
+ d = bkey_s_c_to_dirent(k);
+ ret = bch2_dirent_read_target(trans, inode_inum(dir), d, &target);
+ if (ret < 0)
+ break;
+ if (ret)
+ continue;
+
+ if (target.subvol == inode->ei_subvol &&
+ target.inum == inode->ei_inode.bi_inum)
+ goto found;
+ }
+ }
+
+ ret = -ENOENT;
+ goto err;
+found:
+ dirent_name = bch2_dirent_get_name(d);
+
+ name_len = min_t(unsigned, dirent_name.len, NAME_MAX);
+ memcpy(name, dirent_name.name, name_len);
+ name[name_len] = '\0';
+err:
+ if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+ goto retry;
+
+ bch2_trans_iter_exit(trans, &iter1);
+ bch2_trans_iter_exit(trans, &iter2);
+ bch2_trans_put(trans);
+
+ return ret;
+}
+
+static const struct export_operations bch_export_ops = {
+ .encode_fh = bch2_encode_fh,
+ .fh_to_dentry = bch2_fh_to_dentry,
+ .fh_to_parent = bch2_fh_to_parent,
+ .get_parent = bch2_get_parent,
+ .get_name = bch2_get_name,
+};
+
+static void bch2_vfs_inode_init(struct btree_trans *trans, subvol_inum inum,
+ struct bch_inode_info *inode,
+ struct bch_inode_unpacked *bi,
+ struct bch_subvolume *subvol)
+{
+ bch2_inode_update_after_write(trans, inode, bi, ~0);
+
+ if (BCH_SUBVOLUME_SNAP(subvol))
+ set_bit(EI_INODE_SNAPSHOT, &inode->ei_flags);
+ else
+ clear_bit(EI_INODE_SNAPSHOT, &inode->ei_flags);
+
+ inode->v.i_blocks = bi->bi_sectors;
+ inode->v.i_ino = bi->bi_inum;
+ inode->v.i_rdev = bi->bi_dev;
+ inode->v.i_generation = bi->bi_generation;
+ inode->v.i_size = bi->bi_size;
+
+ inode->ei_flags = 0;
+ inode->ei_quota_reserved = 0;
+ inode->ei_qid = bch_qid(bi);
+ inode->ei_subvol = inum.subvol;
+
+ inode->v.i_mapping->a_ops = &bch_address_space_operations;
+
+ switch (inode->v.i_mode & S_IFMT) {
+ case S_IFREG:
+ inode->v.i_op = &bch_file_inode_operations;
+ inode->v.i_fop = &bch_file_operations;
+ break;
+ case S_IFDIR:
+ inode->v.i_op = &bch_dir_inode_operations;
+ inode->v.i_fop = &bch_dir_file_operations;
+ break;
+ case S_IFLNK:
+ inode_nohighmem(&inode->v);
+ inode->v.i_op = &bch_symlink_inode_operations;
+ break;
+ default:
+ init_special_inode(&inode->v, inode->v.i_mode, inode->v.i_rdev);
+ inode->v.i_op = &bch_special_inode_operations;
+ break;
+ }
+
+ mapping_set_large_folios(inode->v.i_mapping);
+}
+
+static struct inode *bch2_alloc_inode(struct super_block *sb)
+{
+ struct bch_inode_info *inode;
+
+ inode = kmem_cache_alloc(bch2_inode_cache, GFP_NOFS);
+ if (!inode)
+ return NULL;
+
+ inode_init_once(&inode->v);
+ mutex_init(&inode->ei_update_lock);
+ two_state_lock_init(&inode->ei_pagecache_lock);
+ INIT_LIST_HEAD(&inode->ei_vfs_inode_list);
+ mutex_init(&inode->ei_quota_lock);
+
+ return &inode->v;
+}
+
+static void bch2_i_callback(struct rcu_head *head)
+{
+ struct inode *vinode = container_of(head, struct inode, i_rcu);
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+
+ kmem_cache_free(bch2_inode_cache, inode);
+}
+
+static void bch2_destroy_inode(struct inode *vinode)
+{
+ call_rcu(&vinode->i_rcu, bch2_i_callback);
+}
+
+static int inode_update_times_fn(struct btree_trans *trans,
+ struct bch_inode_info *inode,
+ struct bch_inode_unpacked *bi,
+ void *p)
+{
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+
+ bi->bi_atime = timespec_to_bch2_time(c, inode_get_atime(&inode->v));
+ bi->bi_mtime = timespec_to_bch2_time(c, inode_get_mtime(&inode->v));
+ bi->bi_ctime = timespec_to_bch2_time(c, inode_get_ctime(&inode->v));
+
+ return 0;
+}
+
+static int bch2_vfs_write_inode(struct inode *vinode,
+ struct writeback_control *wbc)
+{
+ struct bch_fs *c = vinode->i_sb->s_fs_info;
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+ int ret;
+
+ mutex_lock(&inode->ei_update_lock);
+ ret = bch2_write_inode(c, inode, inode_update_times_fn, NULL,
+ ATTR_ATIME|ATTR_MTIME|ATTR_CTIME);
+ mutex_unlock(&inode->ei_update_lock);
+
+ return bch2_err_class(ret);
+}
+
+static void bch2_evict_inode(struct inode *vinode)
+{
+ struct bch_fs *c = vinode->i_sb->s_fs_info;
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+
+ truncate_inode_pages_final(&inode->v.i_data);
+
+ clear_inode(&inode->v);
+
+ BUG_ON(!is_bad_inode(&inode->v) && inode->ei_quota_reserved);
+
+ if (!inode->v.i_nlink && !is_bad_inode(&inode->v)) {
+ bch2_quota_acct(c, inode->ei_qid, Q_SPC, -((s64) inode->v.i_blocks),
+ KEY_TYPE_QUOTA_WARN);
+ bch2_quota_acct(c, inode->ei_qid, Q_INO, -1,
+ KEY_TYPE_QUOTA_WARN);
+ bch2_inode_rm(c, inode_inum(inode));
+ }
+
+ mutex_lock(&c->vfs_inodes_lock);
+ list_del_init(&inode->ei_vfs_inode_list);
+ mutex_unlock(&c->vfs_inodes_lock);
+}
+
+void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
+{
+ struct bch_inode_info *inode, **i;
+ DARRAY(struct bch_inode_info *) grabbed;
+ bool clean_pass = false, this_pass_clean;
+
+ /*
+ * Initially, we scan for inodes without I_DONTCACHE, then mark them to
+ * be pruned with d_mark_dontcache().
+ *
+ * Once we've had a clean pass where we didn't find any inodes without
+ * I_DONTCACHE, we wait for them to be freed:
+ */
+
+ darray_init(&grabbed);
+ darray_make_room(&grabbed, 1024);
+again:
+ cond_resched();
+ this_pass_clean = true;
+
+ mutex_lock(&c->vfs_inodes_lock);
+ list_for_each_entry(inode, &c->vfs_inodes_list, ei_vfs_inode_list) {
+ if (!snapshot_list_has_id(s, inode->ei_subvol))
+ continue;
+
+ if (!(inode->v.i_state & I_DONTCACHE) &&
+ !(inode->v.i_state & I_FREEING) &&
+ igrab(&inode->v)) {
+ this_pass_clean = false;
+
+ if (darray_push_gfp(&grabbed, inode, GFP_ATOMIC|__GFP_NOWARN)) {
+ iput(&inode->v);
+ break;
+ }
+ } else if (clean_pass && this_pass_clean) {
+ wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
+ DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
+
+ prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+ mutex_unlock(&c->vfs_inodes_lock);
+
+ schedule();
+ finish_wait(wq, &wait.wq_entry);
+ goto again;
+ }
+ }
+ mutex_unlock(&c->vfs_inodes_lock);
+
+ darray_for_each(grabbed, i) {
+ inode = *i;
+ d_mark_dontcache(&inode->v);
+ d_prune_aliases(&inode->v);
+ iput(&inode->v);
+ }
+ grabbed.nr = 0;
+
+ if (!clean_pass || !this_pass_clean) {
+ clean_pass = this_pass_clean;
+ goto again;
+ }
+
+ darray_exit(&grabbed);
+}
+
+static int bch2_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct bch_fs *c = sb->s_fs_info;
+ struct bch_fs_usage_short usage = bch2_fs_usage_read_short(c);
+ unsigned shift = sb->s_blocksize_bits - 9;
+ /*
+ * this assumes inodes take up 64 bytes, which is a decent average
+ * number:
+ */
+ u64 avail_inodes = ((usage.capacity - usage.used) << 3);
+ u64 fsid;
+
+ buf->f_type = BCACHEFS_STATFS_MAGIC;
+ buf->f_bsize = sb->s_blocksize;
+ buf->f_blocks = usage.capacity >> shift;
+ buf->f_bfree = usage.free >> shift;
+ buf->f_bavail = avail_factor(usage.free) >> shift;
+
+ buf->f_files = usage.nr_inodes + avail_inodes;
+ buf->f_ffree = avail_inodes;
+
+ fsid = le64_to_cpup((void *) c->sb.user_uuid.b) ^
+ le64_to_cpup((void *) c->sb.user_uuid.b + sizeof(u64));
+ buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
+ buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
+ buf->f_namelen = BCH_NAME_MAX;
+
+ return 0;
+}
+
+static int bch2_sync_fs(struct super_block *sb, int wait)
+{
+ struct bch_fs *c = sb->s_fs_info;
+ int ret;
+
+ if (c->opts.journal_flush_disabled)
+ return 0;
+
+ if (!wait) {
+ bch2_journal_flush_async(&c->journal, NULL);
+ return 0;
+ }
+
+ ret = bch2_journal_flush(&c->journal);
+ return bch2_err_class(ret);
+}
+
+static struct bch_fs *bch2_path_to_fs(const char *path)
+{
+ struct bch_fs *c;
+ dev_t dev;
+ int ret;
+
+ ret = lookup_bdev(path, &dev);
+ if (ret)
+ return ERR_PTR(ret);
+
+ c = bch2_dev_to_fs(dev);
+ if (c)
+ closure_put(&c->cl);
+ return c ?: ERR_PTR(-ENOENT);
+}
+
+static char **split_devs(const char *_dev_name, unsigned *nr)
+{
+ char *dev_name = NULL, **devs = NULL, *s;
+ size_t i = 0, nr_devs = 0;
+
+ dev_name = kstrdup(_dev_name, GFP_KERNEL);
+ if (!dev_name)
+ return NULL;
+
+ for (s = dev_name; s; s = strchr(s + 1, ':'))
+ nr_devs++;
+
+ devs = kcalloc(nr_devs + 1, sizeof(const char *), GFP_KERNEL);
+ if (!devs) {
+ kfree(dev_name);
+ return NULL;
+ }
+
+ while ((s = strsep(&dev_name, ":")))
+ devs[i++] = s;
+
+ *nr = nr_devs;
+ return devs;
+}
+
+static int bch2_remount(struct super_block *sb, int *flags, char *data)
+{
+ struct bch_fs *c = sb->s_fs_info;
+ struct bch_opts opts = bch2_opts_empty();
+ int ret;
+
+ opt_set(opts, read_only, (*flags & SB_RDONLY) != 0);
+
+ ret = bch2_parse_mount_opts(c, &opts, data);
+ if (ret)
+ goto err;
+
+ if (opts.read_only != c->opts.read_only) {
+ down_write(&c->state_lock);
+
+ if (opts.read_only) {
+ bch2_fs_read_only(c);
+
+ sb->s_flags |= SB_RDONLY;
+ } else {
+ ret = bch2_fs_read_write(c);
+ if (ret) {
+ bch_err(c, "error going rw: %i", ret);
+ up_write(&c->state_lock);
+ ret = -EINVAL;
+ goto err;
+ }
+
+ sb->s_flags &= ~SB_RDONLY;
+ }
+
+ c->opts.read_only = opts.read_only;
+
+ up_write(&c->state_lock);
+ }
+
+ if (opt_defined(opts, errors))
+ c->opts.errors = opts.errors;
+err:
+ return bch2_err_class(ret);
+}
+
+static int bch2_show_devname(struct seq_file *seq, struct dentry *root)
+{
+ struct bch_fs *c = root->d_sb->s_fs_info;
+ struct bch_dev *ca;
+ unsigned i;
+ bool first = true;
+
+ for_each_online_member(ca, c, i) {
+ if (!first)
+ seq_putc(seq, ':');
+ first = false;
+ seq_puts(seq, "/dev/");
+ seq_puts(seq, ca->name);
+ }
+
+ return 0;
+}
+
+static int bch2_show_options(struct seq_file *seq, struct dentry *root)
+{
+ struct bch_fs *c = root->d_sb->s_fs_info;
+ enum bch_opt_id i;
+ struct printbuf buf = PRINTBUF;
+ int ret = 0;
+
+ for (i = 0; i < bch2_opts_nr; i++) {
+ const struct bch_option *opt = &bch2_opt_table[i];
+ u64 v = bch2_opt_get_by_id(&c->opts, i);
+
+ if (!(opt->flags & OPT_MOUNT))
+ continue;
+
+ if (v == bch2_opt_get_by_id(&bch2_opts_default, i))
+ continue;
+
+ printbuf_reset(&buf);
+ bch2_opt_to_text(&buf, c, c->disk_sb.sb, opt, v,
+ OPT_SHOW_MOUNT_STYLE);
+ seq_putc(seq, ',');
+ seq_puts(seq, buf.buf);
+ }
+
+ if (buf.allocation_failure)
+ ret = -ENOMEM;
+ printbuf_exit(&buf);
+ return ret;
+}
+
+static void bch2_put_super(struct super_block *sb)
+{
+ struct bch_fs *c = sb->s_fs_info;
+
+ __bch2_fs_stop(c);
+}
+
+/*
+ * bcachefs doesn't currently integrate intwrite freeze protection but the
+ * internal write references serve the same purpose. Therefore reuse the
+ * read-only transition code to perform the quiesce. The caveat is that we don't
+ * currently have the ability to block tasks that want a write reference while
+ * the superblock is frozen. This is fine for now, but we should either add
+ * blocking support or find a way to integrate sb_start_intwrite() and friends.
+ */
+static int bch2_freeze(struct super_block *sb)
+{
+ struct bch_fs *c = sb->s_fs_info;
+
+ down_write(&c->state_lock);
+ bch2_fs_read_only(c);
+ up_write(&c->state_lock);
+ return 0;
+}
+
+static int bch2_unfreeze(struct super_block *sb)
+{
+ struct bch_fs *c = sb->s_fs_info;
+ int ret;
+
+ down_write(&c->state_lock);
+ ret = bch2_fs_read_write(c);
+ up_write(&c->state_lock);
+ return ret;
+}
+
+static const struct super_operations bch_super_operations = {
+ .alloc_inode = bch2_alloc_inode,
+ .destroy_inode = bch2_destroy_inode,
+ .write_inode = bch2_vfs_write_inode,
+ .evict_inode = bch2_evict_inode,
+ .sync_fs = bch2_sync_fs,
+ .statfs = bch2_statfs,
+ .show_devname = bch2_show_devname,
+ .show_options = bch2_show_options,
+ .remount_fs = bch2_remount,
+ .put_super = bch2_put_super,
+ .freeze_fs = bch2_freeze,
+ .unfreeze_fs = bch2_unfreeze,
+};
+
+static int bch2_set_super(struct super_block *s, void *data)
+{
+ s->s_fs_info = data;
+ return 0;
+}
+
+static int bch2_noset_super(struct super_block *s, void *data)
+{
+ return -EBUSY;
+}
+
+static int bch2_test_super(struct super_block *s, void *data)
+{
+ struct bch_fs *c = s->s_fs_info;
+ struct bch_fs **devs = data;
+ unsigned i;
+
+ if (!c)
+ return false;
+
+ for (i = 0; devs[i]; i++)
+ if (c != devs[i])
+ return false;
+ return true;
+}
+
+static struct dentry *bch2_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ struct bch_fs *c;
+ struct bch_dev *ca;
+ struct super_block *sb;
+ struct inode *vinode;
+ struct bch_opts opts = bch2_opts_empty();
+ char **devs;
+ struct bch_fs **devs_to_fs = NULL;
+ unsigned i, nr_devs;
+ int ret;
+
+ opt_set(opts, read_only, (flags & SB_RDONLY) != 0);
+
+ ret = bch2_parse_mount_opts(NULL, &opts, data);
+ if (ret)
+ return ERR_PTR(ret);
+
+ if (!dev_name || strlen(dev_name) == 0)
+ return ERR_PTR(-EINVAL);
+
+ devs = split_devs(dev_name, &nr_devs);
+ if (!devs)
+ return ERR_PTR(-ENOMEM);
+
+ devs_to_fs = kcalloc(nr_devs + 1, sizeof(void *), GFP_KERNEL);
+ if (!devs_to_fs) {
+ sb = ERR_PTR(-ENOMEM);
+ goto got_sb;
+ }
+
+ for (i = 0; i < nr_devs; i++)
+ devs_to_fs[i] = bch2_path_to_fs(devs[i]);
+
+ sb = sget(fs_type, bch2_test_super, bch2_noset_super,
+ flags|SB_NOSEC, devs_to_fs);
+ if (!IS_ERR(sb))
+ goto got_sb;
+
+ c = bch2_fs_open(devs, nr_devs, opts);
+ if (IS_ERR(c)) {
+ sb = ERR_CAST(c);
+ goto got_sb;
+ }
+
+ /* Some options can't be parsed until after the fs is started: */
+ ret = bch2_parse_mount_opts(c, &opts, data);
+ if (ret) {
+ bch2_fs_stop(c);
+ sb = ERR_PTR(ret);
+ goto got_sb;
+ }
+
+ bch2_opts_apply(&c->opts, opts);
+
+ sb = sget(fs_type, NULL, bch2_set_super, flags|SB_NOSEC, c);
+ if (IS_ERR(sb))
+ bch2_fs_stop(c);
+got_sb:
+ kfree(devs_to_fs);
+ kfree(devs[0]);
+ kfree(devs);
+
+ if (IS_ERR(sb)) {
+ ret = PTR_ERR(sb);
+ ret = bch2_err_class(ret);
+ return ERR_PTR(ret);
+ }
+
+ c = sb->s_fs_info;
+
+ if (sb->s_root) {
+ if ((flags ^ sb->s_flags) & SB_RDONLY) {
+ ret = -EBUSY;
+ goto err_put_super;
+ }
+ goto out;
+ }
+
+ sb->s_blocksize = block_bytes(c);
+ sb->s_blocksize_bits = ilog2(block_bytes(c));
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+ sb->s_op = &bch_super_operations;
+ sb->s_export_op = &bch_export_ops;
+#ifdef CONFIG_BCACHEFS_QUOTA
+ sb->s_qcop = &bch2_quotactl_operations;
+ sb->s_quota_types = QTYPE_MASK_USR|QTYPE_MASK_GRP|QTYPE_MASK_PRJ;
+#endif
+ sb->s_xattr = bch2_xattr_handlers;
+ sb->s_magic = BCACHEFS_STATFS_MAGIC;
+ sb->s_time_gran = c->sb.nsec_per_time_unit;
+ sb->s_time_min = div_s64(S64_MIN, c->sb.time_units_per_sec) + 1;
+ sb->s_time_max = div_s64(S64_MAX, c->sb.time_units_per_sec);
+ c->vfs_sb = sb;
+ strscpy(sb->s_id, c->name, sizeof(sb->s_id));
+
+ ret = super_setup_bdi(sb);
+ if (ret)
+ goto err_put_super;
+
+ sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
+
+ for_each_online_member(ca, c, i) {
+ struct block_device *bdev = ca->disk_sb.bdev;
+
+ /* XXX: create an anonymous device for multi device filesystems */
+ sb->s_bdev = bdev;
+ sb->s_dev = bdev->bd_dev;
+ percpu_ref_put(&ca->io_ref);
+ break;
+ }
+
+ c->dev = sb->s_dev;
+
+#ifdef CONFIG_BCACHEFS_POSIX_ACL
+ if (c->opts.acl)
+ sb->s_flags |= SB_POSIXACL;
+#endif
+
++ sb->s_shrink->seeks = 0;
+
+ vinode = bch2_vfs_inode_get(c, BCACHEFS_ROOT_SUBVOL_INUM);
+ ret = PTR_ERR_OR_ZERO(vinode);
+ if (ret) {
+ bch_err_msg(c, ret, "mounting: error getting root inode");
+ goto err_put_super;
+ }
+
+ sb->s_root = d_make_root(vinode);
+ if (!sb->s_root) {
+ bch_err(c, "error mounting: error allocating root dentry");
+ ret = -ENOMEM;
+ goto err_put_super;
+ }
+
+ sb->s_flags |= SB_ACTIVE;
+out:
+ return dget(sb->s_root);
+
+err_put_super:
+ sb->s_fs_info = NULL;
+ c->vfs_sb = NULL;
+ deactivate_locked_super(sb);
+ bch2_fs_stop(c);
+ return ERR_PTR(bch2_err_class(ret));
+}
+
+static void bch2_kill_sb(struct super_block *sb)
+{
+ struct bch_fs *c = sb->s_fs_info;
+
+ if (c)
+ c->vfs_sb = NULL;
+ generic_shutdown_super(sb);
+ if (c)
+ bch2_fs_free(c);
+}
+
+static struct file_system_type bcache_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "bcachefs",
+ .mount = bch2_mount,
+ .kill_sb = bch2_kill_sb,
+ .fs_flags = FS_REQUIRES_DEV,
+};
+
+MODULE_ALIAS_FS("bcachefs");
+
+void bch2_vfs_exit(void)
+{
+ unregister_filesystem(&bcache_fs_type);
+ kmem_cache_destroy(bch2_inode_cache);
+}
+
+int __init bch2_vfs_init(void)
+{
+ int ret = -ENOMEM;
+
+ bch2_inode_cache = KMEM_CACHE(bch_inode_info, SLAB_RECLAIM_ACCOUNT);
+ if (!bch2_inode_cache)
+ goto err;
+
+ ret = register_filesystem(&bcache_fs_type);
+ if (ret)
+ goto err;
+
+ return 0;
+err:
+ bch2_vfs_exit();
+ return ret;
+}
+
+#endif /* NO_BCACHEFS_FS */
--- /dev/null
- c->btree_cache.shrink.scan_objects(&c->btree_cache.shrink, &sc);
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * bcache sysfs interfaces
+ *
+ * Copyright 2012 Google, Inc.
+ */
+
+#ifndef NO_BCACHEFS_SYSFS
+
+#include "bcachefs.h"
+#include "alloc_background.h"
+#include "alloc_foreground.h"
+#include "sysfs.h"
+#include "btree_cache.h"
+#include "btree_io.h"
+#include "btree_iter.h"
+#include "btree_key_cache.h"
+#include "btree_update.h"
+#include "btree_update_interior.h"
+#include "btree_gc.h"
+#include "buckets.h"
+#include "clock.h"
+#include "disk_groups.h"
+#include "ec.h"
+#include "inode.h"
+#include "journal.h"
+#include "keylist.h"
+#include "move.h"
+#include "movinggc.h"
+#include "nocow_locking.h"
+#include "opts.h"
+#include "rebalance.h"
+#include "replicas.h"
+#include "super-io.h"
+#include "tests.h"
+
+#include <linux/blkdev.h>
+#include <linux/sort.h>
+#include <linux/sched/clock.h>
+
+#include "util.h"
+
+#define SYSFS_OPS(type) \
+const struct sysfs_ops type ## _sysfs_ops = { \
+ .show = type ## _show, \
+ .store = type ## _store \
+}
+
+#define SHOW(fn) \
+static ssize_t fn ## _to_text(struct printbuf *, \
+ struct kobject *, struct attribute *); \
+ \
+static ssize_t fn ## _show(struct kobject *kobj, struct attribute *attr,\
+ char *buf) \
+{ \
+ struct printbuf out = PRINTBUF; \
+ ssize_t ret = fn ## _to_text(&out, kobj, attr); \
+ \
+ if (out.pos && out.buf[out.pos - 1] != '\n') \
+ prt_newline(&out); \
+ \
+ if (!ret && out.allocation_failure) \
+ ret = -ENOMEM; \
+ \
+ if (!ret) { \
+ ret = min_t(size_t, out.pos, PAGE_SIZE - 1); \
+ memcpy(buf, out.buf, ret); \
+ } \
+ printbuf_exit(&out); \
+ return bch2_err_class(ret); \
+} \
+ \
+static ssize_t fn ## _to_text(struct printbuf *out, struct kobject *kobj,\
+ struct attribute *attr)
+
+#define STORE(fn) \
+static ssize_t fn ## _store_inner(struct kobject *, struct attribute *,\
+ const char *, size_t); \
+ \
+static ssize_t fn ## _store(struct kobject *kobj, struct attribute *attr,\
+ const char *buf, size_t size) \
+{ \
+ return bch2_err_class(fn##_store_inner(kobj, attr, buf, size)); \
+} \
+ \
+static ssize_t fn ## _store_inner(struct kobject *kobj, struct attribute *attr,\
+ const char *buf, size_t size)
+
+#define __sysfs_attribute(_name, _mode) \
+ static struct attribute sysfs_##_name = \
+ { .name = #_name, .mode = _mode }
+
+#define write_attribute(n) __sysfs_attribute(n, 0200)
+#define read_attribute(n) __sysfs_attribute(n, 0444)
+#define rw_attribute(n) __sysfs_attribute(n, 0644)
+
+#define sysfs_printf(file, fmt, ...) \
+do { \
+ if (attr == &sysfs_ ## file) \
+ prt_printf(out, fmt "\n", __VA_ARGS__); \
+} while (0)
+
+#define sysfs_print(file, var) \
+do { \
+ if (attr == &sysfs_ ## file) \
+ snprint(out, var); \
+} while (0)
+
+#define sysfs_hprint(file, val) \
+do { \
+ if (attr == &sysfs_ ## file) \
+ prt_human_readable_s64(out, val); \
+} while (0)
+
+#define sysfs_strtoul(file, var) \
+do { \
+ if (attr == &sysfs_ ## file) \
+ return strtoul_safe(buf, var) ?: (ssize_t) size; \
+} while (0)
+
+#define sysfs_strtoul_clamp(file, var, min, max) \
+do { \
+ if (attr == &sysfs_ ## file) \
+ return strtoul_safe_clamp(buf, var, min, max) \
+ ?: (ssize_t) size; \
+} while (0)
+
+#define strtoul_or_return(cp) \
+({ \
+ unsigned long _v; \
+ int _r = kstrtoul(cp, 10, &_v); \
+ if (_r) \
+ return _r; \
+ _v; \
+})
+
+write_attribute(trigger_gc);
+write_attribute(trigger_discards);
+write_attribute(trigger_invalidates);
+write_attribute(prune_cache);
+write_attribute(btree_wakeup);
+rw_attribute(btree_gc_periodic);
+rw_attribute(gc_gens_pos);
+
+read_attribute(uuid);
+read_attribute(minor);
+read_attribute(bucket_size);
+read_attribute(first_bucket);
+read_attribute(nbuckets);
+rw_attribute(durability);
+read_attribute(iodone);
+
+read_attribute(io_latency_read);
+read_attribute(io_latency_write);
+read_attribute(io_latency_stats_read);
+read_attribute(io_latency_stats_write);
+read_attribute(congested);
+
+read_attribute(btree_write_stats);
+
+read_attribute(btree_cache_size);
+read_attribute(compression_stats);
+read_attribute(journal_debug);
+read_attribute(btree_updates);
+read_attribute(btree_cache);
+read_attribute(btree_key_cache);
+read_attribute(stripes_heap);
+read_attribute(open_buckets);
+read_attribute(open_buckets_partial);
+read_attribute(write_points);
+read_attribute(nocow_lock_table);
+
+#ifdef BCH_WRITE_REF_DEBUG
+read_attribute(write_refs);
+
+static const char * const bch2_write_refs[] = {
+#define x(n) #n,
+ BCH_WRITE_REFS()
+#undef x
+ NULL
+};
+
+static void bch2_write_refs_to_text(struct printbuf *out, struct bch_fs *c)
+{
+ bch2_printbuf_tabstop_push(out, 24);
+
+ for (unsigned i = 0; i < ARRAY_SIZE(c->writes); i++) {
+ prt_str(out, bch2_write_refs[i]);
+ prt_tab(out);
+ prt_printf(out, "%li", atomic_long_read(&c->writes[i]));
+ prt_newline(out);
+ }
+}
+#endif
+
+read_attribute(internal_uuid);
+read_attribute(disk_groups);
+
+read_attribute(has_data);
+read_attribute(alloc_debug);
+
+#define x(t, n, ...) read_attribute(t);
+BCH_PERSISTENT_COUNTERS()
+#undef x
+
+rw_attribute(discard);
+rw_attribute(label);
+
+rw_attribute(copy_gc_enabled);
+read_attribute(copy_gc_wait);
+
+rw_attribute(rebalance_enabled);
+sysfs_pd_controller_attribute(rebalance);
+read_attribute(rebalance_work);
+rw_attribute(promote_whole_extents);
+
+read_attribute(new_stripes);
+
+read_attribute(io_timers_read);
+read_attribute(io_timers_write);
+
+read_attribute(moving_ctxts);
+
+#ifdef CONFIG_BCACHEFS_TESTS
+write_attribute(perf_test);
+#endif /* CONFIG_BCACHEFS_TESTS */
+
+#define x(_name) \
+ static struct attribute sysfs_time_stat_##_name = \
+ { .name = #_name, .mode = 0444 };
+ BCH_TIME_STATS()
+#undef x
+
+static struct attribute sysfs_state_rw = {
+ .name = "state",
+ .mode = 0444,
+};
+
+static size_t bch2_btree_cache_size(struct bch_fs *c)
+{
+ size_t ret = 0;
+ struct btree *b;
+
+ mutex_lock(&c->btree_cache.lock);
+ list_for_each_entry(b, &c->btree_cache.live, list)
+ ret += btree_bytes(c);
+
+ mutex_unlock(&c->btree_cache.lock);
+ return ret;
+}
+
+static int bch2_compression_stats_to_text(struct printbuf *out, struct bch_fs *c)
+{
+ struct btree_trans *trans;
+ struct btree_iter iter;
+ struct bkey_s_c k;
+ enum btree_id id;
+ u64 nr_uncompressed_extents = 0,
+ nr_compressed_extents = 0,
+ nr_incompressible_extents = 0,
+ uncompressed_sectors = 0,
+ incompressible_sectors = 0,
+ compressed_sectors_compressed = 0,
+ compressed_sectors_uncompressed = 0;
+ int ret = 0;
+
+ if (!test_bit(BCH_FS_STARTED, &c->flags))
+ return -EPERM;
+
+ trans = bch2_trans_get(c);
+
+ for (id = 0; id < BTREE_ID_NR; id++) {
+ if (!btree_type_has_ptrs(id))
+ continue;
+
+ for_each_btree_key(trans, iter, id, POS_MIN,
+ BTREE_ITER_ALL_SNAPSHOTS, k, ret) {
+ struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
+ const union bch_extent_entry *entry;
+ struct extent_ptr_decoded p;
+ bool compressed = false, uncompressed = false, incompressible = false;
+
+ bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
+ switch (p.crc.compression_type) {
+ case BCH_COMPRESSION_TYPE_none:
+ uncompressed = true;
+ uncompressed_sectors += k.k->size;
+ break;
+ case BCH_COMPRESSION_TYPE_incompressible:
+ incompressible = true;
+ incompressible_sectors += k.k->size;
+ break;
+ default:
+ compressed_sectors_compressed +=
+ p.crc.compressed_size;
+ compressed_sectors_uncompressed +=
+ p.crc.uncompressed_size;
+ compressed = true;
+ break;
+ }
+ }
+
+ if (incompressible)
+ nr_incompressible_extents++;
+ else if (uncompressed)
+ nr_uncompressed_extents++;
+ else if (compressed)
+ nr_compressed_extents++;
+ }
+ bch2_trans_iter_exit(trans, &iter);
+ }
+
+ bch2_trans_put(trans);
+
+ if (ret)
+ return ret;
+
+ prt_printf(out, "uncompressed:\n");
+ prt_printf(out, " nr extents: %llu\n", nr_uncompressed_extents);
+ prt_printf(out, " size: ");
+ prt_human_readable_u64(out, uncompressed_sectors << 9);
+ prt_printf(out, "\n");
+
+ prt_printf(out, "compressed:\n");
+ prt_printf(out, " nr extents: %llu\n", nr_compressed_extents);
+ prt_printf(out, " compressed size: ");
+ prt_human_readable_u64(out, compressed_sectors_compressed << 9);
+ prt_printf(out, "\n");
+ prt_printf(out, " uncompressed size: ");
+ prt_human_readable_u64(out, compressed_sectors_uncompressed << 9);
+ prt_printf(out, "\n");
+
+ prt_printf(out, "incompressible:\n");
+ prt_printf(out, " nr extents: %llu\n", nr_incompressible_extents);
+ prt_printf(out, " size: ");
+ prt_human_readable_u64(out, incompressible_sectors << 9);
+ prt_printf(out, "\n");
+ return 0;
+}
+
+static void bch2_gc_gens_pos_to_text(struct printbuf *out, struct bch_fs *c)
+{
+ prt_printf(out, "%s: ", bch2_btree_ids[c->gc_gens_btree]);
+ bch2_bpos_to_text(out, c->gc_gens_pos);
+ prt_printf(out, "\n");
+}
+
+static void bch2_btree_wakeup_all(struct bch_fs *c)
+{
+ struct btree_trans *trans;
+
+ seqmutex_lock(&c->btree_trans_lock);
+ list_for_each_entry(trans, &c->btree_trans_list, list) {
+ struct btree_bkey_cached_common *b = READ_ONCE(trans->locking);
+
+ if (b)
+ six_lock_wakeup_all(&b->lock);
+
+ }
+ seqmutex_unlock(&c->btree_trans_lock);
+}
+
+SHOW(bch2_fs)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, kobj);
+
+ sysfs_print(minor, c->minor);
+ sysfs_printf(internal_uuid, "%pU", c->sb.uuid.b);
+
+ sysfs_hprint(btree_cache_size, bch2_btree_cache_size(c));
+
+ if (attr == &sysfs_btree_write_stats)
+ bch2_btree_write_stats_to_text(out, c);
+
+ sysfs_printf(btree_gc_periodic, "%u", (int) c->btree_gc_periodic);
+
+ if (attr == &sysfs_gc_gens_pos)
+ bch2_gc_gens_pos_to_text(out, c);
+
+ sysfs_printf(copy_gc_enabled, "%i", c->copy_gc_enabled);
+
+ sysfs_printf(rebalance_enabled, "%i", c->rebalance.enabled);
+ sysfs_pd_controller_show(rebalance, &c->rebalance.pd); /* XXX */
+
+ if (attr == &sysfs_copy_gc_wait)
+ bch2_copygc_wait_to_text(out, c);
+
+ if (attr == &sysfs_rebalance_work)
+ bch2_rebalance_work_to_text(out, c);
+
+ sysfs_print(promote_whole_extents, c->promote_whole_extents);
+
+ /* Debugging: */
+
+ if (attr == &sysfs_journal_debug)
+ bch2_journal_debug_to_text(out, &c->journal);
+
+ if (attr == &sysfs_btree_updates)
+ bch2_btree_updates_to_text(out, c);
+
+ if (attr == &sysfs_btree_cache)
+ bch2_btree_cache_to_text(out, c);
+
+ if (attr == &sysfs_btree_key_cache)
+ bch2_btree_key_cache_to_text(out, &c->btree_key_cache);
+
+ if (attr == &sysfs_stripes_heap)
+ bch2_stripes_heap_to_text(out, c);
+
+ if (attr == &sysfs_open_buckets)
+ bch2_open_buckets_to_text(out, c);
+
+ if (attr == &sysfs_open_buckets_partial)
+ bch2_open_buckets_partial_to_text(out, c);
+
+ if (attr == &sysfs_write_points)
+ bch2_write_points_to_text(out, c);
+
+ if (attr == &sysfs_compression_stats)
+ bch2_compression_stats_to_text(out, c);
+
+ if (attr == &sysfs_new_stripes)
+ bch2_new_stripes_to_text(out, c);
+
+ if (attr == &sysfs_io_timers_read)
+ bch2_io_timers_to_text(out, &c->io_clock[READ]);
+
+ if (attr == &sysfs_io_timers_write)
+ bch2_io_timers_to_text(out, &c->io_clock[WRITE]);
+
+ if (attr == &sysfs_moving_ctxts)
+ bch2_fs_moving_ctxts_to_text(out, c);
+
+#ifdef BCH_WRITE_REF_DEBUG
+ if (attr == &sysfs_write_refs)
+ bch2_write_refs_to_text(out, c);
+#endif
+
+ if (attr == &sysfs_nocow_lock_table)
+ bch2_nocow_locks_to_text(out, &c->nocow_locks);
+
+ if (attr == &sysfs_disk_groups)
+ bch2_disk_groups_to_text(out, c);
+
+ return 0;
+}
+
+STORE(bch2_fs)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, kobj);
+
+ if (attr == &sysfs_btree_gc_periodic) {
+ ssize_t ret = strtoul_safe(buf, c->btree_gc_periodic)
+ ?: (ssize_t) size;
+
+ wake_up_process(c->gc_thread);
+ return ret;
+ }
+
+ if (attr == &sysfs_copy_gc_enabled) {
+ ssize_t ret = strtoul_safe(buf, c->copy_gc_enabled)
+ ?: (ssize_t) size;
+
+ if (c->copygc_thread)
+ wake_up_process(c->copygc_thread);
+ return ret;
+ }
+
+ if (attr == &sysfs_rebalance_enabled) {
+ ssize_t ret = strtoul_safe(buf, c->rebalance.enabled)
+ ?: (ssize_t) size;
+
+ rebalance_wakeup(c);
+ return ret;
+ }
+
+ sysfs_pd_controller_store(rebalance, &c->rebalance.pd);
+
+ sysfs_strtoul(promote_whole_extents, c->promote_whole_extents);
+
+ /* Debugging: */
+
+ if (!test_bit(BCH_FS_STARTED, &c->flags))
+ return -EPERM;
+
+ /* Debugging: */
+
+ if (!test_bit(BCH_FS_RW, &c->flags))
+ return -EROFS;
+
+ if (attr == &sysfs_prune_cache) {
+ struct shrink_control sc;
+
+ sc.gfp_mask = GFP_KERNEL;
+ sc.nr_to_scan = strtoul_or_return(buf);
++ c->btree_cache.shrink->scan_objects(c->btree_cache.shrink, &sc);
+ }
+
+ if (attr == &sysfs_btree_wakeup)
+ bch2_btree_wakeup_all(c);
+
+ if (attr == &sysfs_trigger_gc) {
+ /*
+ * Full gc is currently incompatible with btree key cache:
+ */
+#if 0
+ down_read(&c->state_lock);
+ bch2_gc(c, false, false);
+ up_read(&c->state_lock);
+#else
+ bch2_gc_gens(c);
+#endif
+ }
+
+ if (attr == &sysfs_trigger_discards)
+ bch2_do_discards(c);
+
+ if (attr == &sysfs_trigger_invalidates)
+ bch2_do_invalidates(c);
+
+#ifdef CONFIG_BCACHEFS_TESTS
+ if (attr == &sysfs_perf_test) {
+ char *tmp = kstrdup(buf, GFP_KERNEL), *p = tmp;
+ char *test = strsep(&p, " \t\n");
+ char *nr_str = strsep(&p, " \t\n");
+ char *threads_str = strsep(&p, " \t\n");
+ unsigned threads;
+ u64 nr;
+ int ret = -EINVAL;
+
+ if (threads_str &&
+ !(ret = kstrtouint(threads_str, 10, &threads)) &&
+ !(ret = bch2_strtoull_h(nr_str, &nr)))
+ ret = bch2_btree_perf_test(c, test, nr, threads);
+ kfree(tmp);
+
+ if (ret)
+ size = ret;
+ }
+#endif
+ return size;
+}
+SYSFS_OPS(bch2_fs);
+
+struct attribute *bch2_fs_files[] = {
+ &sysfs_minor,
+ &sysfs_btree_cache_size,
+ &sysfs_btree_write_stats,
+
+ &sysfs_promote_whole_extents,
+
+ &sysfs_compression_stats,
+
+#ifdef CONFIG_BCACHEFS_TESTS
+ &sysfs_perf_test,
+#endif
+ NULL
+};
+
+/* counters dir */
+
+SHOW(bch2_fs_counters)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, counters_kobj);
+ u64 counter = 0;
+ u64 counter_since_mount = 0;
+
+ printbuf_tabstop_push(out, 32);
+
+ #define x(t, ...) \
+ if (attr == &sysfs_##t) { \
+ counter = percpu_u64_get(&c->counters[BCH_COUNTER_##t]);\
+ counter_since_mount = counter - c->counters_on_mount[BCH_COUNTER_##t];\
+ prt_printf(out, "since mount:"); \
+ prt_tab(out); \
+ prt_human_readable_u64(out, counter_since_mount); \
+ prt_newline(out); \
+ \
+ prt_printf(out, "since filesystem creation:"); \
+ prt_tab(out); \
+ prt_human_readable_u64(out, counter); \
+ prt_newline(out); \
+ }
+ BCH_PERSISTENT_COUNTERS()
+ #undef x
+ return 0;
+}
+
+STORE(bch2_fs_counters) {
+ return 0;
+}
+
+SYSFS_OPS(bch2_fs_counters);
+
+struct attribute *bch2_fs_counters_files[] = {
+#define x(t, ...) \
+ &sysfs_##t,
+ BCH_PERSISTENT_COUNTERS()
+#undef x
+ NULL
+};
+/* internal dir - just a wrapper */
+
+SHOW(bch2_fs_internal)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, internal);
+
+ return bch2_fs_to_text(out, &c->kobj, attr);
+}
+
+STORE(bch2_fs_internal)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, internal);
+
+ return bch2_fs_store(&c->kobj, attr, buf, size);
+}
+SYSFS_OPS(bch2_fs_internal);
+
+struct attribute *bch2_fs_internal_files[] = {
+ &sysfs_journal_debug,
+ &sysfs_btree_updates,
+ &sysfs_btree_cache,
+ &sysfs_btree_key_cache,
+ &sysfs_new_stripes,
+ &sysfs_stripes_heap,
+ &sysfs_open_buckets,
+ &sysfs_open_buckets_partial,
+ &sysfs_write_points,
+#ifdef BCH_WRITE_REF_DEBUG
+ &sysfs_write_refs,
+#endif
+ &sysfs_nocow_lock_table,
+ &sysfs_io_timers_read,
+ &sysfs_io_timers_write,
+
+ &sysfs_trigger_gc,
+ &sysfs_trigger_discards,
+ &sysfs_trigger_invalidates,
+ &sysfs_prune_cache,
+ &sysfs_btree_wakeup,
+
+ &sysfs_gc_gens_pos,
+
+ &sysfs_copy_gc_enabled,
+ &sysfs_copy_gc_wait,
+
+ &sysfs_rebalance_enabled,
+ &sysfs_rebalance_work,
+ sysfs_pd_controller_files(rebalance),
+
+ &sysfs_moving_ctxts,
+
+ &sysfs_internal_uuid,
+
+ &sysfs_disk_groups,
+ NULL
+};
+
+/* options */
+
+SHOW(bch2_fs_opts_dir)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
+ const struct bch_option *opt = container_of(attr, struct bch_option, attr);
+ int id = opt - bch2_opt_table;
+ u64 v = bch2_opt_get_by_id(&c->opts, id);
+
+ bch2_opt_to_text(out, c, c->disk_sb.sb, opt, v, OPT_SHOW_FULL_LIST);
+ prt_char(out, '\n');
+
+ return 0;
+}
+
+STORE(bch2_fs_opts_dir)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
+ const struct bch_option *opt = container_of(attr, struct bch_option, attr);
+ int ret, id = opt - bch2_opt_table;
+ char *tmp;
+ u64 v;
+
+ /*
+ * We don't need to take c->writes for correctness, but it eliminates an
+ * unsightly error message in the dmesg log when we're RO:
+ */
+ if (unlikely(!bch2_write_ref_tryget(c, BCH_WRITE_REF_sysfs)))
+ return -EROFS;
+
+ tmp = kstrdup(buf, GFP_KERNEL);
+ if (!tmp) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = bch2_opt_parse(c, opt, strim(tmp), &v, NULL);
+ kfree(tmp);
+
+ if (ret < 0)
+ goto err;
+
+ ret = bch2_opt_check_may_set(c, id, v);
+ if (ret < 0)
+ goto err;
+
+ bch2_opt_set_sb(c, opt, v);
+ bch2_opt_set_by_id(&c->opts, id, v);
+
+ if ((id == Opt_background_target ||
+ id == Opt_background_compression) && v) {
+ bch2_rebalance_add_work(c, S64_MAX);
+ rebalance_wakeup(c);
+ }
+
+ ret = size;
+err:
+ bch2_write_ref_put(c, BCH_WRITE_REF_sysfs);
+ return ret;
+}
+SYSFS_OPS(bch2_fs_opts_dir);
+
+struct attribute *bch2_fs_opts_dir_files[] = { NULL };
+
+int bch2_opts_create_sysfs_files(struct kobject *kobj)
+{
+ const struct bch_option *i;
+ int ret;
+
+ for (i = bch2_opt_table;
+ i < bch2_opt_table + bch2_opts_nr;
+ i++) {
+ if (!(i->flags & OPT_FS))
+ continue;
+
+ ret = sysfs_create_file(kobj, &i->attr);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/* time stats */
+
+SHOW(bch2_fs_time_stats)
+{
+ struct bch_fs *c = container_of(kobj, struct bch_fs, time_stats);
+
+#define x(name) \
+ if (attr == &sysfs_time_stat_##name) \
+ bch2_time_stats_to_text(out, &c->times[BCH_TIME_##name]);
+ BCH_TIME_STATS()
+#undef x
+
+ return 0;
+}
+
+STORE(bch2_fs_time_stats)
+{
+ return size;
+}
+SYSFS_OPS(bch2_fs_time_stats);
+
+struct attribute *bch2_fs_time_stats_files[] = {
+#define x(name) \
+ &sysfs_time_stat_##name,
+ BCH_TIME_STATS()
+#undef x
+ NULL
+};
+
+static void dev_alloc_debug_to_text(struct printbuf *out, struct bch_dev *ca)
+{
+ struct bch_fs *c = ca->fs;
+ struct bch_dev_usage stats = bch2_dev_usage_read(ca);
+ unsigned i, nr[BCH_DATA_NR];
+
+ memset(nr, 0, sizeof(nr));
+
+ for (i = 0; i < ARRAY_SIZE(c->open_buckets); i++)
+ nr[c->open_buckets[i].data_type]++;
+
+ printbuf_tabstop_push(out, 8);
+ printbuf_tabstop_push(out, 16);
+ printbuf_tabstop_push(out, 16);
+ printbuf_tabstop_push(out, 16);
+ printbuf_tabstop_push(out, 16);
+
+ prt_tab(out);
+ prt_str(out, "buckets");
+ prt_tab_rjust(out);
+ prt_str(out, "sectors");
+ prt_tab_rjust(out);
+ prt_str(out, "fragmented");
+ prt_tab_rjust(out);
+ prt_newline(out);
+
+ for (i = 0; i < BCH_DATA_NR; i++) {
+ prt_str(out, bch2_data_types[i]);
+ prt_tab(out);
+ prt_u64(out, stats.d[i].buckets);
+ prt_tab_rjust(out);
+ prt_u64(out, stats.d[i].sectors);
+ prt_tab_rjust(out);
+ prt_u64(out, stats.d[i].fragmented);
+ prt_tab_rjust(out);
+ prt_newline(out);
+ }
+
+ prt_str(out, "ec");
+ prt_tab(out);
+ prt_u64(out, stats.buckets_ec);
+ prt_tab_rjust(out);
+ prt_newline(out);
+
+ prt_newline(out);
+
+ prt_printf(out, "reserves:");
+ prt_newline(out);
+ for (i = 0; i < BCH_WATERMARK_NR; i++) {
+ prt_str(out, bch2_watermarks[i]);
+ prt_tab(out);
+ prt_u64(out, bch2_dev_buckets_reserved(ca, i));
+ prt_tab_rjust(out);
+ prt_newline(out);
+ }
+
+ prt_newline(out);
+
+ printbuf_tabstops_reset(out);
+ printbuf_tabstop_push(out, 24);
+
+ prt_str(out, "freelist_wait");
+ prt_tab(out);
+ prt_str(out, c->freelist_wait.list.first ? "waiting" : "empty");
+ prt_newline(out);
+
+ prt_str(out, "open buckets allocated");
+ prt_tab(out);
+ prt_u64(out, OPEN_BUCKETS_COUNT - c->open_buckets_nr_free);
+ prt_newline(out);
+
+ prt_str(out, "open buckets this dev");
+ prt_tab(out);
+ prt_u64(out, ca->nr_open_buckets);
+ prt_newline(out);
+
+ prt_str(out, "open buckets total");
+ prt_tab(out);
+ prt_u64(out, OPEN_BUCKETS_COUNT);
+ prt_newline(out);
+
+ prt_str(out, "open_buckets_wait");
+ prt_tab(out);
+ prt_str(out, c->open_buckets_wait.list.first ? "waiting" : "empty");
+ prt_newline(out);
+
+ prt_str(out, "open_buckets_btree");
+ prt_tab(out);
+ prt_u64(out, nr[BCH_DATA_btree]);
+ prt_newline(out);
+
+ prt_str(out, "open_buckets_user");
+ prt_tab(out);
+ prt_u64(out, nr[BCH_DATA_user]);
+ prt_newline(out);
+
+ prt_str(out, "buckets_to_invalidate");
+ prt_tab(out);
+ prt_u64(out, should_invalidate_buckets(ca, stats));
+ prt_newline(out);
+
+ prt_str(out, "btree reserve cache");
+ prt_tab(out);
+ prt_u64(out, c->btree_reserve_cache_nr);
+ prt_newline(out);
+}
+
+static const char * const bch2_rw[] = {
+ "read",
+ "write",
+ NULL
+};
+
+static void dev_iodone_to_text(struct printbuf *out, struct bch_dev *ca)
+{
+ int rw, i;
+
+ for (rw = 0; rw < 2; rw++) {
+ prt_printf(out, "%s:\n", bch2_rw[rw]);
+
+ for (i = 1; i < BCH_DATA_NR; i++)
+ prt_printf(out, "%-12s:%12llu\n",
+ bch2_data_types[i],
+ percpu_u64_get(&ca->io_done->sectors[rw][i]) << 9);
+ }
+}
+
+SHOW(bch2_dev)
+{
+ struct bch_dev *ca = container_of(kobj, struct bch_dev, kobj);
+ struct bch_fs *c = ca->fs;
+
+ sysfs_printf(uuid, "%pU\n", ca->uuid.b);
+
+ sysfs_print(bucket_size, bucket_bytes(ca));
+ sysfs_print(first_bucket, ca->mi.first_bucket);
+ sysfs_print(nbuckets, ca->mi.nbuckets);
+ sysfs_print(durability, ca->mi.durability);
+ sysfs_print(discard, ca->mi.discard);
+
+ if (attr == &sysfs_label) {
+ if (ca->mi.group) {
+ mutex_lock(&c->sb_lock);
+ bch2_disk_path_to_text(out, c->disk_sb.sb,
+ ca->mi.group - 1);
+ mutex_unlock(&c->sb_lock);
+ }
+
+ prt_char(out, '\n');
+ }
+
+ if (attr == &sysfs_has_data) {
+ prt_bitflags(out, bch2_data_types, bch2_dev_has_data(c, ca));
+ prt_char(out, '\n');
+ }
+
+ if (attr == &sysfs_state_rw) {
+ prt_string_option(out, bch2_member_states, ca->mi.state);
+ prt_char(out, '\n');
+ }
+
+ if (attr == &sysfs_iodone)
+ dev_iodone_to_text(out, ca);
+
+ sysfs_print(io_latency_read, atomic64_read(&ca->cur_latency[READ]));
+ sysfs_print(io_latency_write, atomic64_read(&ca->cur_latency[WRITE]));
+
+ if (attr == &sysfs_io_latency_stats_read)
+ bch2_time_stats_to_text(out, &ca->io_latency[READ]);
+
+ if (attr == &sysfs_io_latency_stats_write)
+ bch2_time_stats_to_text(out, &ca->io_latency[WRITE]);
+
+ sysfs_printf(congested, "%u%%",
+ clamp(atomic_read(&ca->congested), 0, CONGESTED_MAX)
+ * 100 / CONGESTED_MAX);
+
+ if (attr == &sysfs_alloc_debug)
+ dev_alloc_debug_to_text(out, ca);
+
+ return 0;
+}
+
+STORE(bch2_dev)
+{
+ struct bch_dev *ca = container_of(kobj, struct bch_dev, kobj);
+ struct bch_fs *c = ca->fs;
+ struct bch_member *mi;
+
+ if (attr == &sysfs_discard) {
+ bool v = strtoul_or_return(buf);
+
+ mutex_lock(&c->sb_lock);
+ mi = bch2_members_v2_get_mut(c->disk_sb.sb, ca->dev_idx);
+
+ if (v != BCH_MEMBER_DISCARD(mi)) {
+ SET_BCH_MEMBER_DISCARD(mi, v);
+ bch2_write_super(c);
+ }
+ mutex_unlock(&c->sb_lock);
+ }
+
+ if (attr == &sysfs_durability) {
+ u64 v = strtoul_or_return(buf);
+
+ mutex_lock(&c->sb_lock);
+ mi = bch2_members_v2_get_mut(c->disk_sb.sb, ca->dev_idx);
+
+ if (v + 1 != BCH_MEMBER_DURABILITY(mi)) {
+ SET_BCH_MEMBER_DURABILITY(mi, v + 1);
+ bch2_write_super(c);
+ }
+ mutex_unlock(&c->sb_lock);
+ }
+
+ if (attr == &sysfs_label) {
+ char *tmp;
+ int ret;
+
+ tmp = kstrdup(buf, GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ ret = bch2_dev_group_set(c, ca, strim(tmp));
+ kfree(tmp);
+ if (ret)
+ return ret;
+ }
+
+ return size;
+}
+SYSFS_OPS(bch2_dev);
+
+struct attribute *bch2_dev_files[] = {
+ &sysfs_uuid,
+ &sysfs_bucket_size,
+ &sysfs_first_bucket,
+ &sysfs_nbuckets,
+ &sysfs_durability,
+
+ /* settings: */
+ &sysfs_discard,
+ &sysfs_state_rw,
+ &sysfs_label,
+
+ &sysfs_has_data,
+ &sysfs_iodone,
+
+ &sysfs_io_latency_read,
+ &sysfs_io_latency_write,
+ &sysfs_io_latency_stats_read,
+ &sysfs_io_latency_stats_write,
+ &sysfs_congested,
+
+ /* debug: */
+ &sysfs_alloc_debug,
+ NULL
+};
+
+#endif /* _BCACHEFS_SYSFS_H_ */
#include <linux/ratelimit.h>
#include <linux/crc32c.h>
#include <linux/btrfs.h>
+#include <linux/security.h>
#include "messages.h"
#include "delayed-inode.h"
#include "ctree.h"
Opt_inode_cache, Opt_noinode_cache,
/* Debugging options */
- Opt_check_integrity,
- Opt_check_integrity_including_extent_data,
- Opt_check_integrity_print_mask,
Opt_enospc_debug, Opt_noenospc_debug,
#ifdef CONFIG_BTRFS_DEBUG
Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all,
{Opt_recovery, "recovery"},
/* Debugging options */
- {Opt_check_integrity, "check_int"},
- {Opt_check_integrity_including_extent_data, "check_int_data"},
- {Opt_check_integrity_print_mask, "check_int_print_mask=%u"},
{Opt_enospc_debug, "enospc_debug"},
{Opt_noenospc_debug, "noenospc_debug"},
#ifdef CONFIG_BTRFS_DEBUG
case Opt_skip_balance:
btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
break;
-#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
- case Opt_check_integrity_including_extent_data:
- btrfs_warn(info,
- "integrity checker is deprecated and will be removed in 6.7");
- btrfs_info(info,
- "enabling check integrity including extent data");
- btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY_DATA);
- btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY);
- break;
- case Opt_check_integrity:
- btrfs_warn(info,
- "integrity checker is deprecated and will be removed in 6.7");
- btrfs_info(info, "enabling check integrity");
- btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY);
- break;
- case Opt_check_integrity_print_mask:
- ret = match_int(&args[0], &intarg);
- if (ret) {
- btrfs_err(info,
- "unrecognized check_integrity_print_mask value %s",
- args[0].from);
- goto out;
- }
- info->check_integrity_print_mask = intarg;
- btrfs_warn(info,
- "integrity checker is deprecated and will be removed in 6.7");
- btrfs_info(info, "check_integrity_print_mask 0x%x",
- info->check_integrity_print_mask);
- break;
-#else
- case Opt_check_integrity_including_extent_data:
- case Opt_check_integrity:
- case Opt_check_integrity_print_mask:
- btrfs_err(info,
- "support for check_integrity* not compiled in!");
- ret = -EINVAL;
- goto out;
-#endif
case Opt_fatal_errors:
if (strcmp(args[0].from, "panic") == 0) {
btrfs_set_opt(info->mount_opt,
error = -ENOMEM;
goto out;
}
- device = btrfs_scan_one_device(device_name, flags);
+ device = btrfs_scan_one_device(device_name, flags, false);
kfree(device_name);
if (IS_ERR(device)) {
error = PTR_ERR(device);
seq_puts(seq, ",autodefrag");
if (btrfs_test_opt(info, SKIP_BALANCE))
seq_puts(seq, ",skip_balance");
-#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
- if (btrfs_test_opt(info, CHECK_INTEGRITY_DATA))
- seq_puts(seq, ",check_int_data");
- else if (btrfs_test_opt(info, CHECK_INTEGRITY))
- seq_puts(seq, ",check_int");
- if (info->check_integrity_print_mask)
- seq_printf(seq, ",check_int_print_mask=%d",
- info->check_integrity_print_mask);
-#endif
if (info->metadata_ratio)
seq_printf(seq, ",metadata_ratio=%u", info->metadata_ratio);
if (btrfs_test_opt(info, PANIC_ON_FATAL_ERROR))
goto error_fs_info;
}
- device = btrfs_scan_one_device(device_name, mode);
+ /*
+ * With 'true' passed to btrfs_scan_one_device() (mount time) we expect
+ * either a valid device or an error.
+ */
+ device = btrfs_scan_one_device(device_name, mode, true);
+ ASSERT(device != NULL);
if (IS_ERR(device)) {
mutex_unlock(&uuid_mutex);
error = PTR_ERR(device);
error = -EBUSY;
} else {
snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
- shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s", fs_type->name,
+ shrinker_debugfs_rename(s->s_shrink, "sb-%s:%s", fs_type->name,
s->s_id);
btrfs_sb(s)->bdev_holder = fs_type;
error = btrfs_fill_super(s, fs_devices, data);
switch (cmd) {
case BTRFS_IOC_SCAN_DEV:
mutex_lock(&uuid_mutex);
- device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ);
+ /*
+ * Scanning outside of mount can return NULL which would turn
+ * into 0 error code.
+ */
+ device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ, false);
ret = PTR_ERR_OR_ZERO(device);
mutex_unlock(&uuid_mutex);
break;
break;
case BTRFS_IOC_DEVICES_READY:
mutex_lock(&uuid_mutex);
- device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ);
- if (IS_ERR(device)) {
+ /*
+ * Scanning outside of mount can return NULL which would turn
+ * into 0 error code.
+ */
+ device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ, false);
+ if (IS_ERR_OR_NULL(device)) {
mutex_unlock(&uuid_mutex);
ret = PTR_ERR(device);
break;
{
struct btrfs_fs_info *fs_info = dev->fs_info;
struct btrfs_super_block *sb;
+ u64 last_trans;
u16 csum_type;
int ret = 0;
if (ret < 0)
goto out;
- if (btrfs_super_generation(sb) != fs_info->last_trans_committed) {
+ last_trans = btrfs_get_last_trans_committed(fs_info);
+ if (btrfs_super_generation(sb) != last_trans) {
btrfs_err(fs_info, "transid mismatch, has %llu expect %llu",
- btrfs_super_generation(sb),
- fs_info->last_trans_committed);
+ btrfs_super_generation(sb), last_trans);
ret = -EUCLEAN;
goto out;
}
#ifdef CONFIG_BTRFS_ASSERT
", assert=on"
#endif
-#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
- ", integrity-checker=on"
-#endif
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
", ref-verify=on"
#endif
struct erofs_sb_info *const sbi = EROFS_SB(sb);
struct erofs_workgroup *pre;
- /*
- * Bump up before making this visible to others for the XArray in order
- * to avoid potential UAF without serialized by xa_lock.
- */
- lockref_get(&grp->lockref);
-
+ DBG_BUGON(grp->lockref.count < 1);
repeat:
xa_lock(&sbi->managed_pslots);
pre = __xa_cmpxchg(&sbi->managed_pslots, grp->index,
cond_resched();
goto repeat;
}
- lockref_put_return(&grp->lockref);
grp = pre;
}
xa_unlock(&sbi->managed_pslots);
return freed;
}
- static struct shrinker erofs_shrinker_info = {
- .scan_objects = erofs_shrink_scan,
- .count_objects = erofs_shrink_count,
- .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *erofs_shrinker_info;
int __init erofs_init_shrinker(void)
{
- return register_shrinker(&erofs_shrinker_info, "erofs-shrinker");
+ erofs_shrinker_info = shrinker_alloc(0, "erofs-shrinker");
+ if (!erofs_shrinker_info)
+ return -ENOMEM;
+
+ erofs_shrinker_info->count_objects = erofs_shrink_count;
+ erofs_shrinker_info->scan_objects = erofs_shrink_scan;
+
+ shrinker_register(erofs_shrinker_info);
+
+ return 0;
}
void erofs_exit_shrinker(void)
{
- unregister_shrinker(&erofs_shrinker_info);
+ shrinker_free(erofs_shrinker_info);
}
#endif /* !CONFIG_EROFS_FS_ZIP */
(raw_inode)->xtime = cpu_to_le32(clamp_t(int32_t, (ts).tv_sec, S32_MIN, S32_MAX)); \
} while (0)
-#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \
- EXT4_INODE_SET_XTIME_VAL(xtime, inode, raw_inode, (inode)->xtime)
+#define EXT4_INODE_SET_ATIME(inode, raw_inode) \
+ EXT4_INODE_SET_XTIME_VAL(i_atime, inode, raw_inode, inode_get_atime(inode))
-#define EXT4_INODE_SET_CTIME(inode, raw_inode) \
+#define EXT4_INODE_SET_MTIME(inode, raw_inode) \
+ EXT4_INODE_SET_XTIME_VAL(i_mtime, inode, raw_inode, inode_get_mtime(inode))
+
+#define EXT4_INODE_SET_CTIME(inode, raw_inode) \
EXT4_INODE_SET_XTIME_VAL(i_ctime, inode, raw_inode, inode_get_ctime(inode))
#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode) \
.tv_sec = (signed)le32_to_cpu((raw_inode)->xtime) \
})
-#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \
+#define EXT4_INODE_GET_ATIME(inode, raw_inode) \
+do { \
+ inode_set_atime_to_ts(inode, \
+ EXT4_INODE_GET_XTIME_VAL(i_atime, inode, raw_inode)); \
+} while (0)
+
+#define EXT4_INODE_GET_MTIME(inode, raw_inode) \
do { \
- (inode)->xtime = EXT4_INODE_GET_XTIME_VAL(xtime, inode, raw_inode); \
+ inode_set_mtime_to_ts(inode, \
+ EXT4_INODE_GET_XTIME_VAL(i_mtime, inode, raw_inode)); \
} while (0)
#define EXT4_INODE_GET_CTIME(inode, raw_inode) \
loff_t s_bitmap_maxbytes; /* max bytes for bitmap files */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext4_super_block *s_es; /* Pointer to the super block in the buffer */
+ /* Array of bh's for the block group descriptors */
struct buffer_head * __rcu *s_group_desc;
unsigned int s_mount_opt;
unsigned int s_mount_opt2;
unsigned long s_commit_interval;
u32 s_max_batch_time;
u32 s_min_batch_time;
- struct block_device *s_journal_bdev;
+ struct bdev_handle *s_journal_bdev_handle;
#ifdef CONFIG_QUOTA
/* Names of quota files with journalled quota */
char __rcu *s_qf_names[EXT4_MAXQUOTAS];
unsigned int *s_mb_maxs;
unsigned int s_group_info_size;
unsigned int s_mb_free_pending;
- struct list_head s_freed_data_list; /* List of blocks to be freed
+ struct list_head s_freed_data_list[2]; /* List of blocks to be freed
after commit completed */
struct list_head s_discard_list;
struct work_struct s_discard_work;
__u32 s_csum_seed;
/* Reclaim extents from extent status tree */
- struct shrinker s_es_shrinker;
+ struct shrinker *s_es_shrinker;
struct list_head s_es_list; /* List of inodes with reclaimable extents */
long s_es_nr_inode;
struct ext4_es_stats s_es_stats;
/*
* Barrier between writepages ops and changing any inode's JOURNAL_DATA
- * or EXTENTS flag.
+ * or EXTENTS flag or between writepages ops and changing DELALLOC or
+ * DIOREAD_NOLOCK mount options on remount.
*/
struct percpu_rw_semaphore s_writepages_rwsem;
struct dax_device *s_daxdev;
extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
- int len, int state);
+ int len, bool state);
static inline bool ext4_mb_cr_expensive(enum criteria cr)
{
return cr >= CR_GOAL_LEN_SLOW;
static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei);
-static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len);
+static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len,
+ struct pending_reservation **prealloc);
int __init ext4_init_es(void)
{
spin_unlock(&sbi->s_es_lock);
}
+static inline struct pending_reservation *__alloc_pending(bool nofail)
+{
+ if (!nofail)
+ return kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
+
+ return kmem_cache_zalloc(ext4_pending_cachep, GFP_KERNEL | __GFP_NOFAIL);
+}
+
+static inline void __free_pending(struct pending_reservation *pr)
+{
+ kmem_cache_free(ext4_pending_cachep, pr);
+}
+
/*
* Returns true if we cannot fail to allocate memory for this extent_status
* entry and cannot reclaim it until its status changes.
{
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
- int err1 = 0;
- int err2 = 0;
+ int err1 = 0, err2 = 0, err3 = 0;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct extent_status *es1 = NULL;
struct extent_status *es2 = NULL;
+ struct pending_reservation *pr = NULL;
+ bool revise_pending = false;
if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;
ext4_es_insert_extent_check(inode, &newes);
+ revise_pending = sbi->s_cluster_ratio > 1 &&
+ test_opt(inode->i_sb, DELALLOC) &&
+ (status & (EXTENT_STATUS_WRITTEN |
+ EXTENT_STATUS_UNWRITTEN));
retry:
if (err1 && !es1)
es1 = __es_alloc_extent(true);
if ((err1 || err2) && !es2)
es2 = __es_alloc_extent(true);
+ if ((err1 || err2 || err3) && revise_pending && !pr)
+ pr = __alloc_pending(true);
write_lock(&EXT4_I(inode)->i_es_lock);
err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
es2 = NULL;
}
- if (sbi->s_cluster_ratio > 1 && test_opt(inode->i_sb, DELALLOC) &&
- (status & EXTENT_STATUS_WRITTEN ||
- status & EXTENT_STATUS_UNWRITTEN))
- __revise_pending(inode, lblk, len);
+ if (revise_pending) {
+ err3 = __revise_pending(inode, lblk, len, &pr);
+ if (err3 != 0)
+ goto error;
+ if (pr) {
+ __free_pending(pr);
+ pr = NULL;
+ }
+ }
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
- if (err1 || err2)
+ if (err1 || err2 || err3)
goto retry;
ext4_es_print_tree(inode);
rc->ndelonly--;
node = rb_next(&pr->rb_node);
rb_erase(&pr->rb_node, &tree->root);
- kmem_cache_free(ext4_pending_cachep, pr);
+ __free_pending(pr);
if (!node)
break;
pr = rb_entry(node, struct pending_reservation,
}
}
if (count_reserved)
- count_rsvd(inode, lblk, orig_es.es_len - len1 - len2,
- &orig_es, &rc);
+ count_rsvd(inode, orig_es.es_lblk + len1,
+ orig_es.es_len - len1 - len2, &orig_es, &rc);
goto out_get_reserved;
}
unsigned long nr;
struct ext4_sb_info *sbi;
- sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
+ sbi = shrink->private_data;
nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
static unsigned long ext4_es_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
- struct ext4_sb_info *sbi = container_of(shrink,
- struct ext4_sb_info, s_es_shrinker);
+ struct ext4_sb_info *sbi = shrink->private_data;
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk;
if (err)
goto err3;
- sbi->s_es_shrinker.scan_objects = ext4_es_scan;
- sbi->s_es_shrinker.count_objects = ext4_es_count;
- sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
- err = register_shrinker(&sbi->s_es_shrinker, "ext4-es:%s",
- sbi->s_sb->s_id);
- if (err)
+ sbi->s_es_shrinker = shrinker_alloc(0, "ext4-es:%s", sbi->s_sb->s_id);
+ if (!sbi->s_es_shrinker) {
+ err = -ENOMEM;
goto err4;
+ }
+
+ sbi->s_es_shrinker->scan_objects = ext4_es_scan;
+ sbi->s_es_shrinker->count_objects = ext4_es_count;
+ sbi->s_es_shrinker->private_data = sbi;
+
+ shrinker_register(sbi->s_es_shrinker);
return 0;
err4:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
- unregister_shrinker(&sbi->s_es_shrinker);
+ shrinker_free(sbi->s_es_shrinker);
}
/*
*
* @inode - file containing the cluster
* @lblk - logical block in the cluster to be added
+ * @prealloc - preallocated pending entry
*
* Returns 0 on successful insertion and -ENOMEM on failure. If the
* pending reservation is already in the set, returns successfully.
*/
-static int __insert_pending(struct inode *inode, ext4_lblk_t lblk)
+static int __insert_pending(struct inode *inode, ext4_lblk_t lblk,
+ struct pending_reservation **prealloc)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree;
}
}
- pr = kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC);
- if (pr == NULL) {
- ret = -ENOMEM;
- goto out;
+ if (likely(*prealloc == NULL)) {
+ pr = __alloc_pending(false);
+ if (!pr) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ } else {
+ pr = *prealloc;
+ *prealloc = NULL;
}
pr->lclu = lclu;
if (pr != NULL) {
tree = &EXT4_I(inode)->i_pending_tree;
rb_erase(&pr->rb_node, &tree->root);
- kmem_cache_free(ext4_pending_cachep, pr);
+ __free_pending(pr);
}
}
bool allocated)
{
struct extent_status newes;
- int err1 = 0;
- int err2 = 0;
+ int err1 = 0, err2 = 0, err3 = 0;
struct extent_status *es1 = NULL;
struct extent_status *es2 = NULL;
+ struct pending_reservation *pr = NULL;
if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;
es1 = __es_alloc_extent(true);
if ((err1 || err2) && !es2)
es2 = __es_alloc_extent(true);
+ if ((err1 || err2 || err3) && allocated && !pr)
+ pr = __alloc_pending(true);
write_lock(&EXT4_I(inode)->i_es_lock);
err1 = __es_remove_extent(inode, lblk, lblk, NULL, es1);
es2 = NULL;
}
- if (allocated)
- __insert_pending(inode, lblk);
+ if (allocated) {
+ err3 = __insert_pending(inode, lblk, &pr);
+ if (err3 != 0)
+ goto error;
+ if (pr) {
+ __free_pending(pr);
+ pr = NULL;
+ }
+ }
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
- if (err1 || err2)
+ if (err1 || err2 || err3)
goto retry;
ext4_es_print_tree(inode);
* @inode - file containing the range
* @lblk - logical block defining the start of range
* @len - length of range in blocks
+ * @prealloc - preallocated pending entry
*
* Used after a newly allocated extent is added to the extents status tree.
* Requires that the extents in the range have either written or unwritten
* status. Must be called while holding i_es_lock.
*/
-static void __revise_pending(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len)
+static int __revise_pending(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len,
+ struct pending_reservation **prealloc)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
ext4_lblk_t end = lblk + len - 1;
ext4_lblk_t first, last;
bool f_del = false, l_del = false;
+ int ret = 0;
if (len == 0)
- return;
+ return 0;
/*
* Two cases - block range within single cluster and block range
f_del = __es_scan_range(inode, &ext4_es_is_delonly,
first, lblk - 1);
if (f_del) {
- __insert_pending(inode, first);
+ ret = __insert_pending(inode, first, prealloc);
+ if (ret < 0)
+ goto out;
} else {
last = EXT4_LBLK_CMASK(sbi, end) +
sbi->s_cluster_ratio - 1;
l_del = __es_scan_range(inode,
&ext4_es_is_delonly,
end + 1, last);
- if (l_del)
- __insert_pending(inode, last);
- else
+ if (l_del) {
+ ret = __insert_pending(inode, last, prealloc);
+ if (ret < 0)
+ goto out;
+ } else
__remove_pending(inode, last);
}
} else {
if (first != lblk)
f_del = __es_scan_range(inode, &ext4_es_is_delonly,
first, lblk - 1);
- if (f_del)
- __insert_pending(inode, first);
- else
+ if (f_del) {
+ ret = __insert_pending(inode, first, prealloc);
+ if (ret < 0)
+ goto out;
+ } else
__remove_pending(inode, first);
last = EXT4_LBLK_CMASK(sbi, end) + sbi->s_cluster_ratio - 1;
if (last != end)
l_del = __es_scan_range(inode, &ext4_es_is_delonly,
end + 1, last);
- if (l_del)
- __insert_pending(inode, last);
- else
+ if (l_del) {
+ ret = __insert_pending(inode, last, prealloc);
+ if (ret < 0)
+ goto out;
+ } else
__remove_pending(inode, last);
}
+out:
+ return ret;
}
int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
+ int ret = 0;
+
ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
+ ret = _ext4_get_block(inode, iblock, bh_result,
EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+
+ /*
+ * If the buffer is marked unwritten, mark it as new to make sure it is
+ * zeroed out correctly in case of partial writes. Otherwise, there is
+ * a chance of stale data getting exposed.
+ */
+ if (ret == 0 && buffer_unwritten(bh_result))
+ set_buffer_new(bh_result);
+
+ return ret;
}
/* Maximum number of blocks we map for direct IO at once. */
BUG_ON(from > to);
head = folio_buffers(folio);
- if (!head) {
- create_empty_buffers(&folio->page, blocksize, 0);
- head = folio_buffers(folio);
- }
+ if (!head)
+ head = create_empty_buffers(folio, blocksize, 0);
bbits = ilog2(blocksize);
block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
* starting the handle.
*/
if (!folio_buffers(folio))
- create_empty_buffers(&folio->page, inode->i_sb->s_blocksize, 0);
+ create_empty_buffers(folio, inode->i_sb->s_blocksize, 0);
folio_unlock(folio);
iblock = index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
bh = folio_buffers(folio);
- if (!bh) {
- create_empty_buffers(&folio->page, blocksize, 0);
- bh = folio_buffers(folio);
- }
+ if (!bh)
+ bh = create_empty_buffers(folio, blocksize, 0);
/* Find the buffer that contains "offset" */
pos = blocksize;
if (IS_SYNC(inode))
ext4_handle_sync(handle);
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
ret2 = ext4_mark_inode_dirty(handle, inode);
if (unlikely(ret2))
ret = ret2;
if (inode->i_nlink)
ext4_orphan_del(handle, inode);
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
err2 = ext4_mark_inode_dirty(handle, inode);
if (unlikely(err2 && !err))
err = err2;
raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
EXT4_INODE_SET_CTIME(inode, raw_inode);
- EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
- EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+ EXT4_INODE_SET_MTIME(inode, raw_inode);
+ EXT4_INODE_SET_ATIME(inode, raw_inode);
EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
}
EXT4_INODE_GET_CTIME(inode, raw_inode);
- EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode);
- EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
+ EXT4_INODE_GET_ATIME(inode, raw_inode);
+ EXT4_INODE_GET_MTIME(inode, raw_inode);
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
spin_lock(&ei->i_raw_lock);
EXT4_INODE_SET_CTIME(inode, raw_inode);
- EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
- EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+ EXT4_INODE_SET_MTIME(inode, raw_inode);
+ EXT4_INODE_SET_ATIME(inode, raw_inode);
ext4_inode_csum_set(inode, raw_inode, ei);
spin_unlock(&ei->i_raw_lock);
trace_ext4_other_inode_update_time(inode, orig_ino);
* update c/mtime in shrink case below
*/
if (!shrink)
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode,
+ inode_set_ctime_current(inode));
if (shrink)
ext4_fc_track_range(handle, inode,
struct buffer_head *ext4_sb_bread(struct super_block *sb, sector_t block,
blk_opf_t op_flags)
{
- return __ext4_sb_bread_gfp(sb, block, op_flags, __GFP_MOVABLE);
+ gfp_t gfp = mapping_gfp_constraint(sb->s_bdev->bd_inode->i_mapping,
+ ~__GFP_FS) | __GFP_MOVABLE;
+
+ return __ext4_sb_bread_gfp(sb, block, op_flags, gfp);
}
struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
sector_t block)
{
- return __ext4_sb_bread_gfp(sb, block, 0, 0);
+ gfp_t gfp = mapping_gfp_constraint(sb->s_bdev->bd_inode->i_mapping,
+ ~__GFP_FS);
+
+ return __ext4_sb_bread_gfp(sb, block, 0, gfp);
}
void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
{
- struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
+ struct buffer_head *bh = bdev_getblk(sb->s_bdev, block,
+ sb->s_blocksize, GFP_NOWAIT | __GFP_NOWARN);
if (likely(bh)) {
if (trylock_buffer(bh))
*/
if (!sb_rdonly(sbi->s_sb) && journal) {
struct buffer_head *sbh = sbi->s_sbh;
- bool call_notify_err;
+ bool call_notify_err = false;
+
handle = jbd2_journal_start(journal, 1);
if (IS_ERR(handle))
goto write_directly;
sync_blockdev(sb->s_bdev);
invalidate_bdev(sb->s_bdev);
- if (sbi->s_journal_bdev) {
+ if (sbi->s_journal_bdev_handle) {
/*
* Invalidate the journal device's buffers. We don't want them
* floating about in memory - the physical journal device may
* hotswapped, and it breaks the `ro-after' testing code.
*/
- sync_blockdev(sbi->s_journal_bdev);
- invalidate_bdev(sbi->s_journal_bdev);
+ sync_blockdev(sbi->s_journal_bdev_handle->bdev);
+ invalidate_bdev(sbi->s_journal_bdev_handle->bdev);
}
ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
* Add the internal journal blocks whether the journal has been
* loaded or not
*/
- if (sbi->s_journal && !sbi->s_journal_bdev)
+ if (sbi->s_journal && !sbi->s_journal_bdev_handle)
overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_total_len);
else if (ext4_has_feature_journal(sb) && !sbi->s_journal && j_inum) {
/* j_inum for internal journal is non-zero */
#endif
fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
brelse(sbi->s_sbh);
- if (sbi->s_journal_bdev) {
- invalidate_bdev(sbi->s_journal_bdev);
- blkdev_put(sbi->s_journal_bdev, sb);
+ if (sbi->s_journal_bdev_handle) {
+ invalidate_bdev(sbi->s_journal_bdev_handle->bdev);
+ bdev_release(sbi->s_journal_bdev_handle);
}
out_fail:
invalidate_bdev(sb->s_bdev);
return journal;
}
-static struct block_device *ext4_get_journal_blkdev(struct super_block *sb,
+static struct bdev_handle *ext4_get_journal_blkdev(struct super_block *sb,
dev_t j_dev, ext4_fsblk_t *j_start,
ext4_fsblk_t *j_len)
{
struct buffer_head *bh;
struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
int hblock, blocksize;
ext4_fsblk_t sb_block;
unsigned long offset;
/* see get_tree_bdev why this is needed and safe */
up_write(&sb->s_umount);
- bdev = blkdev_get_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE, sb,
- &fs_holder_ops);
+ bdev_handle = bdev_open_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
+ sb, &fs_holder_ops);
down_write(&sb->s_umount);
- if (IS_ERR(bdev)) {
+ if (IS_ERR(bdev_handle)) {
ext4_msg(sb, KERN_ERR,
"failed to open journal device unknown-block(%u,%u) %ld",
- MAJOR(j_dev), MINOR(j_dev), PTR_ERR(bdev));
- return ERR_CAST(bdev);
+ MAJOR(j_dev), MINOR(j_dev), PTR_ERR(bdev_handle));
+ return bdev_handle;
}
+ bdev = bdev_handle->bdev;
blocksize = sb->s_blocksize;
hblock = bdev_logical_block_size(bdev);
if (blocksize < hblock) {
*j_start = sb_block + 1;
*j_len = ext4_blocks_count(es);
brelse(bh);
- return bdev;
+ return bdev_handle;
out_bh:
brelse(bh);
out_bdev:
- blkdev_put(bdev, sb);
+ bdev_release(bdev_handle);
return ERR_PTR(errno);
}
journal_t *journal;
ext4_fsblk_t j_start;
ext4_fsblk_t j_len;
- struct block_device *journal_bdev;
+ struct bdev_handle *bdev_handle;
int errno = 0;
- journal_bdev = ext4_get_journal_blkdev(sb, j_dev, &j_start, &j_len);
- if (IS_ERR(journal_bdev))
- return ERR_CAST(journal_bdev);
+ bdev_handle = ext4_get_journal_blkdev(sb, j_dev, &j_start, &j_len);
+ if (IS_ERR(bdev_handle))
+ return ERR_CAST(bdev_handle);
- journal = jbd2_journal_init_dev(journal_bdev, sb->s_bdev, j_start,
+ journal = jbd2_journal_init_dev(bdev_handle->bdev, sb->s_bdev, j_start,
j_len, sb->s_blocksize);
if (IS_ERR(journal)) {
ext4_msg(sb, KERN_ERR, "failed to create device journal");
goto out_journal;
}
journal->j_private = sb;
- EXT4_SB(sb)->s_journal_bdev = journal_bdev;
+ EXT4_SB(sb)->s_journal_bdev_handle = bdev_handle;
ext4_init_journal_params(sb, journal);
return journal;
out_journal:
jbd2_journal_destroy(journal);
out_bdev:
- blkdev_put(journal_bdev, sb);
+ bdev_release(bdev_handle);
return ERR_PTR(errno);
}
struct ext4_mount_options old_opts;
ext4_group_t g;
int err = 0;
+ int alloc_ctx;
#ifdef CONFIG_QUOTA
int enable_quota = 0;
int i, j;
}
+ /*
+ * Changing the DIOREAD_NOLOCK or DELALLOC mount options may cause
+ * two calls to ext4_should_dioread_nolock() to return inconsistent
+ * values, triggering WARN_ON in ext4_add_complete_io(). we grab
+ * here s_writepages_rwsem to avoid race between writepages ops and
+ * remount.
+ */
+ alloc_ctx = ext4_writepages_down_write(sb);
ext4_apply_options(fc, sb);
+ ext4_writepages_up_write(sb, alloc_ctx);
if ((old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) ^
test_opt(sb, JOURNAL_CHECKSUM)) {
if (sb_rdonly(sb) && !(old_sb_flags & SB_RDONLY) &&
sb_any_quota_suspended(sb))
dquot_resume(sb, -1);
+
+ alloc_ctx = ext4_writepages_down_write(sb);
sb->s_flags = old_sb_flags;
sbi->s_mount_opt = old_opts.s_mount_opt;
sbi->s_mount_opt2 = old_opts.s_mount_opt2;
sbi->s_commit_interval = old_opts.s_commit_interval;
sbi->s_min_batch_time = old_opts.s_min_batch_time;
sbi->s_max_batch_time = old_opts.s_max_batch_time;
+ ext4_writepages_up_write(sb, alloc_ctx);
+
if (!test_opt(sb, BLOCK_VALIDITY) && sbi->s_system_blks)
ext4_release_system_zone(sb);
#ifdef CONFIG_QUOTA
}
EXT4_I(inode)->i_flags &= ~(EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL);
inode_set_flags(inode, 0, S_NOATIME | S_IMMUTABLE);
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
err = ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
out_unlock:
static void ext4_kill_sb(struct super_block *sb)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct block_device *journal_bdev = sbi ? sbi->s_journal_bdev : NULL;
+ struct bdev_handle *handle = sbi ? sbi->s_journal_bdev_handle : NULL;
kill_block_super(sb);
- if (journal_bdev)
- blkdev_put(journal_bdev, sb);
+ if (handle)
+ bdev_release(handle);
}
static struct file_system_type ext4_fs_type = {
#endif
/* f2fs-wide shrinker description */
- static struct shrinker f2fs_shrinker_info = {
- .scan_objects = f2fs_shrink_scan,
- .count_objects = f2fs_shrink_count,
- .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *f2fs_shrinker_info;
+
+ static int __init f2fs_init_shrinker(void)
+ {
+ f2fs_shrinker_info = shrinker_alloc(0, "f2fs-shrinker");
+ if (!f2fs_shrinker_info)
+ return -ENOMEM;
+
+ f2fs_shrinker_info->count_objects = f2fs_shrink_count;
+ f2fs_shrinker_info->scan_objects = f2fs_shrink_scan;
+
+ shrinker_register(f2fs_shrinker_info);
+
+ return 0;
+ }
+
+ static void f2fs_exit_shrinker(void)
+ {
+ shrinker_free(f2fs_shrinker_info);
+ }
enum {
Opt_gc_background,
for (i = 0; i < sbi->s_ndevs; i++) {
if (i > 0)
- blkdev_put(FDEV(i).bdev, sbi->sb);
+ bdev_release(FDEV(i).bdev_handle);
#ifdef CONFIG_BLK_DEV_ZONED
kvfree(FDEV(i).blkz_seq);
#endif
if (len == towrite)
return err;
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
f2fs_mark_inode_dirty_sync(inode, false);
return len - towrite;
}
return true;
}
-static void f2fs_get_ino_and_lblk_bits(struct super_block *sb,
- int *ino_bits_ret, int *lblk_bits_ret)
-{
- *ino_bits_ret = 8 * sizeof(nid_t);
- *lblk_bits_ret = 8 * sizeof(block_t);
-}
-
static struct block_device **f2fs_get_devices(struct super_block *sb,
unsigned int *num_devs)
{
}
static const struct fscrypt_operations f2fs_cryptops = {
- .key_prefix = "f2fs:",
+ .needs_bounce_pages = 1,
+ .has_32bit_inodes = 1,
+ .supports_subblock_data_units = 1,
+ .legacy_key_prefix = "f2fs:",
.get_context = f2fs_get_context,
.set_context = f2fs_set_context,
.get_dummy_policy = f2fs_get_dummy_policy,
.empty_dir = f2fs_empty_dir,
.has_stable_inodes = f2fs_has_stable_inodes,
- .get_ino_and_lblk_bits = f2fs_get_ino_and_lblk_bits,
.get_devices = f2fs_get_devices,
};
#endif
for (i = 0; i < max_devices; i++) {
if (i == 0)
- FDEV(0).bdev = sbi->sb->s_bdev;
+ FDEV(0).bdev_handle = sbi->sb->s_bdev_handle;
else if (!RDEV(i).path[0])
break;
FDEV(i).end_blk = FDEV(i).start_blk +
(FDEV(i).total_segments <<
sbi->log_blocks_per_seg) - 1;
- FDEV(i).bdev = blkdev_get_by_path(FDEV(i).path,
- mode, sbi->sb, NULL);
+ FDEV(i).bdev_handle = bdev_open_by_path(
+ FDEV(i).path, mode, sbi->sb, NULL);
}
}
- if (IS_ERR(FDEV(i).bdev))
- return PTR_ERR(FDEV(i).bdev);
+ if (IS_ERR(FDEV(i).bdev_handle))
+ return PTR_ERR(FDEV(i).bdev_handle);
+ FDEV(i).bdev = FDEV(i).bdev_handle->bdev;
/* to release errored devices */
sbi->s_ndevs = i + 1;
err = f2fs_init_sysfs();
if (err)
goto free_garbage_collection_cache;
- err = register_shrinker(&f2fs_shrinker_info, "f2fs-shrinker");
+ err = f2fs_init_shrinker();
if (err)
goto free_sysfs;
err = register_filesystem(&f2fs_fs_type);
f2fs_destroy_root_stats();
unregister_filesystem(&f2fs_fs_type);
free_shrinker:
- unregister_shrinker(&f2fs_shrinker_info);
+ f2fs_exit_shrinker();
free_sysfs:
f2fs_exit_sysfs();
free_garbage_collection_cache:
f2fs_destroy_post_read_processing();
f2fs_destroy_root_stats();
unregister_filesystem(&f2fs_fs_type);
- unregister_shrinker(&f2fs_shrinker_info);
+ f2fs_exit_shrinker();
f2fs_exit_sysfs();
f2fs_destroy_garbage_collection_cache();
f2fs_destroy_extent_cache();
static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length);
/**
- * gfs2_unstuffer_page - unstuff a stuffed inode into a block cached by a page
+ * gfs2_unstuffer_folio - unstuff a stuffed inode into a block cached by a folio
* @ip: the inode
* @dibh: the dinode buffer
* @block: the block number that was allocated
- * @page: The (optional) page. This is looked up if @page is NULL
+ * @folio: The folio.
*
* Returns: errno
*/
-
- static int gfs2_unstuffer_page(struct gfs2_inode *ip, struct buffer_head *dibh,
- u64 block, struct page *page)
+ static int gfs2_unstuffer_folio(struct gfs2_inode *ip, struct buffer_head *dibh,
+ u64 block, struct folio *folio)
{
struct inode *inode = &ip->i_inode;
- if (!PageUptodate(page)) {
- void *kaddr = kmap(page);
+ if (!folio_test_uptodate(folio)) {
+ void *kaddr = kmap_local_folio(folio, 0);
u64 dsize = i_size_read(inode);
memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize);
- memset(kaddr + dsize, 0, PAGE_SIZE - dsize);
- kunmap(page);
+ memset(kaddr + dsize, 0, folio_size(folio) - dsize);
+ kunmap_local(kaddr);
- SetPageUptodate(page);
+ folio_mark_uptodate(folio);
}
if (gfs2_is_jdata(ip)) {
- struct buffer_head *bh;
+ struct buffer_head *bh = folio_buffers(folio);
- if (!page_has_buffers(page))
- create_empty_buffers(page, BIT(inode->i_blkbits),
- BIT(BH_Uptodate));
+ if (!bh)
+ bh = create_empty_buffers(folio,
+ BIT(inode->i_blkbits), BIT(BH_Uptodate));
- bh = page_buffers(page);
if (!buffer_mapped(bh))
map_bh(bh, inode->i_sb, block);
set_buffer_uptodate(bh);
gfs2_trans_add_data(ip->i_gl, bh);
} else {
- set_page_dirty(page);
+ folio_mark_dirty(folio);
gfs2_ordered_add_inode(ip);
}
return 0;
}
- static int __gfs2_unstuff_inode(struct gfs2_inode *ip, struct page *page)
+ static int __gfs2_unstuff_inode(struct gfs2_inode *ip, struct folio *folio)
{
struct buffer_head *bh, *dibh;
struct gfs2_dinode *di;
dibh, sizeof(struct gfs2_dinode));
brelse(bh);
} else {
- error = gfs2_unstuffer_page(ip, dibh, block, page);
+ error = gfs2_unstuffer_folio(ip, dibh, block, folio);
if (error)
goto out_brelse;
}
int gfs2_unstuff_dinode(struct gfs2_inode *ip)
{
struct inode *inode = &ip->i_inode;
- struct page *page;
+ struct folio *folio;
int error;
down_write(&ip->i_rw_mutex);
- page = grab_cache_page(inode->i_mapping, 0);
- error = -ENOMEM;
- if (!page)
+ folio = filemap_grab_folio(inode->i_mapping, 0);
+ error = PTR_ERR(folio);
+ if (IS_ERR(folio))
goto out;
- error = __gfs2_unstuff_inode(ip, page);
- unlock_page(page);
- put_page(page);
+ error = __gfs2_unstuff_inode(ip, folio);
+ folio_unlock(folio);
+ folio_put(folio);
out:
up_write(&ip->i_rw_mutex);
return error;
ip->i_diskflags |= GFS2_DIF_TRUNC_IN_PROG;
i_size_write(inode, newsize);
- ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
+ inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
gfs2_dinode_out(ip, dibh->b_data);
if (journaled)
/* Every transaction boundary, we rewrite the dinode
to keep its di_blocks current in case of failure. */
- ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
+ inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
brelse(dibh);
gfs2_statfs_change(sdp, 0, +btotal, 0);
gfs2_quota_change(ip, -(s64)btotal, ip->i_inode.i_uid,
ip->i_inode.i_gid);
- ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
+ inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
up_write(&ip->i_rw_mutex);
gfs2_buffer_clear_tail(dibh, sizeof(struct gfs2_dinode));
gfs2_ordered_del_inode(ip);
}
- ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
+ inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
ip->i_diskflags &= ~GFS2_DIF_TRUNC_IN_PROG;
gfs2_trans_add_meta(ip->i_gl, dibh);
goto do_end_trans;
truncate_setsize(inode, size);
- ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
+ inode_set_mtime_to_ts(&ip->i_inode, inode_set_ctime_current(&ip->i_inode));
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
brelse(dibh);
return vfs_pressure_ratio(atomic_read(&lru_count));
}
- static struct shrinker glock_shrinker = {
- .seeks = DEFAULT_SEEKS,
- .count_objects = gfs2_glock_shrink_count,
- .scan_objects = gfs2_glock_shrink_scan,
- };
+ static struct shrinker *glock_shrinker;
/**
* glock_hash_walk - Call a function for glock in a hash bucket
return -ENOMEM;
}
- ret = register_shrinker(&glock_shrinker, "gfs2-glock");
- if (ret) {
+ glock_shrinker = shrinker_alloc(0, "gfs2-glock");
+ if (!glock_shrinker) {
destroy_workqueue(glock_workqueue);
rhashtable_destroy(&gl_hash_table);
- return ret;
+ return -ENOMEM;
}
+ glock_shrinker->count_objects = gfs2_glock_shrink_count;
+ glock_shrinker->scan_objects = gfs2_glock_shrink_scan;
+
+ shrinker_register(glock_shrinker);
+
for (i = 0; i < GLOCK_WAIT_TABLE_SIZE; i++)
init_waitqueue_head(glock_wait_table + i);
void gfs2_glock_exit(void)
{
- unregister_shrinker(&glock_shrinker);
+ shrinker_free(glock_shrinker);
rhashtable_destroy(&gl_hash_table);
destroy_workqueue(glock_workqueue);
}
for(;; i->fd++) {
struct inode *inode;
- i->file = task_lookup_next_fd_rcu(i->task, &i->fd);
+ i->file = task_lookup_next_fdget_rcu(i->task, &i->fd);
if (!i->file) {
i->fd = 0;
break;
}
+
inode = file_inode(i->file);
- if (inode->i_sb != i->sb)
- continue;
- if (get_file_rcu(i->file))
+ if (inode->i_sb == i->sb)
break;
+
+ rcu_read_unlock();
+ fput(i->file);
+ rcu_read_lock();
}
rcu_read_unlock();
return i->file;
return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
}
- struct shrinker gfs2_qd_shrinker = {
- .count_objects = gfs2_qd_shrink_count,
- .scan_objects = gfs2_qd_shrink_scan,
- .seeks = DEFAULT_SEEKS,
- .flags = SHRINKER_NUMA_AWARE,
- };
+ static struct shrinker *gfs2_qd_shrinker;
+ int __init gfs2_qd_shrinker_init(void)
+ {
+ gfs2_qd_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "gfs2-qd");
+ if (!gfs2_qd_shrinker)
+ return -ENOMEM;
+
+ gfs2_qd_shrinker->count_objects = gfs2_qd_shrink_count;
+ gfs2_qd_shrinker->scan_objects = gfs2_qd_shrink_scan;
+
+ shrinker_register(gfs2_qd_shrinker);
+
+ return 0;
+ }
+
+ void gfs2_qd_shrinker_exit(void)
+ {
+ shrinker_free(gfs2_qd_shrinker);
+ }
static u64 qd2index(struct gfs2_quota_data *qd)
{
struct gfs2_inode *ip = GFS2_I(sdp->sd_quota_inode);
struct inode *inode = &ip->i_inode;
struct address_space *mapping = inode->i_mapping;
- struct page *page;
+ struct folio *folio;
struct buffer_head *bh;
u64 blk;
unsigned bsize = sdp->sd_sb.sb_bsize, bnum = 0, boff = 0;
blk = index << (PAGE_SHIFT - sdp->sd_sb.sb_bsize_shift);
boff = off % bsize;
- page = grab_cache_page(mapping, index);
- if (!page)
- return -ENOMEM;
- if (!page_has_buffers(page))
- create_empty_buffers(page, bsize, 0);
+ folio = filemap_grab_folio(mapping, index);
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
+ bh = folio_buffers(folio);
+ if (!bh)
+ bh = create_empty_buffers(folio, bsize, 0);
- bh = page_buffers(page);
- for(;;) {
- /* Find the beginning block within the page */
+ for (;;) {
+ /* Find the beginning block within the folio */
if (pg_off >= ((bnum * bsize) + bsize)) {
bh = bh->b_this_page;
bnum++;
goto unlock_out;
/* If it's a newly allocated disk block, zero it */
if (buffer_new(bh))
- zero_user(page, bnum * bsize, bh->b_size);
+ folio_zero_range(folio, bnum * bsize,
+ bh->b_size);
}
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
set_buffer_uptodate(bh);
if (bh_read(bh, REQ_META | REQ_PRIO) < 0)
goto unlock_out;
break;
}
- /* Write to the page, now that we have setup the buffer(s) */
- memcpy_to_page(page, off, buf, bytes);
- flush_dcache_page(page);
- unlock_page(page);
- put_page(page);
+ /* Write to the folio, now that we have setup the buffer(s) */
+ memcpy_to_folio(folio, off, buf, bytes);
+ flush_dcache_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
return 0;
unlock_out:
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
return -EIO;
}
size = loc + sizeof(struct gfs2_quota);
if (size > inode->i_size)
i_size_write(inode, size);
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
mark_inode_dirty(inode);
set_bit(QDF_REFRESH, &qd->qd_flags);
}
{}
};
- #ifdef CONFIG_NUMA
- static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
- struct inode *inode, pgoff_t index)
- {
- vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
- index);
- }
-
- static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
- {
- mpol_cond_put(vma->vm_policy);
- }
- #else
- static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
- struct inode *inode, pgoff_t index)
- {
- }
-
- static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
- {
- }
- #endif
-
/*
* Mask used when checking the page offset value passed in via system
* calls. This value will be converted to a loff_t which is signed.
vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
vma->vm_ops = &hugetlb_vm_ops;
- ret = seal_check_future_write(info->seals, vma);
+ ret = seal_check_write(info->seals, vma);
if (ret)
return ret;
size_t res = 0;
/* First subpage to start the loop. */
- page += offset / PAGE_SIZE;
+ page = nth_page(page, offset / PAGE_SIZE);
offset %= PAGE_SIZE;
while (1) {
if (is_raw_hwpoison_page_in_hugepage(page))
break;
offset += n;
if (offset == PAGE_SIZE) {
- page++;
+ page = nth_page(page, 1);
offset = 0;
}
}
ssize_t retval = 0;
while (iov_iter_count(to)) {
- struct page *page;
+ struct folio *folio;
size_t nr, copied, want;
/* nr is the maximum number of bytes to copy from this page */
}
nr = nr - offset;
- /* Find the page */
- page = find_lock_page(mapping, index);
- if (unlikely(page == NULL)) {
+ /* Find the folio */
+ folio = filemap_lock_hugetlb_folio(h, mapping, index);
+ if (IS_ERR(folio)) {
/*
* We have a HOLE, zero out the user-buffer for the
* length of the hole or request.
*/
copied = iov_iter_zero(nr, to);
} else {
- unlock_page(page);
+ folio_unlock(folio);
- if (!PageHWPoison(page))
+ if (!folio_test_has_hwpoisoned(folio))
want = nr;
else {
/*
* touching the 1st raw HWPOISON subpage after
* offset.
*/
- want = adjust_range_hwpoison(page, offset, nr);
+ want = adjust_range_hwpoison(&folio->page, offset, nr);
if (want == 0) {
- put_page(page);
+ folio_put(folio);
retval = -EIO;
break;
}
}
/*
- * We have the page, copy it to user space buffer.
+ * We have the folio, copy it to user space buffer.
*/
- copied = copy_page_to_iter(page, offset, want, to);
- put_page(page);
+ copied = copy_folio_to_iter(folio, offset, want, to);
+ folio_put(folio);
}
offset += copied;
retval += copied;
{
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
- const pgoff_t start = lstart >> huge_page_shift(h);
- const pgoff_t end = lend >> huge_page_shift(h);
+ const pgoff_t end = lend >> PAGE_SHIFT;
struct folio_batch fbatch;
pgoff_t next, index;
int i, freed = 0;
bool truncate_op = (lend == LLONG_MAX);
folio_batch_init(&fbatch);
- next = start;
+ next = lstart >> PAGE_SHIFT;
while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
for (i = 0; i < folio_batch_count(&fbatch); ++i) {
struct folio *folio = fbatch.folios[i];
u32 hash = 0;
- index = folio->index;
+ index = folio->index >> huge_page_order(h);
hash = hugetlb_fault_mutex_hash(mapping, index);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
}
if (truncate_op)
- (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
+ (void)hugetlb_unreserve_pages(inode,
+ lstart >> huge_page_shift(h),
+ LONG_MAX, freed);
}
static void hugetlbfs_evict_inode(struct inode *inode)
pgoff_t idx = start >> huge_page_shift(h);
struct folio *folio;
- folio = filemap_lock_folio(mapping, idx);
+ folio = filemap_lock_hugetlb_folio(h, mapping, idx);
if (IS_ERR(folio))
return;
/*
* Initialize a pseudo vma as this is required by the huge page
- * allocation routines. If NUMA is configured, use page index
- * as input to create an allocation policy.
+ * allocation routines.
*/
vma_init(&pseudo_vma, mm);
vm_flags_init(&pseudo_vma, VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
/* See if already present in mapping to avoid alloc/free */
- folio = filemap_get_folio(mapping, index);
+ folio = filemap_get_folio(mapping, index << huge_page_order(h));
if (!IS_ERR(folio)) {
folio_put(folio);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
* folios in these areas, we need to consume the reserves
* to keep reservation accounting consistent.
*/
- hugetlb_set_vma_policy(&pseudo_vma, inode, index);
folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
- hugetlb_drop_vma_policy(&pseudo_vma);
if (IS_ERR(folio)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
error = PTR_ERR(folio);
inode->i_mode = S_IFDIR | ctx->mode;
inode->i_uid = ctx->uid;
inode->i_gid = ctx->gid;
- inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
+ simple_inode_init_ts(inode);
inode->i_op = &hugetlbfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
/* directory inodes start off with i_nlink == 2 (for "." entry) */
lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
&hugetlbfs_i_mmap_rwsem_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
- inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
+ simple_inode_init_ts(inode);
inode->i_mapping->private_data = resv_map;
info->seals = F_SEAL_SEAL;
switch (mode & S_IFMT) {
inode = hugetlbfs_get_inode(dir->i_sb, dir, mode, dev);
if (!inode)
return -ENOSPC;
- dir->i_mtime = inode_set_ctime_current(dir);
+ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
d_instantiate(dentry, inode);
dget(dentry);/* Extra count - pin the dentry in core */
return 0;
inode = hugetlbfs_get_inode(dir->i_sb, dir, mode | S_IFREG, 0);
if (!inode)
return -ENOSPC;
- dir->i_mtime = inode_set_ctime_current(dir);
+ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
d_tmpfile(file, inode);
return finish_open_simple(file, 0);
}
} else
iput(inode);
}
- dir->i_mtime = inode_set_ctime_current(dir);
+ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
return error;
}
hugetlbfs_inc_free_inodes(sbinfo);
return NULL;
}
-
- /*
- * Any time after allocation, hugetlbfs_destroy_inode can be called
- * for the inode. mpol_free_shared_policy is unconditionally called
- * as part of hugetlbfs_destroy_inode. So, initialize policy here
- * in case of a quick call to destroy.
- *
- * Note that the policy is initialized even if we are creating a
- * private inode. This simplifies hugetlbfs_destroy_inode.
- */
- mpol_shared_policy_init(&p->policy, NULL);
-
return &p->vfs_inode;
}
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
- mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
}
static const struct address_space_operations hugetlbfs_aops = {
* and I/O completions.
*/
struct iomap_folio_state {
- atomic_t read_bytes_pending;
- atomic_t write_bytes_pending;
spinlock_t state_lock;
+ unsigned int read_bytes_pending;
+ atomic_t write_bytes_pending;
/*
* Each block has two bits in this bitmap:
return test_bit(block, ifs->state);
}
- static void ifs_set_range_uptodate(struct folio *folio,
+ static bool ifs_set_range_uptodate(struct folio *folio,
struct iomap_folio_state *ifs, size_t off, size_t len)
{
struct inode *inode = folio->mapping->host;
unsigned int first_blk = off >> inode->i_blkbits;
unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
unsigned int nr_blks = last_blk - first_blk + 1;
- unsigned long flags;
- spin_lock_irqsave(&ifs->state_lock, flags);
bitmap_set(ifs->state, first_blk, nr_blks);
- if (ifs_is_fully_uptodate(folio, ifs))
- folio_mark_uptodate(folio);
- spin_unlock_irqrestore(&ifs->state_lock, flags);
+ return ifs_is_fully_uptodate(folio, ifs);
}
static void iomap_set_range_uptodate(struct folio *folio, size_t off,
size_t len)
{
struct iomap_folio_state *ifs = folio->private;
+ unsigned long flags;
+ bool uptodate = true;
- if (ifs)
- ifs_set_range_uptodate(folio, ifs, off, len);
- else
+ if (ifs) {
+ spin_lock_irqsave(&ifs->state_lock, flags);
+ uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ spin_unlock_irqrestore(&ifs->state_lock, flags);
+ }
+
+ if (uptodate)
folio_mark_uptodate(folio);
}
if (!ifs)
return;
- WARN_ON_ONCE(atomic_read(&ifs->read_bytes_pending));
+ WARN_ON_ONCE(ifs->read_bytes_pending != 0);
WARN_ON_ONCE(atomic_read(&ifs->write_bytes_pending));
WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
folio_test_uptodate(folio));
*lenp = plen;
}
- static void iomap_finish_folio_read(struct folio *folio, size_t offset,
+ static void iomap_finish_folio_read(struct folio *folio, size_t off,
size_t len, int error)
{
struct iomap_folio_state *ifs = folio->private;
+ bool uptodate = !error;
+ bool finished = true;
- if (unlikely(error)) {
- folio_clear_uptodate(folio);
- folio_set_error(folio);
- } else {
- iomap_set_range_uptodate(folio, offset, len);
+ if (ifs) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&ifs->state_lock, flags);
+ if (!error)
+ uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ ifs->read_bytes_pending -= len;
+ finished = !ifs->read_bytes_pending;
+ spin_unlock_irqrestore(&ifs->state_lock, flags);
}
- if (!ifs || atomic_sub_and_test(len, &ifs->read_bytes_pending))
- folio_unlock(folio);
+ if (error)
+ folio_set_error(folio);
+ if (finished)
+ folio_end_read(folio, uptodate);
}
static void iomap_read_end_io(struct bio *bio)
}
ctx->cur_folio_in_bio = true;
- if (ifs)
- atomic_add(plen, &ifs->read_bytes_pending);
+ if (ifs) {
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending += plen;
+ spin_unlock_irq(&ifs->state_lock);
+ }
sector = iomap_sector(iomap, pos);
if (!ctx->bio ||
size_t bytes; /* Bytes to write to folio */
size_t copied; /* Bytes copied from user */
+ bytes = iov_iter_count(i);
+retry:
offset = pos & (chunk - 1);
- bytes = min(chunk - offset, iov_iter_count(i));
+ bytes = min(chunk - offset, bytes);
status = balance_dirty_pages_ratelimited_flags(mapping,
bdp_flags);
if (unlikely(status))
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
- if (copied)
- bytes = copied;
if (chunk > PAGE_SIZE)
chunk /= 2;
+ if (copied) {
+ bytes = copied;
+ goto retry;
+ }
} else {
pos += status;
written += status;
}
#endif /* CONFIG_NFS_V4_2 */
- static struct shrinker acl_shrinker = {
- .count_objects = nfs_access_cache_count,
- .scan_objects = nfs_access_cache_scan,
- .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *acl_shrinker;
/*
* Register the NFS filesystems
ret = nfs_register_sysctl();
if (ret < 0)
goto error_2;
- ret = register_shrinker(&acl_shrinker, "nfs-acl");
- if (ret < 0)
+
+ acl_shrinker = shrinker_alloc(0, "nfs-acl");
+ if (!acl_shrinker) {
+ ret = -ENOMEM;
goto error_3;
+ }
+
+ acl_shrinker->count_objects = nfs_access_cache_count;
+ acl_shrinker->scan_objects = nfs_access_cache_scan;
+
+ shrinker_register(acl_shrinker);
+
#ifdef CONFIG_NFS_V4_2
nfs_ssc_register_ops();
#endif
*/
void __exit unregister_nfs_fs(void)
{
- unregister_shrinker(&acl_shrinker);
+ shrinker_free(acl_shrinker);
nfs_unregister_sysctl();
unregister_nfs4_fs();
#ifdef CONFIG_NFS_V4_2
sb->s_export_op = &nfs_export_ops;
break;
case 4:
- sb->s_flags |= SB_POSIXACL;
+ sb->s_iflags |= SB_I_NOUMASK;
sb->s_time_gran = 1;
sb->s_time_min = S64_MIN;
sb->s_time_max = S64_MAX;
return ret;
}
- static struct shrinker nfsd_file_shrinker = {
- .scan_objects = nfsd_file_lru_scan,
- .count_objects = nfsd_file_lru_count,
- .seeks = 1,
- };
+ static struct shrinker *nfsd_file_shrinker;
/**
* nfsd_file_cond_queue - conditionally unhash and queue a nfsd_file
goto out_err;
}
- ret = register_shrinker(&nfsd_file_shrinker, "nfsd-filecache");
- if (ret) {
- pr_err("nfsd: failed to register nfsd_file_shrinker: %d\n", ret);
+ nfsd_file_shrinker = shrinker_alloc(0, "nfsd-filecache");
+ if (!nfsd_file_shrinker) {
+ ret = -ENOMEM;
+ pr_err("nfsd: failed to allocate nfsd_file_shrinker\n");
goto out_lru;
}
+ nfsd_file_shrinker->count_objects = nfsd_file_lru_count;
+ nfsd_file_shrinker->scan_objects = nfsd_file_lru_scan;
+ nfsd_file_shrinker->seeks = 1;
+
+ shrinker_register(nfsd_file_shrinker);
+
ret = lease_register_notifier(&nfsd_file_lease_notifier);
if (ret) {
pr_err("nfsd: unable to register lease notifier: %d\n", ret);
out_notifier:
lease_unregister_notifier(&nfsd_file_lease_notifier);
out_shrinker:
- unregister_shrinker(&nfsd_file_shrinker);
+ shrinker_free(nfsd_file_shrinker);
out_lru:
list_lru_destroy(&nfsd_file_lru);
out_err:
return;
lease_unregister_notifier(&nfsd_file_lease_notifier);
- unregister_shrinker(&nfsd_file_shrinker);
+ shrinker_free(nfsd_file_shrinker);
/*
* make sure all callers of nfsd_file_lru_cb are done before
* calling nfsd_file_cache_purge
unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
struct net *net = SVC_NET(rqstp);
struct nfsd_file *new, *nf;
- const struct cred *cred;
+ bool stale_retry = true;
bool open_retry = true;
struct inode *inode;
__be32 status;
int ret;
+retry:
status = fh_verify(rqstp, fhp, S_IFREG,
may_flags|NFSD_MAY_OWNER_OVERRIDE);
if (status != nfs_ok)
return status;
inode = d_inode(fhp->fh_dentry);
- cred = get_current_cred();
-retry:
rcu_read_lock();
- nf = nfsd_file_lookup_locked(net, cred, inode, need, want_gc);
+ nf = nfsd_file_lookup_locked(net, current_cred(), inode, need, want_gc);
rcu_read_unlock();
if (nf) {
rcu_read_lock();
spin_lock(&inode->i_lock);
- nf = nfsd_file_lookup_locked(net, cred, inode, need, want_gc);
+ nf = nfsd_file_lookup_locked(net, current_cred(), inode, need, want_gc);
if (unlikely(nf)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
goto construction_err;
}
open_retry = false;
+ fh_put(fhp);
goto retry;
}
this_cpu_inc(nfsd_file_cache_hits);
nfsd_file_check_write_error(nf);
*pnf = nf;
}
- put_cred(cred);
trace_nfsd_file_acquire(rqstp, inode, may_flags, nf, status);
return status;
status = nfs_ok;
trace_nfsd_file_opened(nf, status);
} else {
- status = nfsd_open_verified(rqstp, fhp, may_flags,
- &nf->nf_file);
+ ret = nfsd_open_verified(rqstp, fhp, may_flags,
+ &nf->nf_file);
+ if (ret == -EOPENSTALE && stale_retry) {
+ stale_retry = false;
+ nfsd_file_unhash(nf);
+ clear_and_wake_up_bit(NFSD_FILE_PENDING,
+ &nf->nf_flags);
+ if (refcount_dec_and_test(&nf->nf_ref))
+ nfsd_file_free(nf);
+ nf = NULL;
+ fh_put(fhp);
+ goto retry;
+ }
+ status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
}
} else
#define NFSDDBG_FACILITY NFSDDBG_PROC
-#define all_ones {{~0,~0},~0}
+#define all_ones {{ ~0, ~0}, ~0}
static const stateid_t one_stateid = {
.si_generation = ~0,
.si_opaque = all_ones,
static const struct nfsd4_callback_ops nfsd4_cb_recall_ops;
static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops;
+static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops;
static struct workqueue_struct *laundry_wq;
nbl = find_blocked_lock(lo, fh, nn);
if (!nbl) {
- nbl= kmalloc(sizeof(*nbl), GFP_KERNEL);
+ nbl = kmalloc(sizeof(*nbl), GFP_KERNEL);
if (nbl) {
INIT_LIST_HEAD(&nbl->nbl_list);
INIT_LIST_HEAD(&nbl->nbl_lru);
struct nfs4_clnt_odstate *odstate, u32 dl_type)
{
struct nfs4_delegation *dp;
+ struct nfs4_stid *stid;
long n;
dprintk("NFSD alloc_init_deleg\n");
goto out_dec;
if (delegation_blocked(&fp->fi_fhandle))
goto out_dec;
- dp = delegstateid(nfs4_alloc_stid(clp, deleg_slab, nfs4_free_deleg));
- if (dp == NULL)
+ stid = nfs4_alloc_stid(clp, deleg_slab, nfs4_free_deleg);
+ if (stid == NULL)
goto out_dec;
+ dp = delegstateid(stid);
/*
* delegation seqid's are never incremented. The 4.1 special
dp->dl_recalled = false;
nfsd4_init_cb(&dp->dl_recall, dp->dl_stid.sc_client,
&nfsd4_cb_recall_ops, NFSPROC4_CLNT_CB_RECALL);
+ nfsd4_init_cb(&dp->dl_cb_fattr.ncf_getattr, dp->dl_stid.sc_client,
+ &nfsd4_cb_getattr_ops, NFSPROC4_CLNT_CB_GETATTR);
+ dp->dl_cb_fattr.ncf_file_modified = false;
+ dp->dl_cb_fattr.ncf_cb_bmap[0] = FATTR4_WORD0_CHANGE | FATTR4_WORD0_SIZE;
get_nfs4_file(fp);
dp->dl_stid.sc_file = fp;
return dp;
spin_unlock(&nn->client_lock);
}
+static int
+nfsd4_cb_getattr_done(struct nfsd4_callback *cb, struct rpc_task *task)
+{
+ struct nfs4_cb_fattr *ncf =
+ container_of(cb, struct nfs4_cb_fattr, ncf_getattr);
+
+ ncf->ncf_cb_status = task->tk_status;
+ switch (task->tk_status) {
+ case -NFS4ERR_DELAY:
+ rpc_delay(task, 2 * HZ);
+ return 0;
+ default:
+ return 1;
+ }
+}
+
+static void
+nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
+{
+ struct nfs4_cb_fattr *ncf =
+ container_of(cb, struct nfs4_cb_fattr, ncf_getattr);
+ struct nfs4_delegation *dp =
+ container_of(ncf, struct nfs4_delegation, dl_cb_fattr);
+
+ nfs4_put_stid(&dp->dl_stid);
+ clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags);
+ wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY);
+}
+
static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
.done = nfsd4_cb_recall_any_done,
.release = nfsd4_cb_recall_any_release,
};
+static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops = {
+ .done = nfsd4_cb_getattr_done,
+ .release = nfsd4_cb_getattr_release,
+};
+
+void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
+{
+ struct nfs4_delegation *dp =
+ container_of(ncf, struct nfs4_delegation, dl_cb_fattr);
+
+ if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
+ return;
+ refcount_inc(&dp->dl_stid.sc_count);
+ nfsd4_run_cb(&ncf->ncf_getattr);
+}
+
static struct nfs4_client *create_client(struct xdr_netobj name,
struct svc_rqst *rqstp, nfs4_verifier *verf)
{
nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
{
int count;
- struct nfsd_net *nn = container_of(shrink,
- struct nfsd_net, nfsd_client_shrinker);
+ struct nfsd_net *nn = shrink->private_data;
count = atomic_read(&nn->nfsd_courtesy_clients);
if (!count)
struct svc_fh *parent = NULL;
int cb_up;
int status = 0;
+ struct kstat stat;
+ struct path path;
cb_up = nfsd4_cb_channel_good(oo->oo_owner.so_client);
- open->op_recall = 0;
+ open->op_recall = false;
switch (open->op_claim_type) {
case NFS4_OPEN_CLAIM_PREVIOUS:
if (!cb_up)
- open->op_recall = 1;
+ open->op_recall = true;
break;
case NFS4_OPEN_CLAIM_NULL:
parent = currentfh;
if (open->op_share_access & NFS4_SHARE_ACCESS_WRITE) {
open->op_delegate_type = NFS4_OPEN_DELEGATE_WRITE;
trace_nfsd_deleg_write(&dp->dl_stid.sc_stateid);
+ path.mnt = currentfh->fh_export->ex_path.mnt;
+ path.dentry = currentfh->fh_dentry;
+ if (vfs_getattr(&path, &stat,
+ (STATX_SIZE | STATX_CTIME | STATX_CHANGE_COOKIE),
+ AT_STATX_SYNC_AS_STAT)) {
+ nfs4_put_stid(&dp->dl_stid);
+ destroy_delegation(dp);
+ goto out_no_deleg;
+ }
+ dp->dl_cb_fattr.ncf_cur_fsize = stat.size;
+ dp->dl_cb_fattr.ncf_initial_cinfo =
+ nfsd4_change_attribute(&stat, d_inode(currentfh->fh_dentry));
} else {
open->op_delegate_type = NFS4_OPEN_DELEGATE_READ;
trace_nfsd_deleg_read(&dp->dl_stid.sc_stateid);
if (open->op_claim_type == NFS4_OPEN_CLAIM_PREVIOUS &&
open->op_delegate_type != NFS4_OPEN_DELEGATE_NONE) {
dprintk("NFSD: WARNING: refusing delegation reclaim\n");
- open->op_recall = 1;
+ open->op_recall = true;
}
/* 4.1 client asking for a delegation? */
struct nfsd4_blocked_lock *nbl = NULL;
struct file_lock *file_lock = NULL;
struct file_lock *conflock = NULL;
+ struct super_block *sb;
__be32 status = 0;
int lkflg;
int err;
dprintk("NFSD: nfsd4_lock: permission denied!\n");
return status;
}
+ sb = cstate->current_fh.fh_dentry->d_sb;
if (lock->lk_is_new) {
if (nfsd4_has_session(cstate))
fp = lock_stp->st_stid.sc_file;
switch (lock->lk_type) {
case NFS4_READW_LT:
- if (nfsd4_has_session(cstate))
+ if (nfsd4_has_session(cstate) ||
+ exportfs_lock_op_is_async(sb->s_export_op))
fl_flags |= FL_SLEEP;
fallthrough;
case NFS4_READ_LT:
fl_type = F_RDLCK;
break;
case NFS4_WRITEW_LT:
- if (nfsd4_has_session(cstate))
+ if (nfsd4_has_session(cstate) ||
+ exportfs_lock_op_is_async(sb->s_export_op))
fl_flags |= FL_SLEEP;
fallthrough;
case NFS4_WRITE_LT:
* for file locks), so don't attempt blocking lock notifications
* on those filesystems:
*/
- if (nf->nf_file->f_op->lock)
+ if (!exportfs_lock_op_is_async(sb->s_export_op))
fl_flags &= ~FL_SLEEP;
nbl = find_or_allocate_block(lock_sop, &fp->fi_fhandle, nn);
return status;
}
+void nfsd4_lock_release(union nfsd4_op_u *u)
+{
+ struct nfsd4_lock *lock = &u->lock;
+ struct nfsd4_lock_denied *deny = &lock->lk_denied;
+
+ kfree(deny->ld_owner.data);
+}
+
/*
* The NFSv4 spec allows a client to do a LOCKT without holding an OPEN,
* so we do a temporary open here just to get an open file to pass to
return status;
}
+void nfsd4_lockt_release(union nfsd4_op_u *u)
+{
+ struct nfsd4_lockt *lockt = &u->lockt;
+ struct nfsd4_lock_denied *deny = &lockt->lt_denied;
+
+ kfree(deny->ld_owner.data);
+}
+
__be32
nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
union nfsd4_op_u *u)
INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
get_net(net);
- nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
- nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
- nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
-
- if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
+ nn->nfsd_client_shrinker = shrinker_alloc(0, "nfsd-client");
+ if (!nn->nfsd_client_shrinker)
goto err_shrinker;
+
+ nn->nfsd_client_shrinker->scan_objects = nfsd4_state_shrinker_scan;
+ nn->nfsd_client_shrinker->count_objects = nfsd4_state_shrinker_count;
+ nn->nfsd_client_shrinker->private_data = nn;
+
+ shrinker_register(nn->nfsd_client_shrinker);
+
return 0;
err_shrinker:
struct list_head *pos, *next, reaplist;
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
- unregister_shrinker(&nn->nfsd_client_shrinker);
+ shrinker_free(nn->nfsd_client_shrinker);
cancel_work(&nn->nfsd_shrinker_work);
cancel_delayed_work_sync(&nn->laundromat_work);
locks_end_grace(&nn->nfsd4_manager);
* nfsd4_deleg_getattr_conflict - Recall if GETATTR causes conflict
* @rqstp: RPC transaction context
* @inode: file to be checked for a conflict
+ * @modified: return true if file was modified
+ * @size: new size of file if modified is true
*
* This function is called when there is a conflict between a write
* delegation and a change/size GETATTR from another client. The server
* delegation before replying to the GETATTR. See RFC 8881 section
* 18.7.4.
*
- * The current implementation does not support CB_GETATTR yet. However
- * this can avoid recalling the delegation could be added in follow up
- * work.
- *
* Returns 0 if there is no conflict; otherwise an nfs_stat
* code is returned.
*/
__be32
-nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode)
+nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
+ bool *modified, u64 *size)
{
- __be32 status;
struct file_lock_context *ctx;
- struct file_lock *fl;
struct nfs4_delegation *dp;
+ struct nfs4_cb_fattr *ncf;
+ struct file_lock *fl;
+ struct iattr attrs;
+ __be32 status;
+ might_sleep();
+
+ *modified = false;
ctx = locks_inode_context(inode);
if (!ctx)
return 0;
break_lease:
spin_unlock(&ctx->flc_lock);
nfsd_stats_wdeleg_getattr_inc();
- status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
- if (status != nfserr_jukebox ||
- !nfsd_wait_for_delegreturn(rqstp, inode))
- return status;
+
+ dp = fl->fl_owner;
+ ncf = &dp->dl_cb_fattr;
+ nfs4_cb_getattr(&dp->dl_cb_fattr);
+ wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
+ if (ncf->ncf_cb_status) {
+ status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
+ if (status != nfserr_jukebox ||
+ !nfsd_wait_for_delegreturn(rqstp, inode))
+ return status;
+ }
+ if (!ncf->ncf_file_modified &&
+ (ncf->ncf_initial_cinfo != ncf->ncf_cb_change ||
+ ncf->ncf_cur_fsize != ncf->ncf_cb_fsize))
+ ncf->ncf_file_modified = true;
+ if (ncf->ncf_file_modified) {
+ /*
+ * The server would not update the file's metadata
+ * with the client's modified size.
+ */
+ attrs.ia_mtime = attrs.ia_ctime = current_time(inode);
+ attrs.ia_valid = ATTR_MTIME | ATTR_CTIME;
+ setattr_copy(&nop_mnt_idmap, inode, &attrs);
+ mark_inode_dirty(inode);
+ ncf->ncf_cur_fsize = ncf->ncf_cb_fsize;
+ *size = ncf->ncf_cur_fsize;
+ *modified = true;
+ }
return 0;
}
break;
struct buffer_head *head, *bh;
u32 bh_next, bh_off, to;
sector_t iblock;
- struct page *page;
+ struct folio *folio;
for (; idx < idx_end; idx += 1, from = 0) {
page_off = (loff_t)idx << PAGE_SHIFT;
PAGE_SIZE;
iblock = page_off >> inode->i_blkbits;
- page = find_or_create_page(mapping, idx,
- mapping_gfp_constraint(mapping,
- ~__GFP_FS));
- if (!page)
- return -ENOMEM;
+ folio = __filemap_get_folio(mapping, idx,
+ FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+ mapping_gfp_constraint(mapping, ~__GFP_FS));
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
- if (!page_has_buffers(page))
- create_empty_buffers(page, blocksize, 0);
+ head = folio_buffers(folio);
+ if (!head)
+ head = create_empty_buffers(folio, blocksize, 0);
- bh = head = page_buffers(page);
+ bh = head;
bh_off = 0;
do {
bh_next = bh_off + blocksize;
}
/* Ok, it's mapped. Make sure it's up-to-date. */
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
set_buffer_uptodate(bh);
if (!buffer_uptodate(bh)) {
err = bh_read(bh, 0);
if (err < 0) {
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
goto out;
}
}
} while (bh_off = bh_next, iblock += 1,
head != (bh = bh->b_this_page));
- zero_user_segment(page, from, to);
+ folio_zero_segment(folio, from, to);
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
cond_resched();
}
out:
err = 0;
}
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
mark_inode_dirty(inode);
if (IS_SYNC(inode)) {
ni_unlock(ni);
ni->std_fa |= FILE_ATTRIBUTE_ARCHIVE;
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
if (!IS_DIRSYNC(inode)) {
dirty = 1;
} else {
filemap_invalidate_unlock(mapping);
if (!err) {
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
mark_inode_dirty(inode);
}
}
static ssize_t ntfs_file_splice_read(struct file *in, loff_t *ppos,
- struct pipe_inode_info *pipe,
- size_t len, unsigned int flags)
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
{
struct inode *inode = in->f_mapping->host;
struct ntfs_inode *ni = ntfs_i(inode);
* read-in the blocks at the tail of our file. Avoid reading them by
* testing i_size against each block offset.
*/
- static int ocfs2_should_read_blk(struct inode *inode, struct page *page,
+ static int ocfs2_should_read_blk(struct inode *inode, struct folio *folio,
unsigned int block_start)
{
- u64 offset = page_offset(page) + block_start;
+ u64 offset = folio_pos(folio) + block_start;
if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb)))
return 1;
struct inode *inode, unsigned int from,
unsigned int to, int new)
{
+ struct folio *folio = page_folio(page);
int ret = 0;
struct buffer_head *head, *bh, *wait[2], **wait_bh = wait;
unsigned int block_end, block_start;
unsigned int bsize = i_blocksize(inode);
- if (!page_has_buffers(page))
- create_empty_buffers(page, bsize, 0);
+ head = folio_buffers(folio);
+ if (!head)
+ head = create_empty_buffers(folio, bsize, 0);
- head = page_buffers(page);
for (bh = head, block_start = 0; bh != head || !block_start;
bh = bh->b_this_page, block_start += bsize) {
block_end = block_start + bsize;
* they may belong to unallocated clusters.
*/
if (block_start >= to || block_end <= from) {
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
set_buffer_uptodate(bh);
continue;
}
clean_bdev_bh_alias(bh);
}
- if (PageUptodate(page)) {
+ if (folio_test_uptodate(folio)) {
set_buffer_uptodate(bh);
} else if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
!buffer_new(bh) &&
- ocfs2_should_read_blk(inode, page, block_start) &&
+ ocfs2_should_read_blk(inode, folio, block_start) &&
(block_start < from || block_end > to)) {
bh_read_nowait(bh, 0);
*wait_bh++=bh;
if (block_start >= to)
break;
- zero_user(page, block_start, bh->b_size);
+ folio_zero_range(folio, block_start, bh->b_size);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
inode->i_blocks = ocfs2_inode_sector_count(inode);
di->i_size = cpu_to_le64((u64)i_size_read(inode));
- inode->i_mtime = inode_set_ctime_current(inode);
- di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec);
- di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+ di->i_mtime = di->i_ctime = cpu_to_le64(inode_get_mtime_sec(inode));
+ di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode_get_mtime_nsec(inode));
if (handle)
ocfs2_update_inode_fsync_trans(handle, inode, 1);
}
#include <linux/shmem_fs.h>
#include <linux/uaccess.h>
#include <linux/pkeys.h>
+ #include <linux/minmax.h>
+ #include <linux/overflow.h>
#include <asm/elf.h>
#include <asm/tlb.h>
if (anon_name)
seq_printf(m, "[anon_shmem:%s]", anon_name->name);
else
- seq_file_path(m, file, "\n");
+ seq_path(m, file_user_path(file), "\n");
goto done;
}
return 0;
}
+ #define PM_SCAN_CATEGORIES (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \
+ PAGE_IS_FILE | PAGE_IS_PRESENT | \
+ PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
+ PAGE_IS_HUGE)
+ #define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
+
+ struct pagemap_scan_private {
+ struct pm_scan_arg arg;
+ unsigned long masks_of_interest, cur_vma_category;
+ struct page_region *vec_buf;
+ unsigned long vec_buf_len, vec_buf_index, found_pages;
+ struct page_region __user *vec_out;
+ };
+
+ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
+ struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte)
+ {
+ unsigned long categories = 0;
+
+ if (pte_present(pte)) {
+ struct page *page;
+
+ categories |= PAGE_IS_PRESENT;
+ if (!pte_uffd_wp(pte))
+ categories |= PAGE_IS_WRITTEN;
+
+ if (p->masks_of_interest & PAGE_IS_FILE) {
+ page = vm_normal_page(vma, addr, pte);
+ if (page && !PageAnon(page))
+ categories |= PAGE_IS_FILE;
+ }
+
+ if (is_zero_pfn(pte_pfn(pte)))
+ categories |= PAGE_IS_PFNZERO;
+ } else if (is_swap_pte(pte)) {
+ swp_entry_t swp;
+
+ categories |= PAGE_IS_SWAPPED;
+ if (!pte_swp_uffd_wp_any(pte))
+ categories |= PAGE_IS_WRITTEN;
+
+ if (p->masks_of_interest & PAGE_IS_FILE) {
+ swp = pte_to_swp_entry(pte);
+ if (is_pfn_swap_entry(swp) &&
+ !PageAnon(pfn_swap_entry_to_page(swp)))
+ categories |= PAGE_IS_FILE;
+ }
+ }
+
+ return categories;
+ }
+
+ static void make_uffd_wp_pte(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte)
+ {
+ pte_t ptent = ptep_get(pte);
+
+ if (pte_present(ptent)) {
+ pte_t old_pte;
+
+ old_pte = ptep_modify_prot_start(vma, addr, pte);
+ ptent = pte_mkuffd_wp(ptent);
+ ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+ } else if (is_swap_pte(ptent)) {
+ ptent = pte_swp_mkuffd_wp(ptent);
+ set_pte_at(vma->vm_mm, addr, pte, ptent);
+ } else {
+ set_pte_at(vma->vm_mm, addr, pte,
+ make_pte_marker(PTE_MARKER_UFFD_WP));
+ }
+ }
+
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
+ struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd)
+ {
+ unsigned long categories = PAGE_IS_HUGE;
+
+ if (pmd_present(pmd)) {
+ struct page *page;
+
+ categories |= PAGE_IS_PRESENT;
+ if (!pmd_uffd_wp(pmd))
+ categories |= PAGE_IS_WRITTEN;
+
+ if (p->masks_of_interest & PAGE_IS_FILE) {
+ page = vm_normal_page_pmd(vma, addr, pmd);
+ if (page && !PageAnon(page))
+ categories |= PAGE_IS_FILE;
+ }
+
+ if (is_zero_pfn(pmd_pfn(pmd)))
+ categories |= PAGE_IS_PFNZERO;
+ } else if (is_swap_pmd(pmd)) {
+ swp_entry_t swp;
+
+ categories |= PAGE_IS_SWAPPED;
+ if (!pmd_swp_uffd_wp(pmd))
+ categories |= PAGE_IS_WRITTEN;
+
+ if (p->masks_of_interest & PAGE_IS_FILE) {
+ swp = pmd_to_swp_entry(pmd);
+ if (is_pfn_swap_entry(swp) &&
+ !PageAnon(pfn_swap_entry_to_page(swp)))
+ categories |= PAGE_IS_FILE;
+ }
+ }
+
+ return categories;
+ }
+
+ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp)
+ {
+ pmd_t old, pmd = *pmdp;
+
+ if (pmd_present(pmd)) {
+ old = pmdp_invalidate_ad(vma, addr, pmdp);
+ pmd = pmd_mkuffd_wp(old);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+ pmd = pmd_swp_mkuffd_wp(pmd);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ }
+ }
+ #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+ #ifdef CONFIG_HUGETLB_PAGE
+ static unsigned long pagemap_hugetlb_category(pte_t pte)
+ {
+ unsigned long categories = PAGE_IS_HUGE;
+
+ /*
+ * According to pagemap_hugetlb_range(), file-backed HugeTLB
+ * page cannot be swapped. So PAGE_IS_FILE is not checked for
+ * swapped pages.
+ */
+ if (pte_present(pte)) {
+ categories |= PAGE_IS_PRESENT;
+ if (!huge_pte_uffd_wp(pte))
+ categories |= PAGE_IS_WRITTEN;
+ if (!PageAnon(pte_page(pte)))
+ categories |= PAGE_IS_FILE;
+ if (is_zero_pfn(pte_pfn(pte)))
+ categories |= PAGE_IS_PFNZERO;
+ } else if (is_swap_pte(pte)) {
+ categories |= PAGE_IS_SWAPPED;
+ if (!pte_swp_uffd_wp_any(pte))
+ categories |= PAGE_IS_WRITTEN;
+ }
+
+ return categories;
+ }
+
+ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t ptent)
+ {
+ unsigned long psize;
+
+ if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent))
+ return;
+
+ psize = huge_page_size(hstate_vma(vma));
+
+ if (is_hugetlb_entry_migration(ptent))
+ set_huge_pte_at(vma->vm_mm, addr, ptep,
+ pte_swp_mkuffd_wp(ptent), psize);
+ else if (!huge_pte_none(ptent))
+ huge_ptep_modify_prot_commit(vma, addr, ptep, ptent,
+ huge_pte_mkuffd_wp(ptent));
+ else
+ set_huge_pte_at(vma->vm_mm, addr, ptep,
+ make_pte_marker(PTE_MARKER_UFFD_WP), psize);
+ }
+ #endif /* CONFIG_HUGETLB_PAGE */
+
+ #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
+ static void pagemap_scan_backout_range(struct pagemap_scan_private *p,
+ unsigned long addr, unsigned long end)
+ {
+ struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
+
+ if (cur_buf->start != addr)
+ cur_buf->end = addr;
+ else
+ cur_buf->start = cur_buf->end = 0;
+
+ p->found_pages -= (end - addr) / PAGE_SIZE;
+ }
+ #endif
+
+ static bool pagemap_scan_is_interesting_page(unsigned long categories,
+ const struct pagemap_scan_private *p)
+ {
+ categories ^= p->arg.category_inverted;
+ if ((categories & p->arg.category_mask) != p->arg.category_mask)
+ return false;
+ if (p->arg.category_anyof_mask && !(categories & p->arg.category_anyof_mask))
+ return false;
+
+ return true;
+ }
+
+ static bool pagemap_scan_is_interesting_vma(unsigned long categories,
+ const struct pagemap_scan_private *p)
+ {
+ unsigned long required = p->arg.category_mask & PAGE_IS_WPALLOWED;
+
+ categories ^= p->arg.category_inverted;
+ if ((categories & required) != required)
+ return false;
+
+ return true;
+ }
+
+ static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+ {
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long vma_category = 0;
+
+ if (userfaultfd_wp_async(vma) && userfaultfd_wp_use_markers(vma))
+ vma_category |= PAGE_IS_WPALLOWED;
+ else if (p->arg.flags & PM_SCAN_CHECK_WPASYNC)
+ return -EPERM;
+
+ if (vma->vm_flags & VM_PFNMAP)
+ return 1;
+
+ if (!pagemap_scan_is_interesting_vma(vma_category, p))
+ return 1;
+
+ p->cur_vma_category = vma_category;
+
+ return 0;
+ }
+
+ static bool pagemap_scan_push_range(unsigned long categories,
+ struct pagemap_scan_private *p,
+ unsigned long addr, unsigned long end)
+ {
+ struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
+
+ /*
+ * When there is no output buffer provided at all, the sentinel values
+ * won't match here. There is no other way for `cur_buf->end` to be
+ * non-zero other than it being non-empty.
+ */
+ if (addr == cur_buf->end && categories == cur_buf->categories) {
+ cur_buf->end = end;
+ return true;
+ }
+
+ if (cur_buf->end) {
+ if (p->vec_buf_index >= p->vec_buf_len - 1)
+ return false;
+
+ cur_buf = &p->vec_buf[++p->vec_buf_index];
+ }
+
+ cur_buf->start = addr;
+ cur_buf->end = end;
+ cur_buf->categories = categories;
+
+ return true;
+ }
+
+ static int pagemap_scan_output(unsigned long categories,
+ struct pagemap_scan_private *p,
+ unsigned long addr, unsigned long *end)
+ {
+ unsigned long n_pages, total_pages;
+ int ret = 0;
+
+ if (!p->vec_buf)
+ return 0;
+
+ categories &= p->arg.return_mask;
+
+ n_pages = (*end - addr) / PAGE_SIZE;
+ if (check_add_overflow(p->found_pages, n_pages, &total_pages) ||
+ total_pages > p->arg.max_pages) {
+ size_t n_too_much = total_pages - p->arg.max_pages;
+ *end -= n_too_much * PAGE_SIZE;
+ n_pages -= n_too_much;
+ ret = -ENOSPC;
+ }
+
+ if (!pagemap_scan_push_range(categories, p, addr, *end)) {
+ *end = addr;
+ n_pages = 0;
+ ret = -ENOSPC;
+ }
+
+ p->found_pages += n_pages;
+ if (ret)
+ p->arg.walk_end = *end;
+
+ return ret;
+ }
+
+ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
+ unsigned long end, struct mm_walk *walk)
+ {
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long categories;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return -ENOENT;
+
+ categories = p->cur_vma_category |
+ pagemap_thp_category(p, vma, start, *pmd);
+
+ if (!pagemap_scan_is_interesting_page(categories, p))
+ goto out_unlock;
+
+ ret = pagemap_scan_output(categories, p, start, &end);
+ if (start == end)
+ goto out_unlock;
+
+ if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+ goto out_unlock;
+ if (~categories & PAGE_IS_WRITTEN)
+ goto out_unlock;
+
+ /*
+ * Break huge page into small pages if the WP operation
+ * needs to be performed on a portion of the huge page.
+ */
+ if (end != start + HPAGE_SIZE) {
+ spin_unlock(ptl);
+ split_huge_pmd(vma, pmd, start);
+ pagemap_scan_backout_range(p, start, end);
+ /* Report as if there was no THP */
+ return -ENOENT;
+ }
+
+ make_uffd_wp_pmd(vma, start, pmd);
+ flush_tlb_range(vma, start, end);
+ out_unlock:
+ spin_unlock(ptl);
+ return ret;
+ #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+ return -ENOENT;
+ #endif
+ }
+
+ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+ unsigned long end, struct mm_walk *walk)
+ {
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long addr, flush_end = 0;
+ pte_t *pte, *start_pte;
+ spinlock_t *ptl;
+ int ret;
+
+ arch_enter_lazy_mmu_mode();
+
+ ret = pagemap_scan_thp_entry(pmd, start, end, walk);
+ if (ret != -ENOENT) {
+ arch_leave_lazy_mmu_mode();
+ return ret;
+ }
+
+ ret = 0;
+ start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+ if (!pte) {
+ arch_leave_lazy_mmu_mode();
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
+
+ if (!p->vec_out) {
+ /* Fast path for performing exclusive WP */
+ for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
+ if (pte_uffd_wp(ptep_get(pte)))
+ continue;
+ make_uffd_wp_pte(vma, addr, pte);
+ if (!flush_end)
+ start = addr;
+ flush_end = addr + PAGE_SIZE;
+ }
+ goto flush_and_return;
+ }
+
+ if (!p->arg.category_anyof_mask && !p->arg.category_inverted &&
+ p->arg.category_mask == PAGE_IS_WRITTEN &&
+ p->arg.return_mask == PAGE_IS_WRITTEN) {
+ for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
+ unsigned long next = addr + PAGE_SIZE;
+
+ if (pte_uffd_wp(ptep_get(pte)))
+ continue;
+ ret = pagemap_scan_output(p->cur_vma_category | PAGE_IS_WRITTEN,
+ p, addr, &next);
+ if (next == addr)
+ break;
+ if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+ continue;
+ make_uffd_wp_pte(vma, addr, pte);
+ if (!flush_end)
+ start = addr;
+ flush_end = next;
+ }
+ goto flush_and_return;
+ }
+
+ for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
+ unsigned long categories = p->cur_vma_category |
+ pagemap_page_category(p, vma, addr, ptep_get(pte));
+ unsigned long next = addr + PAGE_SIZE;
+
+ if (!pagemap_scan_is_interesting_page(categories, p))
+ continue;
+
+ ret = pagemap_scan_output(categories, p, addr, &next);
+ if (next == addr)
+ break;
+
+ if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+ continue;
+ if (~categories & PAGE_IS_WRITTEN)
+ continue;
+
+ make_uffd_wp_pte(vma, addr, pte);
+ if (!flush_end)
+ start = addr;
+ flush_end = next;
+ }
+
+ flush_and_return:
+ if (flush_end)
+ flush_tlb_range(vma, start, addr);
+
+ pte_unmap_unlock(start_pte, ptl);
+ arch_leave_lazy_mmu_mode();
+
+ cond_resched();
+ return ret;
+ }
+
+ #ifdef CONFIG_HUGETLB_PAGE
+ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+ unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+ {
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long categories;
+ spinlock_t *ptl;
+ int ret = 0;
+ pte_t pte;
+
+ if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
+ /* Go the short route when not write-protecting pages. */
+
+ pte = huge_ptep_get(ptep);
+ categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+
+ if (!pagemap_scan_is_interesting_page(categories, p))
+ return 0;
+
+ return pagemap_scan_output(categories, p, start, &end);
+ }
+
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+ ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
+
+ pte = huge_ptep_get(ptep);
+ categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
+
+ if (!pagemap_scan_is_interesting_page(categories, p))
+ goto out_unlock;
+
+ ret = pagemap_scan_output(categories, p, start, &end);
+ if (start == end)
+ goto out_unlock;
+
+ if (~categories & PAGE_IS_WRITTEN)
+ goto out_unlock;
+
+ if (end != start + HPAGE_SIZE) {
+ /* Partial HugeTLB page WP isn't possible. */
+ pagemap_scan_backout_range(p, start, end);
+ p->arg.walk_end = start;
+ ret = 0;
+ goto out_unlock;
+ }
+
+ make_uffd_wp_huge_pte(vma, start, ptep, pte);
+ flush_hugetlb_tlb_range(vma, start, end);
+
+ out_unlock:
+ spin_unlock(ptl);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+
+ return ret;
+ }
+ #else
+ #define pagemap_scan_hugetlb_entry NULL
+ #endif
+
+ static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+ int depth, struct mm_walk *walk)
+ {
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ int ret, err;
+
+ if (!vma || !pagemap_scan_is_interesting_page(p->cur_vma_category, p))
+ return 0;
+
+ ret = pagemap_scan_output(p->cur_vma_category, p, addr, &end);
+ if (addr == end)
+ return ret;
+
+ if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+ return ret;
+
+ err = uffd_wp_range(vma, addr, end - addr, true);
+ if (err < 0)
+ ret = err;
+
+ return ret;
+ }
+
+ static const struct mm_walk_ops pagemap_scan_ops = {
+ .test_walk = pagemap_scan_test_walk,
+ .pmd_entry = pagemap_scan_pmd_entry,
+ .pte_hole = pagemap_scan_pte_hole,
+ .hugetlb_entry = pagemap_scan_hugetlb_entry,
+ };
+
+ static int pagemap_scan_get_args(struct pm_scan_arg *arg,
+ unsigned long uarg)
+ {
+ if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
+ return -EFAULT;
+
+ if (arg->size != sizeof(struct pm_scan_arg))
+ return -EINVAL;
+
+ /* Validate requested features */
+ if (arg->flags & ~PM_SCAN_FLAGS)
+ return -EINVAL;
+ if ((arg->category_inverted | arg->category_mask |
+ arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
+ return -EINVAL;
+
+ arg->start = untagged_addr((unsigned long)arg->start);
+ arg->end = untagged_addr((unsigned long)arg->end);
+ arg->vec = untagged_addr((unsigned long)arg->vec);
+
+ /* Validate memory pointers */
+ if (!IS_ALIGNED(arg->start, PAGE_SIZE))
+ return -EINVAL;
+ if (!access_ok((void __user *)(long)arg->start, arg->end - arg->start))
+ return -EFAULT;
+ if (!arg->vec && arg->vec_len)
+ return -EINVAL;
+ if (arg->vec && !access_ok((void __user *)(long)arg->vec,
+ arg->vec_len * sizeof(struct page_region)))
+ return -EFAULT;
+
+ /* Fixup default values */
+ arg->end = ALIGN(arg->end, PAGE_SIZE);
+ arg->walk_end = 0;
+ if (!arg->max_pages)
+ arg->max_pages = ULONG_MAX;
+
+ return 0;
+ }
+
+ static int pagemap_scan_writeback_args(struct pm_scan_arg *arg,
+ unsigned long uargl)
+ {
+ struct pm_scan_arg __user *uarg = (void __user *)uargl;
+
+ if (copy_to_user(&uarg->walk_end, &arg->walk_end, sizeof(arg->walk_end)))
+ return -EFAULT;
+
+ return 0;
+ }
+
+ static int pagemap_scan_init_bounce_buffer(struct pagemap_scan_private *p)
+ {
+ if (!p->arg.vec_len)
+ return 0;
+
+ p->vec_buf_len = min_t(size_t, PAGEMAP_WALK_SIZE >> PAGE_SHIFT,
+ p->arg.vec_len);
+ p->vec_buf = kmalloc_array(p->vec_buf_len, sizeof(*p->vec_buf),
+ GFP_KERNEL);
+ if (!p->vec_buf)
+ return -ENOMEM;
+
+ p->vec_buf->start = p->vec_buf->end = 0;
+ p->vec_out = (struct page_region __user *)(long)p->arg.vec;
+
+ return 0;
+ }
+
+ static long pagemap_scan_flush_buffer(struct pagemap_scan_private *p)
+ {
+ const struct page_region *buf = p->vec_buf;
+ long n = p->vec_buf_index;
+
+ if (!p->vec_buf)
+ return 0;
+
+ if (buf[n].end != buf[n].start)
+ n++;
+
+ if (!n)
+ return 0;
+
+ if (copy_to_user(p->vec_out, buf, n * sizeof(*buf)))
+ return -EFAULT;
+
+ p->arg.vec_len -= n;
+ p->vec_out += n;
+
+ p->vec_buf_index = 0;
+ p->vec_buf_len = min_t(size_t, p->vec_buf_len, p->arg.vec_len);
+ p->vec_buf->start = p->vec_buf->end = 0;
+
+ return n;
+ }
+
+ static long do_pagemap_scan(struct mm_struct *mm, unsigned long uarg)
+ {
+ struct mmu_notifier_range range;
+ struct pagemap_scan_private p = {0};
+ unsigned long walk_start;
+ size_t n_ranges_out = 0;
+ int ret;
+
+ ret = pagemap_scan_get_args(&p.arg, uarg);
+ if (ret)
+ return ret;
+
+ p.masks_of_interest = p.arg.category_mask | p.arg.category_anyof_mask |
+ p.arg.return_mask;
+ ret = pagemap_scan_init_bounce_buffer(&p);
+ if (ret)
+ return ret;
+
+ /* Protection change for the range is going to happen. */
+ if (p.arg.flags & PM_SCAN_WP_MATCHING) {
+ mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+ mm, p.arg.start, p.arg.end);
+ mmu_notifier_invalidate_range_start(&range);
+ }
+
+ for (walk_start = p.arg.start; walk_start < p.arg.end;
+ walk_start = p.arg.walk_end) {
+ long n_out;
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+
+ ret = mmap_read_lock_killable(mm);
+ if (ret)
+ break;
+ ret = walk_page_range(mm, walk_start, p.arg.end,
+ &pagemap_scan_ops, &p);
+ mmap_read_unlock(mm);
+
+ n_out = pagemap_scan_flush_buffer(&p);
+ if (n_out < 0)
+ ret = n_out;
+ else
+ n_ranges_out += n_out;
+
+ if (ret != -ENOSPC)
+ break;
+
+ if (p.arg.vec_len == 0 || p.found_pages == p.arg.max_pages)
+ break;
+ }
+
+ /* ENOSPC signifies early stop (buffer full) from the walk. */
+ if (!ret || ret == -ENOSPC)
+ ret = n_ranges_out;
+
+ /* The walk_end isn't set when ret is zero */
+ if (!p.arg.walk_end)
+ p.arg.walk_end = p.arg.end;
+ if (pagemap_scan_writeback_args(&p.arg, uarg))
+ ret = -EFAULT;
+
+ if (p.arg.flags & PM_SCAN_WP_MATCHING)
+ mmu_notifier_invalidate_range_end(&range);
+
+ kfree(p.vec_buf);
+ return ret;
+ }
+
+ static long do_pagemap_cmd(struct file *file, unsigned int cmd,
+ unsigned long arg)
+ {
+ struct mm_struct *mm = file->private_data;
+
+ switch (cmd) {
+ case PAGEMAP_SCAN:
+ return do_pagemap_scan(mm, arg);
+
+ default:
+ return -EINVAL;
+ }
+ }
+
const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
.open = pagemap_open,
.release = pagemap_release,
+ .unlocked_ioctl = do_pagemap_cmd,
+ .compat_ioctl = do_pagemap_cmd,
};
#endif /* CONFIG_PROC_PAGE_MONITOR */
struct numa_maps *md = &numa_priv->md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
- struct mempolicy *pol;
char buffer[64];
+ struct mempolicy *pol;
+ pgoff_t ilx;
int nid;
if (!mm)
/* Ensure we start with an empty set of numa_maps statistics. */
memset(md, 0, sizeof(*md));
- pol = __get_vma_policy(vma, vma->vm_start);
+ pol = __get_vma_policy(vma, vma->vm_start, &ilx);
if (pol) {
mpol_to_str(buffer, sizeof(buffer), pol);
mpol_cond_put(pol);
if (file) {
seq_puts(m, " file=");
- seq_file_path(m, file, "\n\t= ");
+ seq_path(m, file_user_path(file), "\n\t= ");
} else if (vma_is_initial_heap(vma)) {
seq_puts(m, " heap");
} else if (vma_is_initial_stack(vma)) {
* All dquots are placed to the end of inuse_list when first created, and this
* list is used for invalidate operation, which must look at every dquot.
*
- * When the last reference of a dquot will be dropped, the dquot will be
- * added to releasing_dquots. We'd then queue work item which would call
+ * When the last reference of a dquot is dropped, the dquot is added to
+ * releasing_dquots. We'll then queue work item which will call
* synchronize_srcu() and after that perform the final cleanup of all the
- * dquots on the list. Both releasing_dquots and free_dquots use the
- * dq_free list_head in the dquot struct. When a dquot is removed from
- * releasing_dquots, a reference count is always subtracted, and if
- * dq_count == 0 at that point, the dquot will be added to the free_dquots.
+ * dquots on the list. Each cleaned up dquot is moved to free_dquots list.
+ * Both releasing_dquots and free_dquots use the dq_free list_head in the dquot
+ * struct.
*
- * Unused dquots (dq_count == 0) are added to the free_dquots list when freed,
- * and this list is searched whenever we need an available dquot. Dquots are
- * removed from the list as soon as they are used again, and
- * dqstats.free_dquots gives the number of dquots on the list. When
- * dquot is invalidated it's completely released from memory.
+ * Unused and cleaned up dquots are in the free_dquots list and this list is
+ * searched whenever we need an available dquot. Dquots are removed from the
+ * list as soon as they are used again and dqstats.free_dquots gives the number
+ * of dquots on the list. When dquot is invalidated it's completely released
+ * from memory.
*
* Dirty dquots are added to the dqi_dirty_list of quota_info when mark
* dirtied, and this list is searched when writing dirty dquots back to
static inline void put_releasing_dquots(struct dquot *dquot)
{
list_add_tail(&dquot->dq_free, &releasing_dquots);
+ set_bit(DQ_RELEASING_B, &dquot->dq_flags);
}
static inline void remove_free_dquot(struct dquot *dquot)
if (list_empty(&dquot->dq_free))
return;
list_del_init(&dquot->dq_free);
- if (!atomic_read(&dquot->dq_count))
+ if (!test_bit(DQ_RELEASING_B, &dquot->dq_flags))
dqstats_dec(DQST_FREE_DQUOTS);
+ else
+ clear_bit(DQ_RELEASING_B, &dquot->dq_flags);
}
static inline void put_inuse(struct dquot *dquot)
continue;
/* Wait for dquot users */
if (atomic_read(&dquot->dq_count)) {
- /* dquot in releasing_dquots, flush and retry */
- if (!list_empty(&dquot->dq_free)) {
- spin_unlock(&dq_list_lock);
- goto restart;
- }
-
atomic_inc(&dquot->dq_count);
spin_unlock(&dq_list_lock);
/*
* restart. */
goto restart;
}
+ /*
+ * The last user already dropped its reference but dquot didn't
+ * get fully cleaned up yet. Restart the scan which flushes the
+ * work cleaning up released dquots.
+ */
+ if (test_bit(DQ_RELEASING_B, &dquot->dq_flags)) {
+ spin_unlock(&dq_list_lock);
+ goto restart;
+ }
/*
* Quota now has no users and it has been written on last
* dqput()
dq_dirty);
WARN_ON(!dquot_active(dquot));
+ /* If the dquot is releasing we should not touch it */
+ if (test_bit(DQ_RELEASING_B, &dquot->dq_flags)) {
+ spin_unlock(&dq_list_lock);
+ flush_delayed_work("a_release_work);
+ spin_lock(&dq_list_lock);
+ continue;
+ }
/* Now we have active dquot from which someone is
* holding reference so we can safely just increase
percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
}
- static struct shrinker dqcache_shrinker = {
- .count_objects = dqcache_shrink_count,
- .scan_objects = dqcache_shrink_scan,
- .seeks = DEFAULT_SEEKS,
- };
-
/*
* Safely release dquot and put reference to dquot.
*/
/* Exchange the list head to avoid livelock. */
list_replace_init(&releasing_dquots, &rls_head);
spin_unlock(&dq_list_lock);
+ synchronize_srcu(&dquot_srcu);
restart:
- synchronize_srcu(&dquot_srcu);
spin_lock(&dq_list_lock);
while (!list_empty(&rls_head)) {
dquot = list_first_entry(&rls_head, struct dquot, dq_free);
- /* Dquot got used again? */
- if (atomic_read(&dquot->dq_count) > 1) {
- remove_free_dquot(dquot);
- atomic_dec(&dquot->dq_count);
- continue;
- }
+ WARN_ON_ONCE(atomic_read(&dquot->dq_count));
+ /*
+ * Note that DQ_RELEASING_B protects us from racing with
+ * invalidate_dquots() calls so we are safe to work with the
+ * dquot even after we drop dq_list_lock.
+ */
if (dquot_dirty(dquot)) {
spin_unlock(&dq_list_lock);
/* Commit dquot before releasing */
}
/* Dquot is inactive and clean, now move it to free list */
remove_free_dquot(dquot);
- atomic_dec(&dquot->dq_count);
put_dquot_last(dquot);
}
spin_unlock(&dq_list_lock);
BUG_ON(!list_empty(&dquot->dq_free));
#endif
put_releasing_dquots(dquot);
+ atomic_dec(&dquot->dq_count);
spin_unlock(&dq_list_lock);
queue_delayed_work(system_unbound_wq, "a_release_work, 1);
}
dqstats_inc(DQST_LOOKUPS);
}
/* Wait for dq_lock - after this we know that either dquot_release() is
- * already finished or it will be canceled due to dq_count > 1 test */
+ * already finished or it will be canceled due to dq_count > 0 test */
wait_on_dquot(dquot);
/* Read the dquot / allocate space in quota file */
if (!dquot_active(dquot)) {
if (sb_has_quota_loaded(sb, type))
return -EBUSY;
+ /*
+ * Quota files should never be encrypted. They should be thought of as
+ * filesystem metadata, not user data. New-style internal quota files
+ * cannot be encrypted by users anyway, but old-style external quota
+ * files could potentially be incorrectly created in an encrypted
+ * directory, hence this explicit check. Some reasons why encrypted
+ * quota files don't work include: (1) some filesystems that support
+ * encryption don't handle it in their quota_read and quota_write, and
+ * (2) cleaning up encrypted quota files at unmount would need special
+ * consideration, as quota files are cleaned up later than user files.
+ */
+ if (IS_ENCRYPTED(inode))
+ return -EINVAL;
+
dqopt->files[type] = igrab(inode);
if (!dqopt->files[type])
return -EIO;
{
int i, ret;
unsigned long nr_hash, order;
+ struct shrinker *dqcache_shrinker;
printk(KERN_NOTICE "VFS: Disk quotas %s\n", __DQUOT_VERSION__);
pr_info("VFS: Dquot-cache hash table entries: %ld (order %ld,"
" %ld bytes)\n", nr_hash, order, (PAGE_SIZE << order));
- if (register_shrinker(&dqcache_shrinker, "dquota-cache"))
- panic("Cannot register dquot shrinker");
+ dqcache_shrinker = shrinker_alloc(0, "dquota-cache");
+ if (!dqcache_shrinker)
+ panic("Cannot allocate dquot shrinker");
+
+ dqcache_shrinker->count_objects = dqcache_shrink_count;
+ dqcache_shrinker->scan_objects = dqcache_shrink_scan;
+
+ shrinker_register(dqcache_shrinker);
return 0;
}
i_uid_write(inode, sd_v1_uid(sd));
i_gid_write(inode, sd_v1_gid(sd));
inode->i_size = sd_v1_size(sd);
- inode->i_atime.tv_sec = sd_v1_atime(sd);
- inode->i_mtime.tv_sec = sd_v1_mtime(sd);
+ inode_set_atime(inode, sd_v1_atime(sd), 0);
+ inode_set_mtime(inode, sd_v1_mtime(sd), 0);
inode_set_ctime(inode, sd_v1_ctime(sd), 0);
- inode->i_atime.tv_nsec = 0;
- inode->i_mtime.tv_nsec = 0;
inode->i_blocks = sd_v1_blocks(sd);
inode->i_generation = le32_to_cpu(INODE_PKEY(inode)->k_dir_id);
i_uid_write(inode, sd_v2_uid(sd));
inode->i_size = sd_v2_size(sd);
i_gid_write(inode, sd_v2_gid(sd));
- inode->i_mtime.tv_sec = sd_v2_mtime(sd);
- inode->i_atime.tv_sec = sd_v2_atime(sd);
+ inode_set_mtime(inode, sd_v2_mtime(sd), 0);
+ inode_set_atime(inode, sd_v2_atime(sd), 0);
inode_set_ctime(inode, sd_v2_ctime(sd), 0);
- inode->i_mtime.tv_nsec = 0;
- inode->i_atime.tv_nsec = 0;
inode->i_blocks = sd_v2_blocks(sd);
rdev = sd_v2_rdev(sd);
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
set_sd_v2_uid(sd_v2, i_uid_read(inode));
set_sd_v2_size(sd_v2, size);
set_sd_v2_gid(sd_v2, i_gid_read(inode));
- set_sd_v2_mtime(sd_v2, inode->i_mtime.tv_sec);
- set_sd_v2_atime(sd_v2, inode->i_atime.tv_sec);
- set_sd_v2_ctime(sd_v2, inode_get_ctime(inode).tv_sec);
+ set_sd_v2_mtime(sd_v2, inode_get_mtime_sec(inode));
+ set_sd_v2_atime(sd_v2, inode_get_atime_sec(inode));
+ set_sd_v2_ctime(sd_v2, inode_get_ctime_sec(inode));
set_sd_v2_blocks(sd_v2, to_fake_used_blocks(inode, SD_V2_SIZE));
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
set_sd_v2_rdev(sd_v2, new_encode_dev(inode->i_rdev));
set_sd_v1_gid(sd_v1, i_gid_read(inode));
set_sd_v1_nlink(sd_v1, inode->i_nlink);
set_sd_v1_size(sd_v1, size);
- set_sd_v1_atime(sd_v1, inode->i_atime.tv_sec);
- set_sd_v1_ctime(sd_v1, inode_get_ctime(inode).tv_sec);
- set_sd_v1_mtime(sd_v1, inode->i_mtime.tv_sec);
+ set_sd_v1_atime(sd_v1, inode_get_atime_sec(inode));
+ set_sd_v1_ctime(sd_v1, inode_get_ctime_sec(inode));
+ set_sd_v1_mtime(sd_v1, inode_get_mtime_sec(inode));
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
set_sd_v1_rdev(sd_v1, new_encode_dev(inode->i_rdev));
/* uid and gid must already be set by the caller for quota init */
- inode->i_mtime = inode->i_atime = inode_set_ctime_current(inode);
+ simple_inode_init_ts(inode);
inode->i_size = i_size;
inode->i_blocks = 0;
inode->i_bytes = 0;
* start/recovery path as __block_write_full_folio, along with special
* code to handle reiserfs tails.
*/
- static int reiserfs_write_full_page(struct page *page,
+ static int reiserfs_write_full_folio(struct folio *folio,
struct writeback_control *wbc)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = folio->mapping->host;
unsigned long end_index = inode->i_size >> PAGE_SHIFT;
int error = 0;
unsigned long block;
struct buffer_head *head, *bh;
int partial = 0;
int nr = 0;
- int checked = PageChecked(page);
+ int checked = folio_test_checked(folio);
struct reiserfs_transaction_handle th;
struct super_block *s = inode->i_sb;
int bh_per_page = PAGE_SIZE / s->s_blocksize;
/* no logging allowed when nonblocking or from PF_MEMALLOC */
if (checked && (current->flags & PF_MEMALLOC)) {
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
+ folio_redirty_for_writepage(wbc, folio);
+ folio_unlock(folio);
return 0;
}
/*
- * The page dirty bit is cleared before writepage is called, which
+ * The folio dirty bit is cleared before writepage is called, which
* means we have to tell create_empty_buffers to make dirty buffers
- * The page really should be up to date at this point, so tossing
+ * The folio really should be up to date at this point, so tossing
* in the BH_Uptodate is just a sanity check.
*/
- if (!page_has_buffers(page)) {
- create_empty_buffers(page, s->s_blocksize,
+ head = folio_buffers(folio);
+ if (!head)
+ head = create_empty_buffers(folio, s->s_blocksize,
(1 << BH_Dirty) | (1 << BH_Uptodate));
- }
- head = page_buffers(page);
/*
- * last page in the file, zero out any contents past the
+ * last folio in the file, zero out any contents past the
* last byte in the file
*/
- if (page->index >= end_index) {
+ if (folio->index >= end_index) {
unsigned last_offset;
last_offset = inode->i_size & (PAGE_SIZE - 1);
- /* no file contents in this page */
- if (page->index >= end_index + 1 || !last_offset) {
- unlock_page(page);
+ /* no file contents in this folio */
+ if (folio->index >= end_index + 1 || !last_offset) {
+ folio_unlock(folio);
return 0;
}
- zero_user_segment(page, last_offset, PAGE_SIZE);
+ folio_zero_segment(folio, last_offset, folio_size(folio));
}
bh = head;
- block = page->index << (PAGE_SHIFT - s->s_blocksize_bits);
+ block = folio->index << (PAGE_SHIFT - s->s_blocksize_bits);
last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
/* first map all the buffers, logging any direct items we find */
do {
if (block > last_block) {
/*
* This can happen when the block size is less than
- * the page size. The corresponding bytes in the page
+ * the folio size. The corresponding bytes in the folio
* were zero filled above
*/
clear_buffer_dirty(bh);
* blocks we're going to log
*/
if (checked) {
- ClearPageChecked(page);
+ folio_clear_checked(folio);
reiserfs_write_lock(s);
error = journal_begin(&th, s, bh_per_page + 1);
if (error) {
}
reiserfs_update_inode_transaction(inode);
}
- /* now go through and lock any dirty buffers on the page */
+ /* now go through and lock any dirty buffers on the folio */
do {
get_bh(bh);
if (!buffer_mapped(bh))
lock_buffer(bh);
} else {
if (!trylock_buffer(bh)) {
- redirty_page_for_writepage(wbc, page);
+ folio_redirty_for_writepage(wbc, folio);
continue;
}
}
if (error)
goto fail;
}
- BUG_ON(PageWriteback(page));
- set_page_writeback(page);
- unlock_page(page);
+ BUG_ON(folio_test_writeback(folio));
+ folio_start_writeback(folio);
+ folio_unlock(folio);
/*
- * since any buffer might be the only dirty buffer on the page,
- * the first submit_bh can bring the page out of writeback.
+ * since any buffer might be the only dirty buffer on the folio,
+ * the first submit_bh can bring the folio out of writeback.
* be careful with the buffers.
*/
do {
done:
if (nr == 0) {
/*
- * if this page only had a direct item, it is very possible for
+ * if this folio only had a direct item, it is very possible for
* no io to be required without there being an error. Or,
* someone else could have locked them and sent them down the
- * pipe without locking the page
+ * pipe without locking the folio
*/
bh = head;
do {
bh = bh->b_this_page;
} while (bh != head);
if (!partial)
- SetPageUptodate(page);
- end_page_writeback(page);
+ folio_mark_uptodate(folio);
+ folio_end_writeback(folio);
}
return error;
fail:
/*
* catches various errors, we need to make sure any valid dirty blocks
- * get to the media. The page is currently locked and not marked for
+ * get to the media. The folio is currently locked and not marked for
* writeback
*/
- ClearPageUptodate(page);
+ folio_clear_uptodate(folio);
bh = head;
do {
get_bh(bh);
} else {
/*
* clear any dirty bits that might have come from
- * getting attached to a dirty page
+ * getting attached to a dirty folio
*/
clear_buffer_dirty(bh);
}
bh = bh->b_this_page;
} while (bh != head);
- SetPageError(page);
- BUG_ON(PageWriteback(page));
- set_page_writeback(page);
- unlock_page(page);
+ folio_set_error(folio);
+ BUG_ON(folio_test_writeback(folio));
+ folio_start_writeback(folio);
+ folio_unlock(folio);
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
static int reiserfs_writepage(struct page *page, struct writeback_control *wbc)
{
- struct inode *inode = page->mapping->host;
+ struct folio *folio = page_folio(page);
+ struct inode *inode = folio->mapping->host;
reiserfs_wait_on_write_block(inode->i_sb);
- return reiserfs_write_full_page(page, wbc);
+ return reiserfs_write_full_folio(folio, wbc);
}
static void reiserfs_truncate_failed_write(struct inode *inode)
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
* If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
static unsigned long super_cache_scan(struct shrinker *shrink,
long dentries;
long inodes;
- sb = container_of(shrink, struct super_block, s_shrink);
+ sb = shrink->private_data;
/*
* Deadlock avoidance. We may hold various FS locks, and we don't want
struct super_block *sb;
long total_objects = 0;
- sb = container_of(shrink, struct super_block, s_shrink);
+ sb = shrink->private_data;
/*
* We don't call super_trylock_shared() here as it is a scalability
security_sb_free(s);
put_user_ns(s->s_user_ns);
kfree(s->s_subtype);
- free_prealloced_shrinker(&s->s_shrink);
+ shrinker_free(s->s_shrink);
/* no delays needed */
destroy_super_work(&s->destroy_work);
}
s->s_time_min = TIME64_MIN;
s->s_time_max = TIME64_MAX;
- s->s_shrink.seeks = DEFAULT_SEEKS;
- s->s_shrink.scan_objects = super_cache_scan;
- s->s_shrink.count_objects = super_cache_count;
- s->s_shrink.batch = 1024;
- s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
- if (prealloc_shrinker(&s->s_shrink, "sb-%s", type->name))
+ s->s_shrink = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
+ "sb-%s", type->name);
+ if (!s->s_shrink)
goto fail;
- if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
+
+ s->s_shrink->scan_objects = super_cache_scan;
+ s->s_shrink->count_objects = super_cache_count;
+ s->s_shrink->batch = 1024;
+ s->s_shrink->private_data = s;
+
+ if (list_lru_init_memcg(&s->s_dentry_lru, s->s_shrink))
goto fail;
- if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
+ if (list_lru_init_memcg(&s->s_inode_lru, s->s_shrink))
goto fail;
return s;
{
struct file_system_type *fs = s->s_type;
if (atomic_dec_and_test(&s->s_active)) {
- unregister_shrinker(&s->s_shrink);
+ shrinker_free(s->s_shrink);
fs->kill_sb(s);
kill_super_notify(s);
hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
spin_unlock(&sb_lock);
get_filesystem(s->s_type);
- register_shrinker_prepared(&s->s_shrink);
+ shrinker_register(s->s_shrink);
return s;
share_extant_sb:
hlist_add_head(&s->s_instances, &type->fs_supers);
spin_unlock(&sb_lock);
get_filesystem(type);
- register_shrinker_prepared(&s->s_shrink);
+ shrinker_register(s->s_shrink);
return s;
}
EXPORT_SYMBOL(sget);
#ifdef CONFIG_BLOCK
/*
- * Lock a super block that the callers holds a reference to.
+ * Lock the superblock that is holder of the bdev. Returns the superblock
+ * pointer if we successfully locked the superblock and it is alive. Otherwise
+ * we return NULL and just unlock bdev->bd_holder_lock.
*
- * The caller needs to ensure that the super_block isn't being freed while
- * calling this function, e.g. by holding a lock over the call to this function
- * and the place that clears the pointer to the superblock used by this function
- * before freeing the superblock.
+ * The function must be called with bdev->bd_holder_lock and releases it.
*/
-static bool super_lock_shared_active(struct super_block *sb)
+static struct super_block *bdev_super_lock_shared(struct block_device *bdev)
+ __releases(&bdev->bd_holder_lock)
{
- bool born = super_lock_shared(sb);
+ struct super_block *sb = bdev->bd_holder;
+ bool born;
+
+ lockdep_assert_held(&bdev->bd_holder_lock);
+ lockdep_assert_not_held(&sb->s_umount);
+ lockdep_assert_not_held(&bdev->bd_disk->open_mutex);
+
+ /* Make sure sb doesn't go away from under us */
+ spin_lock(&sb_lock);
+ sb->s_count++;
+ spin_unlock(&sb_lock);
+ mutex_unlock(&bdev->bd_holder_lock);
+ born = super_lock_shared(sb);
if (!born || !sb->s_root || !(sb->s_flags & SB_ACTIVE)) {
super_unlock_shared(sb);
- return false;
+ put_super(sb);
+ return NULL;
}
- return true;
+ /*
+ * The superblock is active and we hold s_umount, we can drop our
+ * temporary reference now.
+ */
+ put_super(sb);
+ return sb;
}
static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
{
- struct super_block *sb = bdev->bd_holder;
-
- /* bd_holder_lock ensures that the sb isn't freed */
- lockdep_assert_held(&bdev->bd_holder_lock);
+ struct super_block *sb;
- if (!super_lock_shared_active(sb))
+ sb = bdev_super_lock_shared(bdev);
+ if (!sb)
return;
if (!surprise)
static void fs_bdev_sync(struct block_device *bdev)
{
- struct super_block *sb = bdev->bd_holder;
-
- lockdep_assert_held(&bdev->bd_holder_lock);
+ struct super_block *sb;
- if (!super_lock_shared_active(sb))
+ sb = bdev_super_lock_shared(bdev);
+ if (!sb)
return;
sync_filesystem(sb);
super_unlock_shared(sb);
struct fs_context *fc)
{
blk_mode_t mode = sb_open_mode(sb_flags);
+ struct bdev_handle *bdev_handle;
struct block_device *bdev;
- bdev = blkdev_get_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
- if (IS_ERR(bdev)) {
+ bdev_handle = bdev_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
+ if (IS_ERR(bdev_handle)) {
if (fc)
errorf(fc, "%s: Can't open blockdev", fc->source);
- return PTR_ERR(bdev);
+ return PTR_ERR(bdev_handle);
}
+ bdev = bdev_handle->bdev;
/*
* This really should be in blkdev_get_by_dev, but right now can't due
* writable from userspace even for a read-only block device.
*/
if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
- blkdev_put(bdev, sb);
+ bdev_release(bdev_handle);
return -EACCES;
}
mutex_unlock(&bdev->bd_fsfreeze_mutex);
if (fc)
warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
- blkdev_put(bdev, sb);
+ bdev_release(bdev_handle);
return -EBUSY;
}
spin_lock(&sb_lock);
+ sb->s_bdev_handle = bdev_handle;
sb->s_bdev = bdev;
sb->s_bdi = bdi_get(bdev->bd_disk->bdi);
if (bdev_stable_writes(bdev))
mutex_unlock(&bdev->bd_fsfreeze_mutex);
snprintf(sb->s_id, sizeof(sb->s_id), "%pg", bdev);
- shrinker_debugfs_rename(&sb->s_shrink, "sb-%s:%s", sb->s_type->name,
+ shrinker_debugfs_rename(sb->s_shrink, "sb-%s:%s", sb->s_type->name,
sb->s_id);
sb_set_blocksize(sb, block_size(bdev));
return 0;
generic_shutdown_super(sb);
if (bdev) {
sync_blockdev(bdev);
- blkdev_put(bdev, sb);
+ bdev_release(sb->s_bdev_handle);
}
}
static struct kmem_cache *ubifs_inode_slab;
/* UBIFS TNC shrinker description */
- static struct shrinker ubifs_shrinker_info = {
- .scan_objects = ubifs_shrink_scan,
- .count_objects = ubifs_shrink_count,
- .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *ubifs_shrinker_info;
/**
* validate_inode - validate inode.
set_nlink(inode, le32_to_cpu(ino->nlink));
i_uid_write(inode, le32_to_cpu(ino->uid));
i_gid_write(inode, le32_to_cpu(ino->gid));
- inode->i_atime.tv_sec = (int64_t)le64_to_cpu(ino->atime_sec);
- inode->i_atime.tv_nsec = le32_to_cpu(ino->atime_nsec);
- inode->i_mtime.tv_sec = (int64_t)le64_to_cpu(ino->mtime_sec);
- inode->i_mtime.tv_nsec = le32_to_cpu(ino->mtime_nsec);
+ inode_set_atime(inode, (int64_t)le64_to_cpu(ino->atime_sec),
+ le32_to_cpu(ino->atime_nsec));
+ inode_set_mtime(inode, (int64_t)le64_to_cpu(ino->mtime_sec),
+ le32_to_cpu(ino->mtime_nsec));
inode_set_ctime(inode, (int64_t)le64_to_cpu(ino->ctime_sec),
le32_to_cpu(ino->ctime_nsec));
inode->i_mode = le32_to_cpu(ino->mode);
static int __init ubifs_init(void)
{
- int err;
+ int err = -ENOMEM;
BUILD_BUG_ON(sizeof(struct ubifs_ch) != 24);
if (!ubifs_inode_slab)
return -ENOMEM;
- err = register_shrinker(&ubifs_shrinker_info, "ubifs-slab");
- if (err)
+ ubifs_shrinker_info = shrinker_alloc(0, "ubifs-slab");
+ if (!ubifs_shrinker_info)
goto out_slab;
+ ubifs_shrinker_info->count_objects = ubifs_shrink_count;
+ ubifs_shrinker_info->scan_objects = ubifs_shrink_scan;
+
+ shrinker_register(ubifs_shrinker_info);
+
err = ubifs_compressors_init();
if (err)
goto out_shrinker;
dbg_debugfs_exit();
ubifs_compressors_exit();
out_shrinker:
- unregister_shrinker(&ubifs_shrinker_info);
+ shrinker_free(ubifs_shrinker_info);
out_slab:
kmem_cache_destroy(ubifs_inode_slab);
return err;
dbg_debugfs_exit();
ubifs_sysfs_exit();
ubifs_compressors_exit();
- unregister_shrinker(&ubifs_shrinker_info);
+ shrinker_free(ubifs_shrinker_info);
/*
* Make sure all delayed rcu free inodes are flushed before we
i_gid_write(inode, ufs_get_inode_gid(sb, ufs_inode));
inode->i_size = fs64_to_cpu(sb, ufs_inode->ui_size);
- inode->i_atime.tv_sec = (signed)fs32_to_cpu(sb, ufs_inode->ui_atime.tv_sec);
+ inode_set_atime(inode,
+ (signed)fs32_to_cpu(sb, ufs_inode->ui_atime.tv_sec),
+ 0);
inode_set_ctime(inode,
(signed)fs32_to_cpu(sb, ufs_inode->ui_ctime.tv_sec),
0);
- inode->i_mtime.tv_sec = (signed)fs32_to_cpu(sb, ufs_inode->ui_mtime.tv_sec);
- inode->i_mtime.tv_nsec = 0;
- inode->i_atime.tv_nsec = 0;
+ inode_set_mtime(inode,
+ (signed)fs32_to_cpu(sb, ufs_inode->ui_mtime.tv_sec),
+ 0);
inode->i_blocks = fs32_to_cpu(sb, ufs_inode->ui_blocks);
inode->i_generation = fs32_to_cpu(sb, ufs_inode->ui_gen);
ufsi->i_flags = fs32_to_cpu(sb, ufs_inode->ui_flags);
i_gid_write(inode, fs32_to_cpu(sb, ufs2_inode->ui_gid));
inode->i_size = fs64_to_cpu(sb, ufs2_inode->ui_size);
- inode->i_atime.tv_sec = fs64_to_cpu(sb, ufs2_inode->ui_atime);
+ inode_set_atime(inode, fs64_to_cpu(sb, ufs2_inode->ui_atime),
+ fs32_to_cpu(sb, ufs2_inode->ui_atimensec));
inode_set_ctime(inode, fs64_to_cpu(sb, ufs2_inode->ui_ctime),
fs32_to_cpu(sb, ufs2_inode->ui_ctimensec));
- inode->i_mtime.tv_sec = fs64_to_cpu(sb, ufs2_inode->ui_mtime);
- inode->i_atime.tv_nsec = fs32_to_cpu(sb, ufs2_inode->ui_atimensec);
- inode->i_mtime.tv_nsec = fs32_to_cpu(sb, ufs2_inode->ui_mtimensec);
+ inode_set_mtime(inode, fs64_to_cpu(sb, ufs2_inode->ui_mtime),
+ fs32_to_cpu(sb, ufs2_inode->ui_mtimensec));
inode->i_blocks = fs64_to_cpu(sb, ufs2_inode->ui_blocks);
inode->i_generation = fs32_to_cpu(sb, ufs2_inode->ui_gen);
ufsi->i_flags = fs32_to_cpu(sb, ufs2_inode->ui_flags);
ufs_set_inode_gid(sb, ufs_inode, i_gid_read(inode));
ufs_inode->ui_size = cpu_to_fs64(sb, inode->i_size);
- ufs_inode->ui_atime.tv_sec = cpu_to_fs32(sb, inode->i_atime.tv_sec);
+ ufs_inode->ui_atime.tv_sec = cpu_to_fs32(sb,
+ inode_get_atime_sec(inode));
ufs_inode->ui_atime.tv_usec = 0;
ufs_inode->ui_ctime.tv_sec = cpu_to_fs32(sb,
- inode_get_ctime(inode).tv_sec);
+ inode_get_ctime_sec(inode));
ufs_inode->ui_ctime.tv_usec = 0;
- ufs_inode->ui_mtime.tv_sec = cpu_to_fs32(sb, inode->i_mtime.tv_sec);
+ ufs_inode->ui_mtime.tv_sec = cpu_to_fs32(sb,
+ inode_get_mtime_sec(inode));
ufs_inode->ui_mtime.tv_usec = 0;
ufs_inode->ui_blocks = cpu_to_fs32(sb, inode->i_blocks);
ufs_inode->ui_flags = cpu_to_fs32(sb, ufsi->i_flags);
ufs_inode->ui_gid = cpu_to_fs32(sb, i_gid_read(inode));
ufs_inode->ui_size = cpu_to_fs64(sb, inode->i_size);
- ufs_inode->ui_atime = cpu_to_fs64(sb, inode->i_atime.tv_sec);
- ufs_inode->ui_atimensec = cpu_to_fs32(sb, inode->i_atime.tv_nsec);
- ufs_inode->ui_ctime = cpu_to_fs64(sb, inode_get_ctime(inode).tv_sec);
+ ufs_inode->ui_atime = cpu_to_fs64(sb, inode_get_atime_sec(inode));
+ ufs_inode->ui_atimensec = cpu_to_fs32(sb,
+ inode_get_atime_nsec(inode));
+ ufs_inode->ui_ctime = cpu_to_fs64(sb, inode_get_ctime_sec(inode));
ufs_inode->ui_ctimensec = cpu_to_fs32(sb,
- inode_get_ctime(inode).tv_nsec);
- ufs_inode->ui_mtime = cpu_to_fs64(sb, inode->i_mtime.tv_sec);
- ufs_inode->ui_mtimensec = cpu_to_fs32(sb, inode->i_mtime.tv_nsec);
+ inode_get_ctime_nsec(inode));
+ ufs_inode->ui_mtime = cpu_to_fs64(sb, inode_get_mtime_sec(inode));
+ ufs_inode->ui_mtimensec = cpu_to_fs32(sb,
+ inode_get_mtime_nsec(inode));
ufs_inode->ui_blocks = cpu_to_fs64(sb, inode->i_blocks);
ufs_inode->ui_flags = cpu_to_fs32(sb, ufsi->i_flags);
struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi;
unsigned i, end;
sector_t lastfrag;
- struct page *lastpage;
+ struct folio *folio;
struct buffer_head *bh;
u64 phys64;
lastfrag--;
- lastpage = ufs_get_locked_page(mapping, lastfrag >>
+ folio = ufs_get_locked_folio(mapping, lastfrag >>
(PAGE_SHIFT - inode->i_blkbits));
- if (IS_ERR(lastpage)) {
- err = -EIO;
- goto out;
- }
-
- end = lastfrag & ((1 << (PAGE_SHIFT - inode->i_blkbits)) - 1);
- bh = page_buffers(lastpage);
- for (i = 0; i < end; ++i)
- bh = bh->b_this_page;
+ if (IS_ERR(folio)) {
+ err = -EIO;
+ goto out;
+ }
+ end = lastfrag & ((1 << (PAGE_SHIFT - inode->i_blkbits)) - 1);
+ bh = folio_buffers(folio);
+ for (i = 0; i < end; ++i)
+ bh = bh->b_this_page;
err = ufs_getfrag_block(inode, lastfrag, bh, 1);
*/
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
- set_page_dirty(lastpage);
+ folio_mark_dirty(folio);
}
if (lastfrag >= UFS_IND_FRAGMENT) {
}
}
out_unlock:
- ufs_put_locked_page(lastpage);
+ ufs_put_locked_folio(folio);
out:
return err;
}
truncate_setsize(inode, size);
ufs_truncate_blocks(inode);
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
mark_inode_dirty(inode);
out:
UFSD("EXIT: err %d\n", err);
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_buftarg *btp = container_of(shrink,
- struct xfs_buftarg, bt_shrinker);
+ struct xfs_buftarg *btp = shrink->private_data;
LIST_HEAD(dispose);
unsigned long freed;
struct shrinker *shrink,
struct shrink_control *sc)
{
- struct xfs_buftarg *btp = container_of(shrink,
- struct xfs_buftarg, bt_shrinker);
+ struct xfs_buftarg *btp = shrink->private_data;
return list_lru_shrink_count(&btp->bt_lru, sc);
}
xfs_free_buftarg(
struct xfs_buftarg *btp)
{
- unregister_shrinker(&btp->bt_shrinker);
- struct block_device *bdev = btp->bt_bdev;
-
+ shrinker_free(btp->bt_shrinker);
ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
percpu_counter_destroy(&btp->bt_io_count);
list_lru_destroy(&btp->bt_lru);
fs_put_dax(btp->bt_daxdev, btp->bt_mount);
/* the main block device is closed by kill_block_super */
- if (bdev != btp->bt_mount->m_super->s_bdev)
- blkdev_put(bdev, btp->bt_mount->m_super);
+ if (btp->bt_bdev != btp->bt_mount->m_super->s_bdev)
+ bdev_release(btp->bt_bdev_handle);
kmem_free(btp);
}
*/
STATIC int
xfs_setsize_buftarg_early(
- xfs_buftarg_t *btp,
- struct block_device *bdev)
+ xfs_buftarg_t *btp)
{
- return xfs_setsize_buftarg(btp, bdev_logical_block_size(bdev));
+ return xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev));
}
struct xfs_buftarg *
xfs_alloc_buftarg(
struct xfs_mount *mp,
- struct block_device *bdev)
+ struct bdev_handle *bdev_handle)
{
xfs_buftarg_t *btp;
const struct dax_holder_operations *ops = NULL;
btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
btp->bt_mount = mp;
- btp->bt_dev = bdev->bd_dev;
- btp->bt_bdev = bdev;
- btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
+ btp->bt_bdev_handle = bdev_handle;
+ btp->bt_dev = bdev_handle->bdev->bd_dev;
+ btp->bt_bdev = bdev_handle->bdev;
+ btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
mp, ops);
/*
ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
DEFAULT_RATELIMIT_BURST);
- if (xfs_setsize_buftarg_early(btp, bdev))
+ if (xfs_setsize_buftarg_early(btp))
goto error_free;
if (list_lru_init(&btp->bt_lru))
if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
goto error_lru;
- btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
- btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
- btp->bt_shrinker.seeks = DEFAULT_SEEKS;
- btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
- if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s",
- mp->m_super->s_id))
+ btp->bt_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "xfs-buf:%s",
+ mp->m_super->s_id);
+ if (!btp->bt_shrinker)
goto error_pcpu;
+
+ btp->bt_shrinker->count_objects = xfs_buftarg_shrink_count;
+ btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
+ btp->bt_shrinker->private_data = btp;
+
+ shrinker_register(btp->bt_shrinker);
+
return btp;
error_pcpu:
*/
typedef struct xfs_buftarg {
dev_t bt_dev;
+ struct bdev_handle *bt_bdev_handle;
struct block_device *bt_bdev;
struct dax_device *bt_daxdev;
u64 bt_dax_part_off;
size_t bt_logical_sectormask;
/* LRU control structures */
- struct shrinker bt_shrinker;
+ struct shrinker *bt_shrinker;
struct list_lru bt_lru;
struct percpu_counter bt_io_count;
* Handling of buftargs.
*/
struct xfs_buftarg *xfs_alloc_buftarg(struct xfs_mount *mp,
- struct block_device *bdev);
+ struct bdev_handle *bdev_handle);
extern void xfs_free_buftarg(struct xfs_buftarg *);
extern void xfs_buftarg_wait(struct xfs_buftarg *);
extern void xfs_buftarg_drain(struct xfs_buftarg *);
* Enable recursive subtree protection
*/
CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
+
+ /*
+ * Enable hugetlb accounting for the memory controller.
+ */
+ CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
};
/* cftype->flags */
* Lists running through all tasks using this cgroup group.
* mg_tasks lists tasks which belong to this cset but are in the
* process of being migrated out or in. Protected by
- * css_set_rwsem, but, during migration, once tasks are moved to
+ * css_set_lock, but, during migration, once tasks are moved to
* mg_tasks, it can be read safely while holding cgroup_mutex.
*/
struct list_head tasks;
struct seq_file;
struct workqueue_struct;
struct iov_iter;
-struct fscrypt_info;
+struct fscrypt_inode_info;
struct fscrypt_operations;
struct fsverity_info;
struct fsverity_operations;
* It is also used to block modification of page cache contents through
* memory mappings.
* @gfp_mask: Memory allocation flags to use for allocating pages.
- * @i_mmap_writable: Number of VM_SHARED mappings.
+ * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
* @nr_thps: Number of THPs in the pagecache (non-shmem only).
* @i_mmap: Tree of private and shared mappings.
* @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
/*
* Might pages of this file have been modified in userspace?
- * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap
+ * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
* marks vma as VM_SHARED if it is shared, and the file was opened for
* writing i.e. vma may be mprotected writable even if now readonly.
*
};
dev_t i_rdev;
loff_t i_size;
- struct timespec64 i_atime;
- struct timespec64 i_mtime;
+ struct timespec64 __i_atime;
+ struct timespec64 __i_mtime;
struct timespec64 __i_ctime; /* use inode_*_ctime accessors! */
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
#endif
#ifdef CONFIG_FS_ENCRYPTION
- struct fscrypt_info *i_crypt_info;
+ struct fscrypt_inode_info *i_crypt_info;
#endif
#ifdef CONFIG_FS_VERITY
atomic_long_inc(&f->f_count);
return f;
}
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+
+struct file *get_file_rcu(struct file __rcu **f);
+struct file *get_file_active(struct file **f);
+
#define file_count(x) atomic_long_read(&(x)->f_count)
#define MAX_NON_LFS ((1UL<<31) - 1)
#define SB_NOATIME BIT(10) /* Do not update access times. */
#define SB_NODIRATIME BIT(11) /* Do not update directory access times */
#define SB_SILENT BIT(15)
-#define SB_POSIXACL BIT(16) /* VFS does not apply the umask */
+#define SB_POSIXACL BIT(16) /* Supports POSIX ACLs */
#define SB_INLINECRYPT BIT(17) /* Use blk-crypto for encrypted files */
#define SB_KERNMOUNT BIT(22) /* this is a kern_mount call */
#define SB_I_VERSION BIT(23) /* Update inode I_version field */
#define SB_I_PERSB_BDI 0x00000200 /* has a per-sb bdi */
#define SB_I_TS_EXPIRY_WARNED 0x00000400 /* warned about timestamp range expiry */
#define SB_I_RETIRED 0x00000800 /* superblock shouldn't be reused */
+#define SB_I_NOUMASK 0x00001000 /* VFS does not apply umask */
/* Possible states of 'frozen' field */
enum {
#ifdef CONFIG_SECURITY
void *s_security;
#endif
- const struct xattr_handler **s_xattr;
+ const struct xattr_handler * const *s_xattr;
#ifdef CONFIG_FS_ENCRYPTION
const struct fscrypt_operations *s_cop;
struct fscrypt_keyring *s_master_keys; /* master crypto keys in use */
struct hlist_bl_head s_roots; /* alternate root dentries for NFS */
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
struct block_device *s_bdev;
+ struct bdev_handle *s_bdev_handle;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct hlist_node s_instances;
const struct dentry_operations *s_d_op; /* default d_op for dentries */
- struct shrinker s_shrink; /* per-sb shrinker handle */
+ struct shrinker *s_shrink; /* per-sb shrinker handle */
/* Number of inodes with nlink == 0 but still referenced */
atomic_long_t s_remove_count;
struct timespec64 current_time(struct inode *inode);
struct timespec64 inode_set_ctime_current(struct inode *inode);
-/**
- * inode_get_ctime - fetch the current ctime from the inode
- * @inode: inode from which to fetch ctime
- *
- * Grab the current ctime from the inode and return it.
- */
+static inline time64_t inode_get_atime_sec(const struct inode *inode)
+{
+ return inode->__i_atime.tv_sec;
+}
+
+static inline long inode_get_atime_nsec(const struct inode *inode)
+{
+ return inode->__i_atime.tv_nsec;
+}
+
+static inline struct timespec64 inode_get_atime(const struct inode *inode)
+{
+ return inode->__i_atime;
+}
+
+static inline struct timespec64 inode_set_atime_to_ts(struct inode *inode,
+ struct timespec64 ts)
+{
+ inode->__i_atime = ts;
+ return ts;
+}
+
+static inline struct timespec64 inode_set_atime(struct inode *inode,
+ time64_t sec, long nsec)
+{
+ struct timespec64 ts = { .tv_sec = sec,
+ .tv_nsec = nsec };
+ return inode_set_atime_to_ts(inode, ts);
+}
+
+static inline time64_t inode_get_mtime_sec(const struct inode *inode)
+{
+ return inode->__i_mtime.tv_sec;
+}
+
+static inline long inode_get_mtime_nsec(const struct inode *inode)
+{
+ return inode->__i_mtime.tv_nsec;
+}
+
+static inline struct timespec64 inode_get_mtime(const struct inode *inode)
+{
+ return inode->__i_mtime;
+}
+
+static inline struct timespec64 inode_set_mtime_to_ts(struct inode *inode,
+ struct timespec64 ts)
+{
+ inode->__i_mtime = ts;
+ return ts;
+}
+
+static inline struct timespec64 inode_set_mtime(struct inode *inode,
+ time64_t sec, long nsec)
+{
+ struct timespec64 ts = { .tv_sec = sec,
+ .tv_nsec = nsec };
+ return inode_set_mtime_to_ts(inode, ts);
+}
+
+static inline time64_t inode_get_ctime_sec(const struct inode *inode)
+{
+ return inode->__i_ctime.tv_sec;
+}
+
+static inline long inode_get_ctime_nsec(const struct inode *inode)
+{
+ return inode->__i_ctime.tv_nsec;
+}
+
static inline struct timespec64 inode_get_ctime(const struct inode *inode)
{
return inode->__i_ctime;
}
-/**
- * inode_set_ctime_to_ts - set the ctime in the inode
- * @inode: inode in which to set the ctime
- * @ts: value to set in the ctime field
- *
- * Set the ctime in @inode to @ts
- */
static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
struct timespec64 ts)
{
return inode_set_ctime_to_ts(inode, ts);
}
+struct timespec64 simple_inode_init_ts(struct inode *inode);
+
/*
* Snapshotting support.
*/
#define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA)
#define IS_APPEND(inode) ((inode)->i_flags & S_APPEND)
#define IS_IMMUTABLE(inode) ((inode)->i_flags & S_IMMUTABLE)
+
+#ifdef CONFIG_FS_POSIX_ACL
#define IS_POSIXACL(inode) __IS_FLG(inode, SB_POSIXACL)
+#else
+#define IS_POSIXACL(inode) 0
+#endif
#define IS_DEADDIR(inode) ((inode)->i_flags & S_DEAD)
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
struct filename {
const char *name; /* pointer to actual string */
const __user char *uptr; /* original userland pointer */
- int refcnt;
+ atomic_t refcnt;
struct audit_names *aname;
const char iname[];
};
static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
-static inline struct mnt_idmap *file_mnt_idmap(struct file *file)
+static inline struct mnt_idmap *file_mnt_idmap(const struct file *file)
{
return mnt_idmap(file->f_path.mnt);
}
const struct cred *creds);
struct file *dentry_create(const struct path *path, int flags, umode_t mode,
const struct cred *cred);
-struct file *backing_file_open(const struct path *path, int flags,
+struct file *backing_file_open(const struct path *user_path, int flags,
const struct path *real_path,
const struct cred *cred);
-struct path *backing_file_real_path(struct file *f);
+struct path *backing_file_user_path(struct file *f);
/*
- * file_real_path - get the path corresponding to f_inode
+ * file_user_path - get the path to display for memory mapped file
*
- * When opening a backing file for a stackable filesystem (e.g.,
- * overlayfs) f_path may be on the stackable filesystem and f_inode on
- * the underlying filesystem. When the path associated with f_inode is
- * needed, this helper should be used instead of accessing f_path
- * directly.
-*/
-static inline const struct path *file_real_path(struct file *f)
+ * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
+ * stored in ->vm_file is a backing file whose f_inode is on the underlying
+ * filesystem. When the mapped file path is displayed to user (e.g. via
+ * /proc/<pid>/maps), this helper should be used to get the path to display
+ * to the user, which is the path of the fd that user has requested to map.
+ */
+static inline const struct path *file_user_path(struct file *f)
{
if (unlikely(f->f_mode & FMODE_BACKING))
- return backing_file_real_path(f);
+ return backing_file_user_path(f);
return &f->f_path;
}
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
# define VM_GROWSUP VM_ARCH_1
-#elif defined(CONFIG_IA64)
-# define VM_GROWSUP VM_ARCH_1
#elif defined(CONFIG_SPARC64)
# define VM_SPARC_ADI VM_ARCH_1 /* Uses ADI tag for access control */
# define VM_ARCH_CLEAR VM_SPARC_ADI
* policy.
*/
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
- unsigned long addr);
+ unsigned long addr, pgoff_t *ilx);
#endif
/*
* Called by vm_normal_page() for special PTEs to find the
return vma->vm_flags & VM_ACCESS_FLAGS;
}
+ static inline bool is_shared_maywrite(vm_flags_t vm_flags)
+ {
+ return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
+ (VM_SHARED | VM_MAYWRITE);
+ }
+
+ static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
+ {
+ return is_shared_maywrite(vma->vm_flags);
+ }
+
static inline
struct vm_area_struct *vma_find(struct vma_iterator *vmi, unsigned long max)
{
struct page *page, unsigned int nr, unsigned long addr);
vm_fault_t finish_fault(struct vm_fault *vmf);
- vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
/*
#define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
- static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
+ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
{
- return xchg(&page->_last_cpupid, cpupid & LAST_CPUPID_MASK);
+ return xchg(&folio->_last_cpupid, cpupid & LAST_CPUPID_MASK);
}
- static inline int page_cpupid_last(struct page *page)
+ static inline int folio_last_cpupid(struct folio *folio)
{
- return page->_last_cpupid;
+ return folio->_last_cpupid;
}
static inline void page_cpupid_reset_last(struct page *page)
{
page->_last_cpupid = -1 & LAST_CPUPID_MASK;
}
#else
- static inline int page_cpupid_last(struct page *page)
+ static inline int folio_last_cpupid(struct folio *folio)
{
- return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+ return (folio->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
}
- extern int page_cpupid_xchg_last(struct page *page, int cpupid);
+ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
static inline void page_cpupid_reset_last(struct page *page)
{
}
#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
- static inline int xchg_page_access_time(struct page *page, int time)
+ static inline int folio_xchg_access_time(struct folio *folio, int time)
{
int last_time;
- last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
+ last_time = folio_xchg_last_cpupid(folio,
+ time >> PAGE_ACCESS_TIME_BUCKETS);
return last_time << PAGE_ACCESS_TIME_BUCKETS;
}
unsigned int pid_bit;
pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
- if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
- __set_bit(pid_bit, &vma->numab_state->access_pids[1]);
+ if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
+ __set_bit(pid_bit, &vma->numab_state->pids_active[1]);
}
}
#else /* !CONFIG_NUMA_BALANCING */
- static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
+ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
{
- return page_to_nid(page); /* XXX */
+ return folio_nid(folio); /* XXX */
}
- static inline int xchg_page_access_time(struct page *page, int time)
+ static inline int folio_xchg_access_time(struct folio *folio, int time)
{
return 0;
}
- static inline int page_cpupid_last(struct page *page)
+ static inline int folio_last_cpupid(struct folio *folio)
{
- return page_to_nid(page); /* XXX */
+ return folio_nid(folio); /* XXX */
}
static inline int cpupid_to_nid(int cpupid)
pte_t pte);
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
+ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
void *buf, int len, unsigned int gup_flags);
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
- extern int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
- void *buf, int len, unsigned int gup_flags);
long get_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
int *locked);
+ /*
+ * Retrieves a single page alongside its VMA. Does not support FOLL_NOWAIT.
+ */
static inline struct page *get_user_page_vma_remote(struct mm_struct *mm,
unsigned long addr,
int gup_flags,
{
struct page *page;
struct vm_area_struct *vma;
- int got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);
+ int got;
+
+ if (WARN_ON_ONCE(unlikely(gup_flags & FOLL_NOWAIT)))
+ return ERR_PTR(-EINVAL);
+
+ got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);
if (got < 0)
return ERR_PTR(got);
- if (got == 0)
- return NULL;
vma = vma_lookup(mm, addr);
if (WARN_ON_ONCE(!vma)) {
extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
- bool need_rmap_locks);
+ bool need_rmap_locks, bool for_stack);
/*
* Flags used by change_protection(). For now we make it a bitmap so
*maxrss = hiwater_rss;
}
- #if defined(SPLIT_RSS_COUNTING)
- void sync_mm_rss(struct mm_struct *mm);
- #else
- static inline void sync_mm_rss(struct mm_struct *mm)
- {
- }
- #endif
-
#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
static inline int pte_special(pte_t pte)
{
return ptl;
}
+ static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
+ {
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ __folio_set_pgtable(folio);
+ lruvec_stat_add_folio(folio, NR_PAGETABLE);
+ }
+
+ static inline void pagetable_pud_dtor(struct ptdesc *ptdesc)
+ {
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ __folio_clear_pgtable(folio);
+ lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+ }
+
extern void __init pagecache_init(void);
extern void free_initmem(void);
struct vm_area_struct *next);
extern int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, pgoff_t pgoff);
- extern struct vm_area_struct *vma_merge(struct vma_iterator *vmi,
- struct mm_struct *, struct vm_area_struct *prev, unsigned long addr,
- unsigned long end, unsigned long vm_flags, struct anon_vma *,
- struct file *, pgoff_t, struct mempolicy *, struct vm_userfaultfd_ctx,
- struct anon_vma_name *);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
- extern int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *,
- unsigned long addr, int new_below);
- extern int split_vma(struct vma_iterator *vmi, struct vm_area_struct *,
- unsigned long addr, int new_below);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern void unlink_file_vma(struct vm_area_struct *);
extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
unsigned long addr, unsigned long len, pgoff_t pgoff,
bool *need_rmap_locks);
extern void exit_mmap(struct mm_struct *);
+ struct vm_area_struct *vma_modify(struct vma_iterator *vmi,
+ struct vm_area_struct *prev,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long vm_flags,
+ struct mempolicy *policy,
+ struct vm_userfaultfd_ctx uffd_ctx,
+ struct anon_vma_name *anon_name);
+
+ /* We are about to modify the VMA's flags. */
+ static inline struct vm_area_struct
+ *vma_modify_flags(struct vma_iterator *vmi,
+ struct vm_area_struct *prev,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long new_flags)
+ {
+ return vma_modify(vmi, prev, vma, start, end, new_flags,
+ vma_policy(vma), vma->vm_userfaultfd_ctx,
+ anon_vma_name(vma));
+ }
+
+ /* We are about to modify the VMA's flags and/or anon_name. */
+ static inline struct vm_area_struct
+ *vma_modify_flags_name(struct vma_iterator *vmi,
+ struct vm_area_struct *prev,
+ struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end,
+ unsigned long new_flags,
+ struct anon_vma_name *new_name)
+ {
+ return vma_modify(vmi, prev, vma, start, end, new_flags,
+ vma_policy(vma), vma->vm_userfaultfd_ctx, new_name);
+ }
+
+ /* We are about to modify the VMA's memory policy. */
+ static inline struct vm_area_struct
+ *vma_modify_policy(struct vma_iterator *vmi,
+ struct vm_area_struct *prev,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct mempolicy *new_pol)
+ {
+ return vma_modify(vmi, prev, vma, start, end, vma->vm_flags,
+ new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+ }
+
+ /* We are about to modify the VMA's flags and/or uffd context. */
+ static inline struct vm_area_struct
+ *vma_modify_flags_uffd(struct vma_iterator *vmi,
+ struct vm_area_struct *prev,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long new_flags,
+ struct vm_userfaultfd_ctx new_ctx)
+ {
+ return vma_modify(vmi, prev, vma, start, end, new_flags,
+ vma_policy(vma), new_ctx, anon_vma_name(vma));
+ }
static inline int check_data_rlimit(unsigned long rlim,
unsigned long new,
static inline void mm_populate(unsigned long addr, unsigned long len) {}
#endif
-/* These take the mm semaphore themselves */
-extern int __must_check vm_brk(unsigned long, unsigned long);
+/* This takes the mm semaphore itself */
extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
extern int vm_munmap(unsigned long, size_t);
extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
#endif
/**
- * seal_check_future_write - Check for F_SEAL_FUTURE_WRITE flag and handle it
+ * seal_check_write - Check for F_SEAL_WRITE or F_SEAL_FUTURE_WRITE flags and
+ * handle them.
* @seals: the seals to check
* @vma: the vma to operate on
*
- * Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on
- * the vma flags. Return 0 if check pass, or <0 for errors.
+ * Check whether F_SEAL_WRITE or F_SEAL_FUTURE_WRITE are set; if so, do proper
+ * check/handling on the vma flags. Return 0 if check pass, or <0 for errors.
*/
- static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
+ static inline int seal_check_write(int seals, struct vm_area_struct *vma)
{
- if (seals & F_SEAL_FUTURE_WRITE) {
+ if (seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
/*
* New PROT_WRITE and MAP_SHARED mmaps are not allowed when
- * "future write" seal active.
+ * write seals are active.
*/
if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
return -EPERM;
/*
- * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
+ * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as
* MAP_SHARED and read-only, take care to not allow mprotect to
* revert protections on such mappings. Do this only for shared
* mappings. For private mappings, don't need to mask
#endif
+ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
+ {
+ phys_addr_t paddr = pfn << PAGE_SHIFT;
+
+ return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
+ }
+
#endif /* _LINUX_MM_H */
struct page_pool *pp;
unsigned long _pp_mapping_pad;
unsigned long dma_addr;
- union {
- /**
- * dma_addr_upper: might require a 64-bit
- * value on 32-bit architectures.
- */
- unsigned long dma_addr_upper;
- /**
- * For frag page support, not supported in
- * 32-bit architectures with 64-bit DMA.
- */
- atomic_long_t pp_frag_count;
- };
+ atomic_long_t pp_frag_count;
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
+ #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+ int _last_cpupid;
+ #endif
+
#ifdef CONFIG_KMSAN
/*
* KMSAN metadata for this page:
struct page *kmsan_shadow;
struct page *kmsan_origin;
#endif
-
- #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
- int _last_cpupid;
- #endif
} _struct_page_alignment;
/*
* @_refcount: Do not access this member directly. Use folio_ref_count()
* to find how many references there are to this folio.
* @memcg_data: Memory Control Group data.
+ * @virtual: Virtual address in the kernel direct map.
+ * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
* @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
* @_nr_pages_mapped: Do not use directly, call folio_mapcount().
* @_pincount: Do not use directly, call folio_maybe_dma_pinned().
atomic_t _refcount;
#ifdef CONFIG_MEMCG
unsigned long memcg_data;
+ #endif
+ #if defined(WANT_PAGE_VIRTUAL)
+ void *virtual;
+ #endif
+ #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+ int _last_cpupid;
#endif
/* private: the union with struct page is transitional */
};
#ifdef CONFIG_MEMCG
FOLIO_MATCH(memcg_data, memcg_data);
#endif
+ #if defined(WANT_PAGE_VIRTUAL)
+ FOLIO_MATCH(virtual, virtual);
+ #endif
+ #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+ FOLIO_MATCH(_last_cpupid, _last_cpupid);
+ #endif
#undef FOLIO_MATCH
#define FOLIO_MATCH(pg, fl) \
static_assert(offsetof(struct folio, fl) == \
char name[];
};
+ #ifdef CONFIG_ANON_VMA_NAME
+ /*
+ * mmap_lock should be read-locked when calling anon_vma_name(). Caller should
+ * either keep holding the lock while using the returned pointer or it should
+ * raise anon_vma_name refcount before releasing the lock.
+ */
+ struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma);
+ struct anon_vma_name *anon_vma_name_alloc(const char *name);
+ void anon_vma_name_free(struct kref *kref);
+ #else /* CONFIG_ANON_VMA_NAME */
+ static inline struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
+ {
+ return NULL;
+ }
+
+ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
+ {
+ return NULL;
+ }
+ #endif
+
struct vma_lock {
struct rw_semaphore lock;
};
struct vma_numab_state {
+ /*
+ * Initialised as time in 'jiffies' after which VMA
+ * should be scanned. Delays first scan of new VMA by at
+ * least sysctl_numa_balancing_scan_delay:
+ */
unsigned long next_scan;
- unsigned long next_pid_reset;
- unsigned long access_pids[2];
+
+ /*
+ * Time in jiffies when pids_active[] is reset to
+ * detect phase change behaviour:
+ */
+ unsigned long pids_active_reset;
+
+ /*
+ * Approximate tracking of PIDs that trapped a NUMA hinting
+ * fault. May produce false positives due to hash collisions.
+ *
+ * [0] Previous PID tracking
+ * [1] Current PID tracking
+ *
+ * Window moves after next_pid_reset has expired approximately
+ * every VMA_PID_RESET_PERIOD jiffies:
+ */
+ unsigned long pids_active[2];
+
+ /*
+ * MM scan sequence ID when the VMA was last completely scanned.
+ * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
+ */
+ int prev_scan_seq;
};
/*
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
+ #ifdef CONFIG_NUMA
+ #define vma_policy(vma) ((vma)->vm_policy)
+ #else
+ #define vma_policy(vma) NULL
+ #endif
+
#ifdef CONFIG_SCHED_MM_CID
struct mm_cid {
u64 time;
struct root_domain;
struct rq;
struct sched_attr;
-struct sched_param;
struct seq_file;
struct sighand_struct;
struct signal_struct;
extern struct mutex sched_domains_mutex;
#endif
+struct sched_param {
+ int sched_priority;
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
#endif
unsigned int __state;
-#ifdef CONFIG_PREEMPT_RT
/* saved state for "spinlock sleepers" */
unsigned int saved_state;
-#endif
/*
* This begins the randomizable portion of task_struct. Only
struct mm_struct *mm;
struct mm_struct *active_mm;
+ struct address_space *faults_disabled_mapping;
int exit_state;
int exit_code;
* ->sched_remote_wakeup gets used, so it can be in this word.
*/
unsigned sched_remote_wakeup:1;
+#ifdef CONFIG_RT_MUTEXES
+ unsigned sched_rt_mutex:1;
+#endif
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
struct mem_cgroup *active_memcg;
#endif
+ #ifdef CONFIG_MEMCG_KMEM
+ struct obj_cgroup *objcg;
+ #endif
+
#ifdef CONFIG_BLK_CGROUP
struct gendisk *throttle_disk;
#endif
#define TNF_FAULT_LOCAL 0x08
#define TNF_MIGRATE_FAIL 0x10
+enum numa_vmaskip_reason {
+ NUMAB_SKIP_UNSUITABLE,
+ NUMAB_SKIP_SHARED_RO,
+ NUMAB_SKIP_INACCESSIBLE,
+ NUMAB_SKIP_SCAN_DELAY,
+ NUMAB_SKIP_PID_INACTIVE,
+ NUMAB_SKIP_IGNORE_PID,
+ NUMAB_SKIP_SEQ_COMPLETED,
+};
+
#ifdef CONFIG_NUMA_BALANCING
extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p, bool final);
- extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
- int src_nid, int dst_cpu);
+ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
+ int src_nid, int dst_cpu);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
{
}
static inline bool should_numa_migrate_memory(struct task_struct *p,
- struct page *page, int src_nid, int dst_cpu)
+ struct folio *folio, int src_nid, int dst_cpu)
{
return true;
}
static u16 have_release_callback __read_mostly;
static u16 have_canfork_callback __read_mostly;
+static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
+
/* cgroup namespace for init task */
struct cgroup_namespace init_cgroup_ns = {
.ns.count = REFCOUNT_INIT(2),
cgroup_root_count--;
}
- cgroup_favor_dynmods(root, false);
+ if (!have_favordynmods)
+ cgroup_favor_dynmods(root, false);
+
cgroup_exit_root_id(root);
cgroup_unlock();
if (!css->ss) {
if (cgroup_on_dfl(cgrp)) {
- ret = cgroup_addrm_files(&cgrp->self, cgrp,
+ ret = cgroup_addrm_files(css, cgrp,
cgroup_base_files, true);
if (ret < 0)
return ret;
if (cgroup_psi_enabled()) {
- ret = cgroup_addrm_files(&cgrp->self, cgrp,
+ ret = cgroup_addrm_files(css, cgrp,
cgroup_psi_files, true);
if (ret < 0)
return ret;
}
} else {
- cgroup_addrm_files(css, cgrp,
- cgroup1_base_files, true);
+ ret = cgroup_addrm_files(css, cgrp,
+ cgroup1_base_files, true);
+ if (ret < 0)
+ return ret;
}
} else {
list_for_each_entry(cfts, &css->ss->cfts, node) {
Opt_favordynmods,
Opt_memory_localevents,
Opt_memory_recursiveprot,
+ Opt_memory_hugetlb_accounting,
nr__cgroup2_params
};
fsparam_flag("favordynmods", Opt_favordynmods),
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
+ fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
{}
};
case Opt_memory_recursiveprot:
ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
return 0;
+ case Opt_memory_hugetlb_accounting:
+ ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+ return 0;
}
return -EINVAL;
}
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+
+ if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+ cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+ else
+ cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
}
}
seq_puts(seq, ",memory_localevents");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
seq_puts(seq, ",memory_recursiveprot");
+ if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+ seq_puts(seq, ",memory_hugetlb_accounting");
return 0;
}
fc->user_ns = get_user_ns(ctx->ns->user_ns);
fc->global = true;
-#ifdef CONFIG_CGROUP_FAVOR_DYNMODS
- ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
-#endif
+ if (have_favordynmods)
+ ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+
return 0;
}
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
struct css_task_iter *it)
{
+ unsigned long irqflags;
+
memset(it, 0, sizeof(*it));
- spin_lock_irq(&css_set_lock);
+ spin_lock_irqsave(&css_set_lock, irqflags);
it->ss = css->ss;
it->flags = flags;
css_task_iter_advance(it);
- spin_unlock_irq(&css_set_lock);
+ spin_unlock_irqrestore(&css_set_lock, irqflags);
}
/**
*/
struct task_struct *css_task_iter_next(struct css_task_iter *it)
{
+ unsigned long irqflags;
+
if (it->cur_task) {
put_task_struct(it->cur_task);
it->cur_task = NULL;
}
- spin_lock_irq(&css_set_lock);
+ spin_lock_irqsave(&css_set_lock, irqflags);
/* @it may be half-advanced by skips, finish advancing */
if (it->flags & CSS_TASK_ITER_SKIPPED)
css_task_iter_advance(it);
}
- spin_unlock_irq(&css_set_lock);
+ spin_unlock_irqrestore(&css_set_lock, irqflags);
return it->cur_task;
}
*/
void css_task_iter_end(struct css_task_iter *it)
{
+ unsigned long irqflags;
+
if (it->cur_cset) {
- spin_lock_irq(&css_set_lock);
+ spin_lock_irqsave(&css_set_lock, irqflags);
list_del(&it->iters_node);
put_css_set_locked(it->cur_cset);
- spin_unlock_irq(&css_set_lock);
+ spin_unlock_irqrestore(&css_set_lock, irqflags);
}
if (it->cur_dcset)
if (cgroup1_ssid_disabled(ssid))
pr_info("Disabling %s control group subsystem in v1 mounts\n",
- ss->name);
+ ss->legacy_name);
cgrp_dfl_root.subsys_mask |= 1 << ss->id;
}
__setup("cgroup_debug", enable_cgroup_debug);
+static int __init cgroup_favordynmods_setup(char *str)
+{
+ return (kstrtobool(str, &have_favordynmods) == 0);
+}
+__setup("cgroup_favordynmods=", cgroup_favordynmods_setup);
+
/**
* css_tryget_online_from_dir - get corresponding css from a cgroup dentry
* @dentry: directory dentry of interest
"nsdelegate\n"
"favordynmods\n"
"memory_localevents\n"
- "memory_recursiveprot\n");
+ "memory_recursiveprot\n"
+ "memory_hugetlb_accounting\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
#include <asm/unistd.h>
#include <asm/mmu_context.h>
+#include "exit.h"
+
/*
* The default value should be high enough to not crash a system that randomly
* crashes its kernel from time to time, but low enough to at least not permit
exit_mm_release(current, mm);
if (!mm)
return;
- sync_mm_rss(mm);
mmap_read_lock(mm);
mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
io_uring_files_cancel();
exit_signals(tsk); /* sets PF_EXITING */
- /* sync mm's RSS info before statistics gathering */
- if (tsk->mm)
- sync_mm_rss(tsk->mm);
acct_update_integrals(tsk);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
return 0;
}
-struct waitid_info {
- pid_t pid;
- uid_t uid;
- int status;
- int cause;
-};
-
-struct wait_opts {
- enum pid_type wo_type;
- int wo_flags;
- struct pid *wo_pid;
-
- struct waitid_info *wo_info;
- int wo_stat;
- struct rusage *wo_rusage;
-
- wait_queue_entry_t child_wait;
- int notask_error;
-};
-
static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
{
return wo->wo_type == PIDTYPE_MAX ||
return 0;
}
+bool pid_child_should_wake(struct wait_opts *wo, struct task_struct *p)
+{
+ if (!eligible_pid(wo, p))
+ return false;
+
+ if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private != p->parent)
+ return false;
+
+ return true;
+}
+
static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode,
int sync, void *key)
{
child_wait);
struct task_struct *p = key;
- if (!eligible_pid(wo, p))
- return 0;
+ if (pid_child_should_wake(wo, p))
+ return default_wake_function(wait, mode, sync, key);
- if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent)
- return 0;
-
- return default_wake_function(wait, mode, sync, key);
+ return 0;
}
void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
return 0;
}
-static long do_wait(struct wait_opts *wo)
+long __do_wait(struct wait_opts *wo)
{
- int retval;
-
- trace_sched_process_wait(wo->wo_pid);
+ long retval;
- init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
- wo->child_wait.private = current;
- add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait);
-repeat:
/*
* If there is nothing that can match our criteria, just get out.
* We will clear ->notask_error to zero if we see any child that
(!wo->wo_pid || !pid_has_task(wo->wo_pid, wo->wo_type)))
goto notask;
- set_current_state(TASK_INTERRUPTIBLE);
read_lock(&tasklist_lock);
if (wo->wo_type == PIDTYPE_PID) {
retval = do_wait_pid(wo);
if (retval)
- goto end;
+ return retval;
} else {
struct task_struct *tsk = current;
do {
retval = do_wait_thread(wo, tsk);
if (retval)
- goto end;
+ return retval;
retval = ptrace_do_wait(wo, tsk);
if (retval)
- goto end;
+ return retval;
if (wo->wo_flags & __WNOTHREAD)
break;
notask:
retval = wo->notask_error;
- if (!retval && !(wo->wo_flags & WNOHANG)) {
- retval = -ERESTARTSYS;
- if (!signal_pending(current)) {
- schedule();
- goto repeat;
- }
- }
-end:
+ if (!retval && !(wo->wo_flags & WNOHANG))
+ return -ERESTARTSYS;
+
+ return retval;
+}
+
+static long do_wait(struct wait_opts *wo)
+{
+ int retval;
+
+ trace_sched_process_wait(wo->wo_pid);
+
+ init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
+ wo->child_wait.private = current;
+ add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait);
+
+ do {
+ set_current_state(TASK_INTERRUPTIBLE);
+ retval = __do_wait(wo);
+ if (retval != -ERESTARTSYS)
+ break;
+ if (signal_pending(current))
+ break;
+ schedule();
+ } while (1);
+
__set_current_state(TASK_RUNNING);
remove_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait);
return retval;
}
-static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
- int options, struct rusage *ru)
+int kernel_waitid_prepare(struct wait_opts *wo, int which, pid_t upid,
+ struct waitid_info *infop, int options,
+ struct rusage *ru)
{
- struct wait_opts wo;
+ unsigned int f_flags = 0;
struct pid *pid = NULL;
enum pid_type type;
- long ret;
- unsigned int f_flags = 0;
if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED|
__WNOTHREAD|__WCLONE|__WALL))
return -EINVAL;
}
- wo.wo_type = type;
- wo.wo_pid = pid;
- wo.wo_flags = options;
- wo.wo_info = infop;
- wo.wo_rusage = ru;
+ wo->wo_type = type;
+ wo->wo_pid = pid;
+ wo->wo_flags = options;
+ wo->wo_info = infop;
+ wo->wo_rusage = ru;
if (f_flags & O_NONBLOCK)
- wo.wo_flags |= WNOHANG;
+ wo->wo_flags |= WNOHANG;
+
+ return 0;
+}
+
+static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
+ int options, struct rusage *ru)
+{
+ struct wait_opts wo;
+ long ret;
+
+ ret = kernel_waitid_prepare(&wo, which, upid, infop, options, ru);
+ if (ret)
+ return ret;
ret = do_wait(&wo);
- if (!ret && !(options & WNOHANG) && (f_flags & O_NONBLOCK))
+ if (!ret && !(options & WNOHANG) && (wo.wo_flags & WNOHANG))
ret = -EAGAIN;
- put_pid(pid);
+ put_pid(wo.wo_pid);
return ret;
}
get_file(file);
i_mmap_lock_write(mapping);
- if (tmp->vm_flags & VM_SHARED)
+ if (vma_is_shared_maywrite(tmp))
mapping_allow_writable(mapping);
flush_dcache_mmap_lock(mapping);
/* insert tmp into the share list, just after mpnt */
hugetlb_count_init(mm);
if (current->mm) {
- mm->flags = current->mm->flags & MMF_INIT_MASK;
+ mm->flags = mmf_init_flags(current->mm->flags);
mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
} else {
mm->flags = default_dump_filter;
/**
* set_mm_exe_file - change a reference to the mm's executable file
+ * @mm: The mm to change.
+ * @new_exe_file: The new file to use.
*
* This changes mm's executable file (shown as symlink /proc/[pid]/exe).
*
/**
* replace_mm_exe_file - replace a reference to the mm's executable file
+ * @mm: The mm to change.
+ * @new_exe_file: The new file to use.
*
* This changes mm's executable file (shown as symlink /proc/[pid]/exe).
*
/**
* get_mm_exe_file - acquire a reference to the mm's executable file
+ * @mm: The mm of interest.
*
* Returns %NULL if mm has no associated executable file.
* User must release file via fput().
struct file *exe_file;
rcu_read_lock();
- exe_file = rcu_dereference(mm->exe_file);
- if (exe_file && !get_file_rcu(exe_file))
- exe_file = NULL;
+ exe_file = get_file_rcu(&mm->exe_file);
rcu_read_unlock();
return exe_file;
}
/**
* get_task_exe_file - acquire a reference to the task's executable file
+ * @task: The task.
*
* Returns %NULL if task's mm (if any) has no associated executable file or
* this is a kernel thread with borrowed mm (see the comment above get_task_mm).
/**
* get_task_mm - acquire a reference to the task's mm
+ * @task: The task.
*
* Returns %NULL if the task has no mm. Checks PF_KTHREAD (meaning
* this kernel workthread has transiently adopted a user mm with use_mm,
* __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
* @pid: the struct pid for which to create a pidfd
* @flags: flags of the new @pidfd
- * @pidfd: the pidfd to return
+ * @ret: Where to return the file for the pidfd.
*
* Allocate a new file that stashes @pid and reserve a new pidfd number in the
* caller's file descriptor table. The pidfd is reserved but not installed yet.
-
+ *
* The helper doesn't perform checks on @pid which makes it useful for pidfds
* created via CLONE_PIDFD where @pid has no task attached when the pidfd and
* pidfd file are prepared.
* pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
* @pid: the struct pid for which to create a pidfd
* @flags: flags of the new @pidfd
- * @pidfd: the pidfd to return
+ * @ret: Where to return the pidfd.
*
* Allocate a new file that stashes @pid and reserve a new pidfd number in the
* caller's file descriptor table. The pidfd is reserved but not installed yet.
p->io_uring = NULL;
#endif
- #if defined(SPLIT_RSS_COUNTING)
- memset(&p->rss_stat, 0, sizeof(p->rss_stat));
- #endif
-
p->default_timer_slack_ns = current->timer_slack_ns;
#ifdef CONFIG_PSI
if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
return false;
-#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
+#if !defined(CONFIG_STACK_GROWSUP)
kargs->stack += kargs->stack_size;
#endif
}
}
/**
- * clone3 - create a new process with specific properties
+ * sys_clone3 - create a new process with specific properties
* @uargs: argument structure
* @size: size of @uargs
*
#include <linux/bitops.h>
#include <linux/export.h>
#include <linux/completion.h>
+#include <linux/kmemleak.h>
#include <linux/moduleparam.h>
#include <linux/panic.h>
#include <linux/panic_notifier.h>
/* Unregister a counter, with NULL for not caring which. */
void rcu_gp_slow_unregister(atomic_t *rgssp)
{
- WARN_ON_ONCE(rgssp && rgssp != rcu_gp_slow_suppress);
+ WARN_ON_ONCE(rgssp && rgssp != rcu_gp_slow_suppress && rcu_gp_slow_suppress != NULL);
WRITE_ONCE(rcu_gp_slow_suppress, NULL);
}
*/
static void rcu_gp_fqs(bool first_time)
{
+ int nr_fqs = READ_ONCE(rcu_state.nr_fqs_jiffies_stall);
struct rcu_node *rnp = rcu_get_root();
WRITE_ONCE(rcu_state.gp_activity, jiffies);
WRITE_ONCE(rcu_state.n_force_qs, rcu_state.n_force_qs + 1);
+
+ WARN_ON_ONCE(nr_fqs > 3);
+ /* Only countdown nr_fqs for stall purposes if jiffies moves. */
+ if (nr_fqs) {
+ if (nr_fqs == 1) {
+ WRITE_ONCE(rcu_state.jiffies_stall,
+ jiffies + rcu_jiffies_till_stall_check());
+ }
+ WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, --nr_fqs);
+ }
+
if (first_time) {
/* Collect dyntick-idle snapshots. */
force_qs_rnp(dyntick_save_progress_counter);
trace_rcu_invoke_callback(rcu_state.name, rhp);
f = rhp->func;
+ debug_rcu_head_callback(rhp);
WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
f(rhp);
*/
void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func)
{
- return __call_rcu_common(head, func, false);
+ __call_rcu_common(head, func, false);
}
EXPORT_SYMBOL_GPL(call_rcu_hurry);
#endif
*/
void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
- return __call_rcu_common(head, func, IS_ENABLED(CONFIG_RCU_LAZY));
+ __call_rcu_common(head, func, IS_ENABLED(CONFIG_RCU_LAZY));
}
EXPORT_SYMBOL_GPL(call_rcu);
success = true;
}
+ /*
+ * The kvfree_rcu() caller considers the pointer freed at this point
+ * and likely removes any references to it. Since the actual slab
+ * freeing (and kmemleak_free()) is deferred, tell kmemleak to ignore
+ * this object (no scanning or false positives reporting).
+ */
+ kmemleak_ignore(ptr);
+
// Set timer to drain after KFREE_DRAIN_JIFFIES.
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
schedule_delayed_monitor_work(krcp);
return freed == 0 ? SHRINK_STOP : freed;
}
- static struct shrinker kfree_rcu_shrinker = {
- .count_objects = kfree_rcu_shrink_count,
- .scan_objects = kfree_rcu_shrink_scan,
- .batch = 0,
- .seeks = DEFAULT_SEEKS,
- };
-
void __init kfree_rcu_scheduler_running(void)
{
int cpu;
}
EXPORT_SYMBOL_GPL(rcu_barrier);
+static unsigned long rcu_barrier_last_throttle;
+
+/**
+ * rcu_barrier_throttled - Do rcu_barrier(), but limit to one per second
+ *
+ * This can be thought of as guard rails around rcu_barrier() that
+ * permits unrestricted userspace use, at least assuming the hardware's
+ * try_cmpxchg() is robust. There will be at most one call per second to
+ * rcu_barrier() system-wide from use of this function, which means that
+ * callers might needlessly wait a second or three.
+ *
+ * This is intended for use by test suites to avoid OOM by flushing RCU
+ * callbacks from the previous test before starting the next. See the
+ * rcutree.do_rcu_barrier module parameter for more information.
+ *
+ * Why not simply make rcu_barrier() more scalable? That might be
+ * the eventual endpoint, but let's keep it simple for the time being.
+ * Note that the module parameter infrastructure serializes calls to a
+ * given .set() function, but should concurrent .set() invocation ever be
+ * possible, we are ready!
+ */
+static void rcu_barrier_throttled(void)
+{
+ unsigned long j = jiffies;
+ unsigned long old = READ_ONCE(rcu_barrier_last_throttle);
+ unsigned long s = rcu_seq_snap(&rcu_state.barrier_sequence);
+
+ while (time_in_range(j, old, old + HZ / 16) ||
+ !try_cmpxchg(&rcu_barrier_last_throttle, &old, j)) {
+ schedule_timeout_idle(HZ / 16);
+ if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
+ smp_mb(); /* caller's subsequent code after above check. */
+ return;
+ }
+ j = jiffies;
+ old = READ_ONCE(rcu_barrier_last_throttle);
+ }
+ rcu_barrier();
+}
+
+/*
+ * Invoke rcu_barrier_throttled() when a rcutree.do_rcu_barrier
+ * request arrives. We insist on a true value to allow for possible
+ * future expansion.
+ */
+static int param_set_do_rcu_barrier(const char *val, const struct kernel_param *kp)
+{
+ bool b;
+ int ret;
+
+ if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
+ return -EAGAIN;
+ ret = kstrtobool(val, &b);
+ if (!ret && b) {
+ atomic_inc((atomic_t *)kp->arg);
+ rcu_barrier_throttled();
+ atomic_dec((atomic_t *)kp->arg);
+ }
+ return ret;
+}
+
+/*
+ * Output the number of outstanding rcutree.do_rcu_barrier requests.
+ */
+static int param_get_do_rcu_barrier(char *buffer, const struct kernel_param *kp)
+{
+ return sprintf(buffer, "%d\n", atomic_read((atomic_t *)kp->arg));
+}
+
+static const struct kernel_param_ops do_rcu_barrier_ops = {
+ .set = param_set_do_rcu_barrier,
+ .get = param_get_do_rcu_barrier,
+};
+static atomic_t do_rcu_barrier;
+module_param_cb(do_rcu_barrier, &do_rcu_barrier_ops, &do_rcu_barrier, 0644);
+
/*
* Compute the mask of online CPUs for the specified rcu_node structure.
* This will not be stable unless the rcu_node structure's ->lock is
rdp = this_cpu_ptr(&rcu_data);
/*
* Strictly, we care here about the case where the current CPU is
- * in rcu_cpu_starting() and thus has an excuse for rdp->grpmask
+ * in rcutree_report_cpu_starting() and thus has an excuse for rdp->grpmask
* not being up to date. So arch_spin_is_locked() might have a
* false positive if it's held by some *other* CPU, but that's
* OK because that just means a false *negative* on the warning.
return !!rcu_state.n_online_cpus;
}
-/*
- * Near the end of the offline process. Trace the fact that this CPU
- * is going offline.
- */
-int rcutree_dying_cpu(unsigned int cpu)
-{
- bool blkd;
- struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
- struct rcu_node *rnp = rdp->mynode;
-
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
- return 0;
-
- blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
- trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
- blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
- return 0;
-}
-
/*
* All CPUs for the specified rcu_node structure have gone offline,
* and all tasks that were preempted within an RCU read-side critical
}
}
-/*
- * The CPU has been completely removed, and some other CPU is reporting
- * this fact from process context. Do the remainder of the cleanup.
- * There can only be one CPU hotplug operation at a time, so no need for
- * explicit locking.
- */
-int rcutree_dead_cpu(unsigned int cpu)
-{
- if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
- return 0;
-
- WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
- // Stop-machine done, so allow nohz_full to disable tick.
- tick_dep_clear(TICK_DEP_BIT_RCU);
- return 0;
-}
-
/*
* Propagate ->qsinitmask bits up the rcu_node tree to account for the
* first CPU in a given leaf rcu_node structure coming online. The caller
return 0;
}
-/*
- * Near the beginning of the process. The CPU is still very much alive
- * with pretty much all services enabled.
- */
-int rcutree_offline_cpu(unsigned int cpu)
-{
- unsigned long flags;
- struct rcu_data *rdp;
- struct rcu_node *rnp;
-
- rdp = per_cpu_ptr(&rcu_data, cpu);
- rnp = rdp->mynode;
- raw_spin_lock_irqsave_rcu_node(rnp, flags);
- rnp->ffmask &= ~rdp->grpmask;
- raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-
- rcutree_affinity_setting(cpu, cpu);
-
- // nohz_full CPUs need the tick for stop-machine to work quickly
- tick_dep_set(TICK_DEP_BIT_RCU);
- return 0;
-}
-
/*
* Mark the specified CPU as being online so that subsequent grace periods
* (both expedited and normal) will wait on it. Note that this means that
* from the incoming CPU rather than from the cpuhp_step mechanism.
* This is because this function must be invoked at a precise location.
* This incoming CPU must not have enabled interrupts yet.
+ *
+ * This mirrors the effects of rcutree_report_cpu_dead().
*/
-void rcu_cpu_starting(unsigned int cpu)
+void rcutree_report_cpu_starting(unsigned int cpu)
{
unsigned long mask;
struct rcu_data *rdp;
* Note that this function is special in that it is invoked directly
* from the outgoing CPU rather than from the cpuhp_step mechanism.
* This is because this function must be invoked at a precise location.
+ *
+ * This mirrors the effect of rcutree_report_cpu_starting().
*/
-void rcu_report_dead(unsigned int cpu)
+void rcutree_report_cpu_dead(void)
{
- unsigned long flags, seq_flags;
+ unsigned long flags;
unsigned long mask;
- struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+ struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+ /*
+ * IRQS must be disabled from now on and until the CPU dies, or an interrupt
+ * may introduce a new READ-side while it is actually off the QS masks.
+ */
+ lockdep_assert_irqs_disabled();
// Do any dangling deferred wakeups.
do_nocb_deferred_wakeup(rdp);
/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
mask = rdp->grpmask;
- local_irq_save(seq_flags);
arch_spin_lock(&rcu_state.ofl_lock);
raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq);
WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
arch_spin_unlock(&rcu_state.ofl_lock);
- local_irq_restore(seq_flags);
-
rdp->cpu_started = false;
}
cpu, rcu_segcblist_n_cbs(&rdp->cblist),
rcu_segcblist_first_cb(&rdp->cblist));
}
-#endif
+
+/*
+ * The CPU has been completely removed, and some other CPU is reporting
+ * this fact from process context. Do the remainder of the cleanup.
+ * There can only be one CPU hotplug operation at a time, so no need for
+ * explicit locking.
+ */
+int rcutree_dead_cpu(unsigned int cpu)
+{
+ WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
+ // Stop-machine done, so allow nohz_full to disable tick.
+ tick_dep_clear(TICK_DEP_BIT_RCU);
+ return 0;
+}
+
+/*
+ * Near the end of the offline process. Trace the fact that this CPU
+ * is going offline.
+ */
+int rcutree_dying_cpu(unsigned int cpu)
+{
+ bool blkd;
+ struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+ struct rcu_node *rnp = rdp->mynode;
+
+ blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
+ trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
+ blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
+ return 0;
+}
+
+/*
+ * Near the beginning of the process. The CPU is still very much alive
+ * with pretty much all services enabled.
+ */
+int rcutree_offline_cpu(unsigned int cpu)
+{
+ unsigned long flags;
+ struct rcu_data *rdp;
+ struct rcu_node *rnp;
+
+ rdp = per_cpu_ptr(&rcu_data, cpu);
+ rnp = rdp->mynode;
+ raw_spin_lock_irqsave_rcu_node(rnp, flags);
+ rnp->ffmask &= ~rdp->grpmask;
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+
+ rcutree_affinity_setting(cpu, cpu);
+
+ // nohz_full CPUs need the tick for stop-machine to work quickly
+ tick_dep_set(TICK_DEP_BIT_RCU);
+ return 0;
+}
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
/*
* On non-huge systems, use expedited RCU grace periods to make suspend
{
int cpu;
int i, j;
+ struct shrinker *kfree_rcu_shrinker;
/* Clamp it to [0:100] seconds interval. */
if (rcu_delay_page_cache_fill_msec < 0 ||
INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
krcp->initialized = true;
}
- if (register_shrinker(&kfree_rcu_shrinker, "rcu-kfree"))
- pr_err("Failed to register kfree_rcu() shrinker!\n");
+
+ kfree_rcu_shrinker = shrinker_alloc(0, "rcu-kfree");
+ if (!kfree_rcu_shrinker) {
+ pr_err("Failed to allocate kfree_rcu() shrinker!\n");
+ return;
+ }
+
+ kfree_rcu_shrinker->count_objects = kfree_rcu_shrink_count;
+ kfree_rcu_shrinker->scan_objects = kfree_rcu_shrink_scan;
+
+ shrinker_register(kfree_rcu_shrinker);
}
void __init rcu_init(void)
pm_notifier(rcu_pm_notify, 0);
WARN_ON(num_online_cpus() > 1); // Only one CPU this early in boot.
rcutree_prepare_cpu(cpu);
- rcu_cpu_starting(cpu);
+ rcutree_report_cpu_starting(cpu);
rcutree_online_cpu(cpu);
/* Create workqueue for Tree SRCU and for expedited GPs. */
#include <asm/switch_to.h>
-#include <linux/sched/cond_resched.h>
-
#include "sched.h"
#include "stats.h"
#include "autogroup.h"
unsigned int sysctl_sched_base_slice = 750000ULL;
static unsigned int normalized_sysctl_sched_base_slice = 750000ULL;
-/*
- * After fork, child runs first. If set to 0 (default) then
- * parent will (try to) run first.
- */
-unsigned int sysctl_sched_child_runs_first __read_mostly;
-
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
int sched_thermal_decay_shift;
#ifdef CONFIG_SYSCTL
static struct ctl_table sched_fair_sysctls[] = {
- {
- .procname = "sched_child_runs_first",
- .data = &sysctl_sched_child_runs_first,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
#ifdef CONFIG_CFS_BANDWIDTH
{
.procname = "sched_cfs_bandwidth_slice_us",
cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
}
+/*
+ * Specifically: avg_runtime() + 0 must result in entity_eligible() := true
+ * For this to be so, the result of this function must have a left bias.
+ */
u64 avg_vruntime(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
load += weight;
}
- if (load)
+ if (load) {
+ /* sign flips effective floor / ceil */
+ if (avg < 0)
+ avg -= (load - 1);
avg = div_s64(avg, load);
+ }
return cfs_rq->min_vruntime + avg;
}
*
* Which allows an EDF like search on (sub)trees.
*/
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
+ struct sched_entity *best_left = NULL;
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
+ best = curr;
/*
* Once selected, run a task until it either becomes non-eligible or
}
/*
- * If this entity has an earlier deadline than the previous
- * best, take this one. If it also has the earliest deadline
- * of its subtree, we're done.
+ * Now we heap search eligible trees for the best (min_)deadline
*/
- if (!best || deadline_gt(deadline, best, se)) {
+ if (!best || deadline_gt(deadline, best, se))
best = se;
- if (best->deadline == best->min_deadline)
- break;
- }
/*
- * If the earlest deadline in this subtree is in the fully
- * eligible left half of our space, go there.
+ * Every se in a left branch is eligible, keep track of the
+ * branch with the best min_deadline
*/
+ if (node->rb_left) {
+ struct sched_entity *left = __node_2_se(node->rb_left);
+
+ if (!best_left || deadline_gt(min_deadline, best_left, left))
+ best_left = left;
+
+ /*
+ * min_deadline is in the left branch. rb_left and all
+ * descendants are eligible, so immediately switch to the second
+ * loop.
+ */
+ if (left->min_deadline == se->min_deadline)
+ break;
+ }
+
+ /* min_deadline is at this node, no need to look right */
+ if (se->deadline == se->min_deadline)
+ break;
+
+ /* else min_deadline is in the right branch. */
+ node = node->rb_right;
+ }
+
+ /*
+ * We ran into an eligible node which is itself the best.
+ * (Or nr_running == 0 and both are NULL)
+ */
+ if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+ return best;
+
+ /*
+ * Now best_left and all of its children are eligible, and we are just
+ * looking for deadline == min_deadline
+ */
+ node = &best_left->run_node;
+ while (node) {
+ struct sched_entity *se = __node_2_se(node);
+
+ /* min_deadline is the current node */
+ if (se->deadline == se->min_deadline)
+ return se;
+
+ /* min_deadline is in the left branch */
if (node->rb_left &&
__node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
node = node->rb_left;
continue;
}
+ /* else min_deadline is in the right branch */
node = node->rb_right;
}
+ return NULL;
+}
- if (!best || (curr && deadline_gt(deadline, best, curr)))
- best = curr;
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se = __pick_eevdf(cfs_rq);
- if (unlikely(!best)) {
+ if (!se) {
struct sched_entity *left = __pick_first_entity(cfs_rq);
if (left) {
pr_err("EEVDF scheduling fail, picking leftmost\n");
}
}
- return best;
+ return se;
}
#ifdef CONFIG_SCHED_DEBUG
* The smaller the hint page fault latency, the higher the possibility
* for the page to be hot.
*/
- static int numa_hint_fault_latency(struct page *page)
+ static int numa_hint_fault_latency(struct folio *folio)
{
int last_time, time;
time = jiffies_to_msecs(jiffies);
- last_time = xchg_page_access_time(page, time);
+ last_time = folio_xchg_access_time(folio, time);
return (time - last_time) & PAGE_ACCESS_TIME_MASK;
}
}
}
- bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
+ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu)
{
struct numa_group *ng = deref_curr_numa_group(p);
numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
th = pgdat->nbp_threshold ? : def_th;
- latency = numa_hint_fault_latency(page);
+ latency = numa_hint_fault_latency(folio);
if (latency >= th)
return false;
return !numa_promotion_rate_limit(pgdat, rate_limit,
- thp_nr_pages(page));
+ folio_nr_pages(folio));
}
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
- last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+ last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
!node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
}
/* Cannot migrate task to CPU-less node */
- if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
- int near_nid = max_nid;
- int distance, near_distance = INT_MAX;
-
- for_each_node_state(nid, N_CPU) {
- distance = node_distance(max_nid, nid);
- if (distance < near_distance) {
- near_nid = nid;
- near_distance = distance;
- }
- }
- max_nid = near_nid;
- }
+ max_nid = numa_nearest_node(max_nid, N_CPU);
if (ng) {
numa_group_count_active_nodes(ng);
p->mm->numa_scan_offset = 0;
}
-static bool vma_is_accessed(struct vm_area_struct *vma)
+static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
{
unsigned long pids;
/*
if (READ_ONCE(current->mm->numa_scan_seq) < 2)
return true;
- pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
- return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+ pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
+ if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+ return true;
+
+ /*
+ * Complete a scan that has already started regardless of PID access, or
+ * some VMAs may never be scanned in multi-threaded applications:
+ */
+ if (mm->numa_scan_offset > vma->vm_start) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_IGNORE_PID);
+ return true;
+ }
+
+ return false;
}
#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
unsigned long nr_pte_updates = 0;
long pages, virtpages;
struct vma_iterator vmi;
+ bool vma_pids_skipped;
+ bool vma_pids_forced = false;
SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
*/
p->node_stamp += 2 * TICK_NSEC;
- start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
virtpages = pages * 8; /* Scan up to this much virtual space */
if (!mmap_read_trylock(mm))
return;
+
+ /*
+ * VMAs are skipped if the current PID has not trapped a fault within
+ * the VMA recently. Allow scanning to be forced if there is no
+ * suitable VMA remaining.
+ */
+ vma_pids_skipped = false;
+
+retry_pids:
+ start = mm->numa_scan_offset;
vma_iter_init(&vmi, mm, start);
vma = vma_next(&vmi);
if (!vma) {
do {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
continue;
}
* as migrating the pages will be of marginal benefit.
*/
if (!vma->vm_mm ||
- (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SHARED_RO);
continue;
+ }
/*
* Skip inaccessible VMAs to avoid any confusion between
* PROT_NONE and NUMA hinting ptes
*/
- if (!vma_is_accessible(vma))
+ if (!vma_is_accessible(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_INACCESSIBLE);
continue;
+ }
/* Initialise new per-VMA NUMAB state. */
if (!vma->numab_state) {
msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
/* Reset happens after 4 times scan delay of scan start */
- vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
+ vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+ /*
+ * Ensure prev_scan_seq does not match numa_scan_seq,
+ * to prevent VMAs being skipped prematurely on the
+ * first scan:
+ */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
}
/*
* delay the scan for new VMAs.
*/
if (mm->numa_scan_seq && time_before(jiffies,
- vma->numab_state->next_scan))
+ vma->numab_state->next_scan)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SCAN_DELAY);
continue;
+ }
+
+ /* RESET access PIDs regularly for old VMAs. */
+ if (mm->numa_scan_seq &&
+ time_after(jiffies, vma->numab_state->pids_active_reset)) {
+ vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
+ msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+ vma->numab_state->pids_active[0] = READ_ONCE(vma->numab_state->pids_active[1]);
+ vma->numab_state->pids_active[1] = 0;
+ }
- /* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma))
+ /* Do not rescan VMAs twice within the same sequence. */
+ if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+ mm->numa_scan_offset = vma->vm_end;
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
continue;
+ }
/*
- * RESET access PIDs regularly for old VMAs. Resetting after checking
- * vma for recent access to avoid clearing PID info before access..
+ * Do not scan the VMA if task has not accessed it, unless no other
+ * VMA candidate exists.
*/
- if (mm->numa_scan_seq &&
- time_after(jiffies, vma->numab_state->next_pid_reset)) {
- vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
- msecs_to_jiffies(VMA_PID_RESET_PERIOD);
- vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
- vma->numab_state->access_pids[1] = 0;
+ if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+ vma_pids_skipped = true;
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
+ continue;
}
do {
cond_resched();
} while (end != vma->vm_end);
+
+ /* VMA scan is complete, do not scan until next sequence. */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+ /*
+ * Only force scan within one VMA at a time, to limit the
+ * cost of scanning a potentially uninteresting VMA.
+ */
+ if (vma_pids_forced)
+ break;
} for_each_vma(vmi, vma);
+ /*
+ * If no VMAs are remaining and VMAs were skipped due to the PID
+ * not accessing the VMA previously, then force a scan to ensure
+ * forward progress:
+ */
+ if (!vma && !vma_pids_forced && vma_pids_skipped) {
+ vma_pids_forced = true;
+ goto retry_pids;
+ }
+
out:
/*
* It is possible to reach the end of the VMA list but the last few
*/
deadline = div_s64(deadline * old_weight, weight);
se->deadline = se->vruntime + deadline;
+ if (se != cfs_rq->curr)
+ min_deadline_cb_propagate(&se->run_node, NULL);
}
#ifdef CONFIG_SMP
*/
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
- long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ long delta;
+ u64 now;
/*
* No need to update load_avg for root_task_group as it is not used.
if (cfs_rq->tg == &root_task_group)
return;
+ /*
+ * For migration heavy workloads, access to tg->load_avg can be
+ * unbound. Limit the update rate to at most once per ms.
+ */
+ now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+ if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+ return;
+
+ delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+ cfs_rq->last_update_tg_load_avg = now;
}
}
return max(task_util(p), _task_util_est(p));
}
-#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned long uclamp_task_util(struct task_struct *p,
- unsigned long uclamp_min,
- unsigned long uclamp_max)
-{
- return clamp(task_util_est(p), uclamp_min, uclamp_max);
-}
-#else
-static inline unsigned long uclamp_task_util(struct task_struct *p,
- unsigned long uclamp_min,
- unsigned long uclamp_max)
-{
- return task_util_est(p);
-}
-#endif
-
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
* To avoid overestimation of actual task utilization, skip updates if
* we cannot grant there is idle time in this CPU.
*/
- if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
+ if (task_util(p) > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
return;
/*
return fits;
/*
- * We must use capacity_orig_of() for comparing against uclamp_min and
+ * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
* uclamp_max. We only care about capacity pressure (by using
* capacity_of()) for comparing against the real util.
*
* If a task is boosted to 1024 for example, we don't want a tiny
* pressure to skew the check whether it fits a CPU or not.
*
- * Similarly if a task is capped to capacity_orig_of(little_cpu), it
+ * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
* should fit a little cpu even if there's some pressure.
*
* Only exception is for thermal pressure since it has a direct impact
* For uclamp_max, we can tolerate a drop in performance level as the
* goal is to cap the task. So it's okay if it's getting less.
*/
- capacity_orig = capacity_orig_of(cpu);
+ capacity_orig = arch_scale_cpu_capacity(cpu);
capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
/*
static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
- return true;
+ return !cfs_rq->nr_running;
}
#define UPDATE_TG 0x0
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
- u64 vslice = calc_delta_fair(se->slice, se);
- u64 vruntime = avg_vruntime(cfs_rq);
+ u64 vslice, vruntime = avg_vruntime(cfs_rq);
s64 lag = 0;
+ se->slice = sysctl_sched_base_slice;
+ vslice = calc_delta_fair(se->slice, se);
+
/*
* Due to how V is constructed as the weighted average of entities,
* adding tasks with positive lag, or removing tasks with negative lag
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+pick_next_entity(struct cfs_rq *cfs_rq)
{
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
{
- struct cfs_rq *local_unthrottle = NULL;
int this_cpu = smp_processor_id();
u64 runtime, remaining = 1;
bool throttled = false;
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq, *tmp;
struct rq_flags rf;
struct rq *rq;
+ LIST_HEAD(local_unthrottle);
rcu_read_lock();
list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
if (!cfs_rq_throttled(cfs_rq))
goto next;
-#ifdef CONFIG_SMP
/* Already queued for async unthrottle */
if (!list_empty(&cfs_rq->throttled_csd_list))
goto next;
-#endif
/* By the above checks, this should never be true */
SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0) {
- if (cpu_of(rq) != this_cpu ||
- SCHED_WARN_ON(local_unthrottle))
+ if (cpu_of(rq) != this_cpu) {
unthrottle_cfs_rq_async(cfs_rq);
- else
- local_unthrottle = cfs_rq;
+ } else {
+ /*
+ * We currently only expect to be unthrottling
+ * a single cfs_rq locally.
+ */
+ SCHED_WARN_ON(!list_empty(&local_unthrottle));
+ list_add_tail(&cfs_rq->throttled_csd_list,
+ &local_unthrottle);
+ }
} else {
throttled = true;
}
next:
rq_unlock_irqrestore(rq, &rf);
}
- rcu_read_unlock();
- if (local_unthrottle) {
- rq = cpu_rq(this_cpu);
+ list_for_each_entry_safe(cfs_rq, tmp, &local_unthrottle,
+ throttled_csd_list) {
+ struct rq *rq = rq_of(cfs_rq);
+
rq_lock_irqsave(rq, &rf);
- if (cfs_rq_throttled(local_unthrottle))
- unthrottle_cfs_rq(local_unthrottle);
+
+ list_del_init(&cfs_rq->throttled_csd_list);
+
+ if (cfs_rq_throttled(cfs_rq))
+ unthrottle_cfs_rq(cfs_rq);
+
rq_unlock_irqrestore(rq, &rf);
}
+ SCHED_WARN_ON(!list_empty(&local_unthrottle));
+
+ rcu_read_unlock();
return throttled;
}
{
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
-#ifdef CONFIG_SMP
INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
-#endif
}
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
struct sched_domain_shared *sd_share;
- struct rq *this_rq = this_rq();
- int this = smp_processor_id();
- struct sched_domain *this_sd = NULL;
- u64 time = 0;
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- if (sched_feat(SIS_PROP) && !has_idle_core) {
- u64 avg_cost, avg_idle, span_avg;
- unsigned long now = jiffies;
-
- this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
- if (!this_sd)
- return -1;
-
- /*
- * If we're busy, the assumption that the last idle period
- * predicts the future is flawed; age away the remaining
- * predicted idle time.
- */
- if (unlikely(this_rq->wake_stamp < now)) {
- while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
- this_rq->wake_stamp++;
- this_rq->wake_avg_idle >>= 1;
- }
- }
-
- avg_idle = this_rq->wake_avg_idle;
- avg_cost = this_sd->avg_scan_cost + 1;
-
- span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
- nr = div_u64(span_avg, avg_cost);
- else
- nr = 4;
-
- time = cpu_clock(this);
- }
-
if (sched_feat(SIS_UTIL)) {
sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
if (sd_share) {
}
}
+ if (static_branch_unlikely(&sched_cluster_active)) {
+ struct sched_group *sg = sd->groups;
+
+ if (sg->flags & SD_CLUSTER) {
+ for_each_cpu_wrap(cpu, sched_group_span(sg), target + 1) {
+ if (!cpumask_test_cpu(cpu, cpus))
+ continue;
+
+ if (has_idle_core) {
+ i = select_idle_core(p, cpu, cpus, &idle_cpu);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ } else {
+ if (--nr <= 0)
+ return -1;
+ idle_cpu = __select_idle_cpu(cpu, p);
+ if ((unsigned int)idle_cpu < nr_cpumask_bits)
+ return idle_cpu;
+ }
+ }
+ cpumask_andnot(cpus, cpus, sched_group_span(sg));
+ }
+ }
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
return i;
} else {
- if (!--nr)
+ if (--nr <= 0)
return -1;
idle_cpu = __select_idle_cpu(cpu, p);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
if (has_idle_core)
set_idle_cores(target, false);
- if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
- time = cpu_clock(this) - time;
-
- /*
- * Account for the scan cost of wakeups against the average
- * idle time.
- */
- this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
-
- update_avg(&this_sd->avg_scan_cost, time);
- }
-
return idle_cpu;
}
* Look for the CPU with best capacity.
*/
else if (fits < 0)
- cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
+ cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
/*
* First, select CPU which fits better (-1 being better than 0).
bool has_idle_core = false;
struct sched_domain *sd;
unsigned long task_util, util_min, util_max;
- int i, recent_used_cpu;
+ int i, recent_used_cpu, prev_aff = -1;
/*
* On asymmetric system, update task utilization because we will check
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_cpu(task_util, util_min, util_max, prev))
- return prev;
+ asym_fits_cpu(task_util, util_min, util_max, prev)) {
+
+ if (!static_branch_unlikely(&sched_cluster_active) ||
+ cpus_share_resources(prev, target))
+ return prev;
+
+ prev_aff = prev;
+ }
/*
* Allow a per-cpu kthread to stack with the wakee if the
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
- return recent_used_cpu;
+
+ if (!static_branch_unlikely(&sched_cluster_active) ||
+ cpus_share_resources(recent_used_cpu, target))
+ return recent_used_cpu;
+
+ } else {
+ recent_used_cpu = -1;
}
/*
if ((unsigned)i < nr_cpumask_bits)
return i;
+ /*
+ * For cluster machines which have lower sharing cache like L2 or
+ * LLC Tag, we tend to find an idle CPU in the target's cluster
+ * first. But prev_cpu or recent_used_cpu may also be a good candidate,
+ * use them if possible when no idle CPU found in select_idle_cpu().
+ */
+ if ((unsigned int)prev_aff < nr_cpumask_bits)
+ return prev_aff;
+ if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
+ return recent_used_cpu;
+
return target;
}
util = max(util, util_est);
}
- return min(util, capacity_orig_of(cpu));
+ return min(util, arch_scale_cpu_capacity(cpu));
}
unsigned long cpu_util_cfs(int cpu)
{
unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
unsigned long busy_time = eenv->pd_busy_time;
+ unsigned long energy;
if (dst_cpu >= 0)
busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
- return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+ energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+
+ trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
+
+ return energy;
}
/*
target = prev_cpu;
sync_entity_load_avg(&p->se);
- if (!uclamp_task_util(p, p_util_min, p_util_max))
+ if (!task_util_est(p) && p_util_min == 0)
goto unlock;
eenv_task_busy_time(&eenv, p, prev_cpu);
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
- unsigned long cur_delta, max_spare_cap = 0;
+ long prev_spare_cap = -1, max_spare_cap = -1;
unsigned long rq_util_min, rq_util_max;
- unsigned long prev_spare_cap = 0;
+ unsigned long cur_delta, base_energy;
int max_spare_cap_cpu = -1;
- unsigned long base_energy;
int fits, max_fits = -1;
cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
prev_spare_cap = cpu_cap;
prev_fits = fits;
} else if ((fits > max_fits) ||
- ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+ ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
}
}
- if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
+ if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
continue;
eenv_pd_busy_time(&eenv, cpus, p);
base_energy = compute_energy(&eenv, pd, cpus, p, -1);
/* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > 0) {
+ if (prev_spare_cap > -1) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
/* CPU utilization has changed */
/*
* Preempt the current task with a newly woken task if needed:
*/
-static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
+static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
/*
* This is possible from callers such as attach_tasks(), in which we
- * unconditionally check_preempt_curr() after an enqueue (which may have
+ * unconditionally wakeup_preempt() after an enqueue (which may have
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
goto again;
}
- se = pick_next_entity(cfs_rq, curr);
+ se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
}
}
- se = pick_next_entity(cfs_rq, curr);
+ se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
put_prev_task(rq, prev);
do {
- se = pick_next_entity(cfs_rq, NULL);
+ se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
WARN_ON_ONCE(task_rq(p) != rq);
activate_task(rq, p, ENQUEUE_NOCLOCK);
- check_preempt_curr(rq, p, 0);
+ wakeup_preempt(rq, p, 0);
}
/*
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
-
if (!capacity)
capacity = 1;
check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
return ((rq->cpu_capacity * sd->imbalance_pct) <
- (rq->cpu_capacity_orig * 100));
+ (arch_scale_cpu_capacity(cpu_of(rq)) * 100));
}
/*
static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
return rq->misfit_task_load &&
- (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
+ (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
check_cpu_capacity(rq, sd));
}
* can only do it if @group is an SMT group and has exactly on busy CPU. Larger
* imbalances in the number of CPUS are dealt with in find_busiest_group().
*
- * If we are balancing load within an SMT core, or at DIE domain level, always
+ * If we are balancing load within an SMT core, or at PKG domain level, always
* proceed.
*
* Return: true if @env::dst_cpu can do with asym_packing load balance. False
busiest->push_cpu = this_cpu;
active_balance = 1;
}
- raw_spin_rq_unlock_irqrestore(busiest, flags);
+ preempt_disable();
+ raw_spin_rq_unlock_irqrestore(busiest, flags);
if (active_balance) {
stop_one_cpu_nowait(cpu_of(busiest),
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
}
+ preempt_enable();
}
} else {
sd->nr_balance_failed = 0;
#ifdef CONFIG_NO_HZ_COMMON
/*
- * idle load balancing details
- * - When one of the busy CPUs notice that there may be an idle rebalancing
+ * NOHZ idle load balancing (ILB) details:
+ *
+ * - When one of the busy CPUs notices that there may be an idle rebalancing
* needed, they will kick the idle load balancer, which then does idle
* load balancing for all the idle CPUs.
- * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED not set
+ *
+ * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED is not set
* anywhere yet.
*/
-
static inline int find_new_ilb(void)
{
- int ilb;
const struct cpumask *hk_mask;
+ int ilb_cpu;
hk_mask = housekeeping_cpumask(HK_TYPE_MISC);
- for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
+ for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
- if (ilb == smp_processor_id())
+ if (ilb_cpu == smp_processor_id())
continue;
- if (idle_cpu(ilb))
- return ilb;
+ if (idle_cpu(ilb_cpu))
+ return ilb_cpu;
}
- return nr_cpu_ids;
+ return -1;
}
/*
- * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
- * idle CPU in the HK_TYPE_MISC housekeeping set (if there is one).
+ * Kick a CPU to do the NOHZ balancing, if it is time for it, via a cross-CPU
+ * SMP function call (IPI).
+ *
+ * We pick the first idle CPU in the HK_TYPE_MISC housekeeping set (if there is one).
*/
static void kick_ilb(unsigned int flags)
{
nohz.next_balance = jiffies+1;
ilb_cpu = find_new_ilb();
-
- if (ilb_cpu >= nr_cpu_ids)
+ if (ilb_cpu < 0)
return;
/*
/*
* This way we generate an IPI on the target CPU which
- * is idle. And the softirq performing nohz idle load balance
+ * is idle, and the softirq performing NOHZ idle load balancing
* will be run before returning from the IPI.
*/
smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
/*
* None are in tickless mode and hence no need for NOHZ idle load
- * balancing.
+ * balancing:
*/
if (likely(!atomic_read(&nohz.nr_cpus)))
return;
sd = rcu_dereference(rq->sd);
if (sd) {
/*
- * If there's a CFS task and the current CPU has reduced
- * capacity; kick the ILB to see if there's a better CPU to run
- * on.
+ * If there's a runnable CFS task and the current CPU has reduced
+ * capacity, kick the ILB to see if there's a better CPU to run on:
*/
if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
if (sds) {
/*
* If there is an imbalance between LLC domains (IOW we could
- * increase the overall cache use), we need some less-loaded LLC
- * domain to pull some load. Likewise, we may need to spread
+ * increase the overall cache utilization), we need a less-loaded LLC
+ * domain to pull some load from. Likewise, we may need to spread
* load within the current LLC domain (e.g. packed SMT cores but
* other CPUs are idle). We can't really know from here how busy
- * the others are - so just get a nohz balance going if it looks
+ * the others are - so just get a NOHZ balance going if it looks
* like this LLC domain has tasks we could move.
*/
nr_busy = atomic_read(&sds->nr_busy_cpus);
}
/*
- * Check if we need to run the ILB for updating blocked load before entering
- * idle state.
+ * Check if we need to directly run the ILB for updating blocked load before
+ * entering idle state. Here we run ILB directly without issuing IPIs.
+ *
+ * Note that when this function is called, the tick may not yet be stopped on
+ * this CPU yet. nohz.idle_cpus_mask is updated only when tick is stopped and
+ * cleared on the next busy tick. In other words, nohz.idle_cpus_mask updates
+ * don't align with CPUs enter/exit idle to avoid bottlenecks due to high idle
+ * entry/exit rate (usec). So it is possible that _nohz_idle_balance() is
+ * called from this function on (this) CPU that's not yet in the mask. That's
+ * OK because the goal of nohz_run_idle_balance() is to run ILB only for
+ * updating the blocked load of already idle CPUs without waking up one of
+ * those idle CPUs and outside the preempt disable / irq off phase of the local
+ * cpu about to enter idle, because it can take a long time.
*/
void nohz_run_idle_balance(int cpu)
{
if (p->prio > oldprio)
resched_curr(rq);
} else
- check_preempt_curr(rq, p, 0);
+ wakeup_preempt(rq, p, 0);
}
#ifdef CONFIG_FAIR_GROUP_SCHED
if (task_current(rq, p))
resched_curr(rq);
else
- check_preempt_curr(rq, p, 0);
+ wakeup_preempt(rq, p, 0);
}
}
.yield_task = yield_task_fair,
.yield_to_task = yield_to_task_fair,
- .check_preempt_curr = check_preempt_wakeup,
+ .wakeup_preempt = check_preempt_wakeup_fair,
.pick_next_task = __pick_next_task_fair,
.put_prev_task = put_prev_task_fair,
* to the last. It would be better if bind would truly restrict
* the allocation to memory nodes instead
*
- * preferred Try a specific node first before normal fallback.
+ * preferred Try a specific node first before normal fallback.
* As a special case NUMA_NO_NODE here means do the allocation
* on the local CPU. This is normally identical to default,
* but useful to set in a VMA when you have a non default
* on systems with highmem kernel lowmem allocation don't get policied.
* Same with GFP_DMA allocations.
*
- * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
+ * For shmem/tmpfs shared memory the policy is shared between
* all users and remembered even when nobody has memory mapped.
*/
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
- #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
+ #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
+ #define MPOL_MF_WRLOCK (MPOL_MF_INTERNAL << 2) /* Write-lock walked vmas */
static struct kmem_cache *policy_cache;
static struct kmem_cache *sn_cache;
static struct mempolicy preferred_node_policy[MAX_NUMNODES];
/**
- * numa_map_to_online_node - Find closest online node
+ * numa_nearest_node - Find nearest node by state
* @node: Node id to start the search
+ * @state: State to filter the search
*
- * Lookup the next closest node by distance if @nid is not online.
+ * Lookup the closest node by distance if @nid is not in state.
*
- * Return: this @node if it is online, otherwise the closest node by distance
+ * Return: this @node if it is in state, otherwise the closest node by distance
*/
-int numa_map_to_online_node(int node)
+int numa_nearest_node(int node, unsigned int state)
{
int min_dist = INT_MAX, dist, n, min_node;
- if (node == NUMA_NO_NODE || node_online(node))
+ if (state >= NR_NODE_STATES)
+ return -EINVAL;
+
+ if (node == NUMA_NO_NODE || node_state(node, state))
return node;
min_node = node;
- for_each_online_node(n) {
+ for_each_node_state(n, state) {
dist = node_distance(node, n);
if (dist < min_dist) {
min_dist = dist;
return min_node;
}
-EXPORT_SYMBOL_GPL(numa_map_to_online_node);
+EXPORT_SYMBOL_GPL(numa_nearest_node);
struct mempolicy *get_task_policy(struct task_struct *p)
{
{
struct mempolicy *policy;
- pr_debug("setting mode %d flags %d nodes[0] %lx\n",
- mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE);
-
if (mode == MPOL_DEFAULT) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
return ERR_PTR(-EINVAL);
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
+
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
if (!policy)
return ERR_PTR(-ENOMEM);
}
/* Slow path of a mpol destructor. */
- void __mpol_put(struct mempolicy *p)
+ void __mpol_put(struct mempolicy *pol)
{
- if (!atomic_dec_and_test(&p->refcnt))
+ if (!atomic_dec_and_test(&pol->refcnt))
return;
- kmem_cache_free(policy_cache, p);
+ kmem_cache_free(policy_cache, pol);
}
static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
*
* Called with task's alloc_lock held.
*/
-
void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
{
mpol_rebind_policy(tsk->mempolicy, new);
*
* Call holding a reference to mm. Takes mm->mmap_lock during call.
*/
-
void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
{
struct vm_area_struct *vma;
},
};
- static int migrate_folio_add(struct folio *folio, struct list_head *foliolist,
+ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
unsigned long flags);
+ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+ pgoff_t ilx, int *nid);
+
+ static bool strictly_unmovable(unsigned long flags)
+ {
+ /*
+ * STRICT without MOVE flags lets do_mbind() fail immediately with -EIO
+ * if any misplaced page is found.
+ */
+ return (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ==
+ MPOL_MF_STRICT;
+ }
+
+ struct migration_mpol { /* for alloc_migration_target_by_mpol() */
+ struct mempolicy *pol;
+ pgoff_t ilx;
+ };
struct queue_pages {
struct list_head *pagelist;
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
- bool has_unmovable;
+ struct folio *large; /* note last large folio encountered */
+ long nr_failed; /* could not be isolated at this time */
};
/*
return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
}
- /*
- * queue_folios_pmd() has three possible return values:
- * 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page, or unmovable page is found
- * but continue walking (indicated by queue_pages.has_unmovable).
- * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
- * existing folio was already on a node that does not follow the
- * policy.
- */
- static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
- __releases(ptl)
+ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
{
- int ret = 0;
struct folio *folio;
struct queue_pages *qp = walk->private;
- unsigned long flags;
if (unlikely(is_pmd_migration_entry(*pmd))) {
- ret = -EIO;
- goto unlock;
+ qp->nr_failed++;
+ return;
}
folio = pfn_folio(pmd_pfn(*pmd));
if (is_huge_zero_page(&folio->page)) {
walk->action = ACTION_CONTINUE;
- goto unlock;
+ return;
}
if (!queue_folio_required(folio, qp))
- goto unlock;
-
- flags = qp->flags;
- /* go to folio migration */
- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- if (!vma_migratable(walk->vma) ||
- migrate_folio_add(folio, qp->pagelist, flags)) {
- qp->has_unmovable = true;
- goto unlock;
- }
- } else
- ret = -EIO;
- unlock:
- spin_unlock(ptl);
- return ret;
+ return;
+ if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+ !vma_migratable(walk->vma) ||
+ !migrate_folio_add(folio, qp->pagelist, qp->flags))
+ qp->nr_failed++;
}
/*
- * Scan through pages checking if pages follow certain conditions,
- * and move them to the pagelist if they do.
+ * Scan through folios, checking if they satisfy the required conditions,
+ * moving them from LRU to local pagelist for migration if they do (or not).
*
- * queue_folios_pte_range() has three possible return values:
- * 0 - folios are placed on the right node or queued successfully, or
- * special page is met, i.e. zero page, or unmovable page is found
- * but continue walking (indicated by queue_pages.has_unmovable).
- * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
- * on a node that does not follow the policy.
+ * queue_folios_pte_range() has two possible return values:
+ * 0 - continue walking to scan for more, even if an existing folio on the
+ * wrong node could not be isolated and queued for migration.
+ * -EIO - only MPOL_MF_STRICT was specified, without MPOL_MF_MOVE or ..._ALL,
+ * and an existing folio was on a node that does not follow the policy.
*/
static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
spinlock_t *ptl;
ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl)
- return queue_folios_pmd(pmd, ptl, addr, end, walk);
+ if (ptl) {
+ queue_folios_pmd(pmd, walk);
+ spin_unlock(ptl);
+ goto out;
+ }
mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
if (!pte) {
}
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = ptep_get(pte);
- if (!pte_present(ptent))
+ if (pte_none(ptent))
+ continue;
+ if (!pte_present(ptent)) {
+ if (is_migration_entry(pte_to_swp_entry(ptent)))
+ qp->nr_failed++;
continue;
+ }
folio = vm_normal_folio(vma, addr, ptent);
if (!folio || folio_is_zone_device(folio))
continue;
continue;
if (!queue_folio_required(folio, qp))
continue;
- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- /*
- * MPOL_MF_STRICT must be specified if we get here.
- * Continue walking vmas due to MPOL_MF_MOVE* flags.
- */
- if (!vma_migratable(vma))
- qp->has_unmovable = true;
-
+ if (folio_test_large(folio)) {
/*
- * Do not abort immediately since there may be
- * temporary off LRU pages in the range. Still
- * need migrate other LRU pages.
+ * A large folio can only be isolated from LRU once,
+ * but may be mapped by many PTEs (and Copy-On-Write may
+ * intersperse PTEs of other, order 0, folios). This is
+ * a common case, so don't mistake it for failure (but
+ * there can be other cases of multi-mapped pages which
+ * this quick check does not help to filter out - and a
+ * search of the pagelist might grow to be prohibitive).
+ *
+ * migrate_pages(&pagelist) returns nr_failed folios, so
+ * check "large" now so that queue_pages_range() returns
+ * a comparable nr_failed folios. This does imply that
+ * if folio could not be isolated for some racy reason
+ * at its first PTE, later PTEs will not give it another
+ * chance of isolation; but keeps the accounting simple.
*/
- if (migrate_folio_add(folio, qp->pagelist, flags))
- qp->has_unmovable = true;
- } else
- break;
+ if (folio == qp->large)
+ continue;
+ qp->large = folio;
+ }
+ if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+ !vma_migratable(vma) ||
+ !migrate_folio_add(folio, qp->pagelist, flags)) {
+ qp->nr_failed++;
+ if (strictly_unmovable(flags))
+ break;
+ }
}
pte_unmap_unlock(mapped_pte, ptl);
cond_resched();
-
- return addr != end ? -EIO : 0;
+ out:
+ if (qp->nr_failed && strictly_unmovable(flags))
+ return -EIO;
+ return 0;
}
static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- int ret = 0;
#ifdef CONFIG_HUGETLB_PAGE
struct queue_pages *qp = walk->private;
- unsigned long flags = (qp->flags & MPOL_MF_VALID);
+ unsigned long flags = qp->flags;
struct folio *folio;
spinlock_t *ptl;
pte_t entry;
ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
entry = huge_ptep_get(pte);
- if (!pte_present(entry))
+ if (!pte_present(entry)) {
+ if (unlikely(is_hugetlb_entry_migration(entry)))
+ qp->nr_failed++;
goto unlock;
+ }
folio = pfn_folio(pte_pfn(entry));
if (!queue_folio_required(folio, qp))
goto unlock;
-
- if (flags == MPOL_MF_STRICT) {
- /*
- * STRICT alone means only detecting misplaced folio and no
- * need to further check other vma.
- */
- ret = -EIO;
+ if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+ !vma_migratable(walk->vma)) {
+ qp->nr_failed++;
goto unlock;
}
-
- if (!vma_migratable(walk->vma)) {
- /*
- * Must be STRICT with MOVE*, otherwise .test_walk() have
- * stopped walking current vma.
- * Detecting misplaced folio but allow migrating folios which
- * have been queued.
- */
- qp->has_unmovable = true;
- goto unlock;
- }
-
/*
- * With MPOL_MF_MOVE, we try to migrate only unshared folios. If it
- * is shared it is likely not worth migrating.
+ * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
+ * Choosing not to migrate a shared folio is not counted as a failure.
*
* To check if the folio is shared, ideally we want to make sure
* every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated mapcount of the folio instead.
+ * expensive, so check the estimated sharers of the folio instead.
*/
- if (flags & (MPOL_MF_MOVE_ALL) ||
- (flags & MPOL_MF_MOVE && folio_estimated_sharers(folio) == 1 &&
- !hugetlb_pmd_shared(pte))) {
- if (!isolate_hugetlb(folio, qp->pagelist) &&
- (flags & MPOL_MF_STRICT))
- /*
- * Failed to isolate folio but allow migrating pages
- * which have been queued.
- */
- qp->has_unmovable = true;
- }
+ if ((flags & MPOL_MF_MOVE_ALL) ||
+ (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte)))
+ if (!isolate_hugetlb(folio, qp->pagelist))
+ qp->nr_failed++;
unlock:
spin_unlock(ptl);
- #else
- BUG();
+ if (qp->nr_failed && strictly_unmovable(flags))
+ return -EIO;
#endif
- return ret;
+ return 0;
}
#ifdef CONFIG_NUMA_BALANCING
return nr_updated;
}
- #else
- static unsigned long change_prot_numa(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
- {
- return 0;
- }
#endif /* CONFIG_NUMA_BALANCING */
static int queue_pages_test_walk(unsigned long start, unsigned long end,
if (endvma > end)
endvma = end;
- if (flags & MPOL_MF_LAZY) {
- /* Similar to task_numa_work, skip inaccessible VMAs */
- if (!is_vm_hugetlb_page(vma) && vma_is_accessible(vma) &&
- !(vma->vm_flags & VM_MIXEDMAP))
- change_prot_numa(vma, start, endvma);
- return 1;
- }
-
- /* queue pages from current vma */
- if (flags & MPOL_MF_VALID)
+ /*
+ * Check page nodes, and queue pages to move, in the current vma.
+ * But if no moving, and no strict checking, the scan can be skipped.
+ */
+ if (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
return 0;
return 1;
}
/*
* Walk through page tables and collect pages to be migrated.
*
- * If pages found in a given range are on a set of nodes (determined by
- * @nodes and @flags,) it's isolated and queued to the pagelist which is
- * passed via @private.
+ * If pages found in a given range are not on the required set of @nodes,
+ * and migration is allowed, they are isolated and queued to @pagelist.
*
- * queue_pages_range() has three possible return values:
- * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
- * 0 - queue pages successfully or no misplaced page.
- * errno - i.e. misplaced pages with MPOL_MF_STRICT specified (-EIO) or
- * memory range specified by nodemask and maxnode points outside
- * your accessible address space (-EFAULT)
+ * queue_pages_range() may return:
+ * 0 - all pages already on the right node, or successfully queued for moving
+ * (or neither strict checking nor moving requested: only range checking).
+ * >0 - this number of misplaced folios could not be queued for moving
+ * (a hugetlbfs page or a transparent huge page being counted as 1).
+ * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
+ * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
*/
- static int
+ static long
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
nodemask_t *nodes, unsigned long flags,
- struct list_head *pagelist, bool lock_vma)
+ struct list_head *pagelist)
{
int err;
struct queue_pages qp = {
.start = start,
.end = end,
.first = NULL,
- .has_unmovable = false,
};
- const struct mm_walk_ops *ops = lock_vma ?
+ const struct mm_walk_ops *ops = (flags & MPOL_MF_WRLOCK) ?
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
err = walk_page_range(mm, start, end, ops, &qp);
- if (qp.has_unmovable)
- err = 1;
if (!qp.first)
/* whole range in hole */
err = -EFAULT;
- return err;
+ return err ? : qp.nr_failed;
}
/*
* This must be called with the mmap_lock held for writing.
*/
static int vma_replace_policy(struct vm_area_struct *vma,
- struct mempolicy *pol)
+ struct mempolicy *pol)
{
int err;
struct mempolicy *old;
vma_assert_write_locked(vma);
- pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
- vma->vm_start, vma->vm_end, vma->vm_pgoff,
- vma->vm_ops, vma->vm_file,
- vma->vm_ops ? vma->vm_ops->set_policy : NULL);
-
new = mpol_dup(pol);
if (IS_ERR(new))
return PTR_ERR(new);
struct vm_area_struct **prev, unsigned long start,
unsigned long end, struct mempolicy *new_pol)
{
- struct vm_area_struct *merged;
unsigned long vmstart, vmend;
- pgoff_t pgoff;
- int err;
vmend = min(end, vma->vm_end);
if (start > vma->vm_start) {
vmstart = vma->vm_start;
}
- if (mpol_equal(vma_policy(vma), new_pol)) {
+ if (mpol_equal(vma->vm_policy, new_pol)) {
*prev = vma;
return 0;
}
- pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
- merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff, new_pol,
- vma->vm_userfaultfd_ctx, anon_vma_name(vma));
- if (merged) {
- *prev = merged;
- return vma_replace_policy(merged, new_pol);
- }
-
- if (vma->vm_start != vmstart) {
- err = split_vma(vmi, vma, vmstart, 1);
- if (err)
- return err;
- }
-
- if (vma->vm_end != vmend) {
- err = split_vma(vmi, vma, vmend, 0);
- if (err)
- return err;
- }
+ vma = vma_modify_policy(vmi, *prev, vma, vmstart, vmend, new_pol);
+ if (IS_ERR(vma))
+ return PTR_ERR(vma);
*prev = vma;
return vma_replace_policy(vma, new_pol);
*
* Called with task's alloc_lock held
*/
- static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
+ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
{
nodes_clear(*nodes);
- if (p == &default_policy)
+ if (pol == &default_policy)
return;
- switch (p->mode) {
+ switch (pol->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
- *nodes = p->nodes;
+ *nodes = pol->nodes;
break;
case MPOL_LOCAL:
/* return empty node mask for local allocation */
}
if (flags & MPOL_F_ADDR) {
+ pgoff_t ilx; /* ignored here */
/*
* Do NOT fall back to task policy if the
* vma/shared policy at addr is NULL. We
mmap_read_unlock(mm);
return -EFAULT;
}
- if (vma->vm_ops && vma->vm_ops->get_policy)
- pol = vma->vm_ops->get_policy(vma, addr);
- else
- pol = vma->vm_policy;
+ pol = __get_vma_policy(vma, addr, &ilx);
} else if (addr)
return -EINVAL;
}
#ifdef CONFIG_MIGRATION
- static int migrate_folio_add(struct folio *folio, struct list_head *foliolist,
+ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
unsigned long flags)
{
/*
- * We try to migrate only unshared folios. If it is shared it
- * is likely not worth migrating.
+ * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
+ * Choosing not to migrate a shared folio is not counted as a failure.
*
* To check if the folio is shared, ideally we want to make sure
* every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated mapcount of the folio instead.
+ * expensive, so check the estimated sharers of the folio instead.
*/
if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) {
if (folio_isolate_lru(folio)) {
node_stat_mod_folio(folio,
NR_ISOLATED_ANON + folio_is_file_lru(folio),
folio_nr_pages(folio));
- } else if (flags & MPOL_MF_STRICT) {
+ } else {
/*
* Non-movable folio may reach here. And, there may be
* temporary off LRU folios or non-LRU movable folios.
* Treat them as unmovable folios since they can't be
- * isolated, so they can't be moved at the moment. It
- * should return -EIO for this case too.
+ * isolated, so they can't be moved at the moment.
*/
- return -EIO;
+ return false;
}
}
-
- return 0;
+ return true;
}
/*
* Migrate pages from one node to a target node.
* Returns error or the number of pages not migrated.
*/
- static int migrate_to_node(struct mm_struct *mm, int source, int dest,
- int flags)
+ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
+ int flags)
{
nodemask_t nmask;
struct vm_area_struct *vma;
LIST_HEAD(pagelist);
- int err = 0;
+ long nr_failed;
+ long err = 0;
struct migration_target_control mtc = {
.nid = dest,
.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
nodes_clear(nmask);
node_set(source, nmask);
+ VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
+
+ mmap_read_lock(mm);
+ vma = find_vma(mm, 0);
+
/*
- * This does not "check" the range but isolates all pages that
+ * This does not migrate the range, but isolates all pages that
* need migration. Between passing in the full user address
- * space range and MPOL_MF_DISCONTIG_OK, this call can not fail.
+ * space range and MPOL_MF_DISCONTIG_OK, this call cannot fail,
+ * but passes back the count of pages which could not be isolated.
*/
- vma = find_vma(mm, 0);
- VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
- queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
- flags | MPOL_MF_DISCONTIG_OK, &pagelist, false);
+ nr_failed = queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
+ flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+ mmap_read_unlock(mm);
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, alloc_migration_target, NULL,
- (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL);
+ (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL);
if (err)
putback_movable_pages(&pagelist);
}
+ if (err >= 0)
+ err += nr_failed;
return err;
}
int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
const nodemask_t *to, int flags)
{
- int busy = 0;
- int err = 0;
+ long nr_failed = 0;
+ long err = 0;
nodemask_t tmp;
lru_cache_disable();
- mmap_read_lock(mm);
-
/*
* Find a 'source' bit set in 'tmp' whose corresponding 'dest'
* bit in 'to' is not also set in 'tmp'. Clear the found 'source'
node_clear(source, tmp);
err = migrate_to_node(mm, source, dest, flags);
if (err > 0)
- busy += err;
+ nr_failed += err;
if (err < 0)
break;
}
- mmap_read_unlock(mm);
lru_cache_enable();
if (err < 0)
return err;
- return busy;
-
+ return (nr_failed < INT_MAX) ? nr_failed : INT_MAX;
}
/*
- * Allocate a new page for page migration based on vma policy.
- * Start by assuming the page is mapped by the same vma as contains @start.
- * Search forward from there, if not. N.B., this assumes that the
- * list of pages handed to migrate_pages()--which is how we get here--
- * is in virtual address order.
+ * Allocate a new folio for page migration, according to NUMA mempolicy.
*/
- static struct folio *new_folio(struct folio *src, unsigned long start)
+ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+ unsigned long private)
{
- struct vm_area_struct *vma;
- unsigned long address;
- VMA_ITERATOR(vmi, current->mm, start);
- gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
+ struct migration_mpol *mmpol = (struct migration_mpol *)private;
+ struct mempolicy *pol = mmpol->pol;
+ pgoff_t ilx = mmpol->ilx;
+ struct page *page;
+ unsigned int order;
+ int nid = numa_node_id();
+ gfp_t gfp;
- for_each_vma(vmi, vma) {
- address = page_address_in_vma(&src->page, vma);
- if (address != -EFAULT)
- break;
- }
+ order = folio_order(src);
+ ilx += src->index >> order;
if (folio_test_hugetlb(src)) {
- return alloc_hugetlb_folio_vma(folio_hstate(src),
- vma, address);
+ nodemask_t *nodemask;
+ struct hstate *h;
+
+ h = folio_hstate(src);
+ gfp = htlb_alloc_mask(h);
+ nodemask = policy_nodemask(gfp, pol, ilx, &nid);
+ return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
}
if (folio_test_large(src))
gfp = GFP_TRANSHUGE;
+ else
+ gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
- /*
- * if !vma, vma_alloc_folio() will use task or system default policy
- */
- return vma_alloc_folio(gfp, folio_order(src), vma, address,
- folio_test_large(src));
+ page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
+ return page_rmappable_folio(page);
}
#else
- static int migrate_folio_add(struct folio *folio, struct list_head *foliolist,
+ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
unsigned long flags)
{
- return -EIO;
+ return false;
}
int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
return -ENOSYS;
}
- static struct folio *new_folio(struct folio *src, unsigned long start)
+ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+ unsigned long private)
{
return NULL;
}
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
struct vma_iterator vmi;
+ struct migration_mpol mmpol;
struct mempolicy *new;
unsigned long end;
- int err;
- int ret;
+ long err;
+ long nr_failed;
LIST_HEAD(pagelist);
if (flags & ~(unsigned long)MPOL_MF_VALID)
if (IS_ERR(new))
return PTR_ERR(new);
- if (flags & MPOL_MF_LAZY)
- new->flags |= MPOL_F_MOF;
-
/*
* If we are using the default policy then operation
* on discontinuous address spaces is okay after all
if (!new)
flags |= MPOL_MF_DISCONTIG_OK;
- pr_debug("mbind %lx-%lx mode:%d flags:%d nodes:%lx\n",
- start, start + len, mode, mode_flags,
- nmask ? nodes_addr(*nmask)[0] : NUMA_NO_NODE);
-
- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
-
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
lru_cache_disable();
- }
{
NODEMASK_SCRATCH(scratch);
if (scratch) {
goto mpol_out;
/*
- * Lock the VMAs before scanning for pages to migrate, to ensure we don't
- * miss a concurrently inserted page.
+ * Lock the VMAs before scanning for pages to migrate,
+ * to ensure we don't miss a concurrently inserted page.
*/
- ret = queue_pages_range(mm, start, end, nmask,
- flags | MPOL_MF_INVERT, &pagelist, true);
+ nr_failed = queue_pages_range(mm, start, end, nmask,
+ flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
- if (ret < 0) {
- err = ret;
- goto up_out;
- }
-
- vma_iter_init(&vmi, mm, start);
- prev = vma_prev(&vmi);
- for_each_vma_range(vmi, vma, end) {
- err = mbind_range(&vmi, vma, &prev, start, end, new);
- if (err)
- break;
+ if (nr_failed < 0) {
+ err = nr_failed;
+ nr_failed = 0;
+ } else {
+ vma_iter_init(&vmi, mm, start);
+ prev = vma_prev(&vmi);
+ for_each_vma_range(vmi, vma, end) {
+ err = mbind_range(&vmi, vma, &prev, start, end, new);
+ if (err)
+ break;
+ }
}
- if (!err) {
- int nr_failed = 0;
-
- if (!list_empty(&pagelist)) {
- WARN_ON_ONCE(flags & MPOL_MF_LAZY);
- nr_failed = migrate_pages(&pagelist, new_folio, NULL,
- start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL);
- if (nr_failed)
- putback_movable_pages(&pagelist);
+ if (!err && !list_empty(&pagelist)) {
+ /* Convert MPOL_DEFAULT's NULL to task or default policy */
+ if (!new) {
+ new = get_task_policy(current);
+ mpol_get(new);
}
+ mmpol.pol = new;
+ mmpol.ilx = 0;
- if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
- err = -EIO;
- } else {
- up_out:
- if (!list_empty(&pagelist))
- putback_movable_pages(&pagelist);
+ /*
+ * In the interleaved case, attempt to allocate on exactly the
+ * targeted nodes, for the first VMA to be migrated; for later
+ * VMAs, the nodes will still be interleaved from the targeted
+ * nodemask, but one by one may be selected differently.
+ */
+ if (new->mode == MPOL_INTERLEAVE) {
+ struct page *page;
+ unsigned int order;
+ unsigned long addr = -EFAULT;
+
+ list_for_each_entry(page, &pagelist, lru) {
+ if (!PageKsm(page))
+ break;
+ }
+ if (!list_entry_is_head(page, &pagelist, lru)) {
+ vma_iter_init(&vmi, mm, start);
+ for_each_vma_range(vmi, vma, end) {
+ addr = page_address_in_vma(page, vma);
+ if (addr != -EFAULT)
+ break;
+ }
+ }
+ if (addr != -EFAULT) {
+ order = compound_order(page);
+ /* We already know the pol, but not the ilx */
+ mpol_cond_put(get_vma_policy(vma, addr, order,
+ &mmpol.ilx));
+ /* Set base from which to increment by index */
+ mmpol.ilx -= page->index >> order;
+ }
+ }
}
mmap_write_unlock(mm);
+
+ if (!err && !list_empty(&pagelist)) {
+ nr_failed |= migrate_pages(&pagelist,
+ alloc_migration_target_by_mpol, NULL,
+ (unsigned long)&mmpol, MIGRATE_SYNC,
+ MR_MEMPOLICY_MBIND, NULL);
+ }
+
+ if (nr_failed && (flags & MPOL_MF_STRICT))
+ err = -EIO;
+ if (!list_empty(&pagelist))
+ putback_movable_pages(&pagelist);
mpol_out:
mpol_put(new);
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
out_put:
put_task_struct(task);
goto out;
-
}
SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes);
}
-
/* Retrieve NUMA policy */
static int kernel_get_mempolicy(int __user *policy,
unsigned long __user *nmask,
}
struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, pgoff_t *ilx)
{
- struct mempolicy *pol = NULL;
-
- if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy) {
- pol = vma->vm_ops->get_policy(vma, addr);
- } else if (vma->vm_policy) {
- pol = vma->vm_policy;
-
- /*
- * shmem_alloc_page() passes MPOL_F_SHARED policy with
- * a pseudo vma whose vma->vm_ops=NULL. Take a reference
- * count on these policies which will be dropped by
- * mpol_cond_put() later
- */
- if (mpol_needs_cond_ref(pol))
- mpol_get(pol);
- }
- }
-
- return pol;
+ *ilx = 0;
+ return (vma->vm_ops && vma->vm_ops->get_policy) ?
+ vma->vm_ops->get_policy(vma, addr, ilx) : vma->vm_policy;
}
/*
- * get_vma_policy(@vma, @addr)
+ * get_vma_policy(@vma, @addr, @order, @ilx)
* @vma: virtual memory area whose policy is sought
* @addr: address in @vma for shared policy lookup
+ * @order: 0, or appropriate huge_page_order for interleaving
+ * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
*
* Returns effective policy for a VMA at specified address.
* Falls back to current->mempolicy or system default policy, as necessary.
* freeing by another task. It is the caller's responsibility to free the
* extra reference for shared policies.
*/
- static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
- unsigned long addr)
+ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+ unsigned long addr, int order, pgoff_t *ilx)
{
- struct mempolicy *pol = __get_vma_policy(vma, addr);
+ struct mempolicy *pol;
+ pol = __get_vma_policy(vma, addr, ilx);
if (!pol)
pol = get_task_policy(current);
-
+ if (pol->mode == MPOL_INTERLEAVE) {
+ *ilx += vma->vm_pgoff >> order;
+ *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
+ }
return pol;
}
if (vma->vm_ops && vma->vm_ops->get_policy) {
bool ret = false;
+ pgoff_t ilx; /* ignored here */
- pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+ pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
if (pol && (pol->flags & MPOL_F_MOF))
ret = true;
mpol_cond_put(pol);
return zone >= dynamic_policy_zone;
}
- /*
- * Return a nodemask representing a mempolicy for filtering nodes for
- * page allocation
- */
- nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
- {
- int mode = policy->mode;
-
- /* Lower zones don't get a nodemask applied for MPOL_BIND */
- if (unlikely(mode == MPOL_BIND) &&
- apply_policy_zone(policy, gfp_zone(gfp)) &&
- cpuset_nodemask_valid_mems_allowed(&policy->nodes))
- return &policy->nodes;
-
- if (mode == MPOL_PREFERRED_MANY)
- return &policy->nodes;
-
- return NULL;
- }
-
- /*
- * Return the preferred node id for 'prefer' mempolicy, and return
- * the given id for all other policies.
- *
- * policy_node() is always coupled with policy_nodemask(), which
- * secures the nodemask limit for 'bind' and 'prefer-many' policy.
- */
- static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
- {
- if (policy->mode == MPOL_PREFERRED) {
- nd = first_node(policy->nodes);
- } else {
- /*
- * __GFP_THISNODE shouldn't even be used with the bind policy
- * because we might easily break the expectation to stay on the
- * requested node and not break the policy.
- */
- WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
- }
-
- if ((policy->mode == MPOL_BIND ||
- policy->mode == MPOL_PREFERRED_MANY) &&
- policy->home_node != NUMA_NO_NODE)
- return policy->home_node;
-
- return nd;
- }
-
/* Do dynamic interleaving for a process */
- static unsigned interleave_nodes(struct mempolicy *policy)
+ static unsigned int interleave_nodes(struct mempolicy *policy)
{
- unsigned next;
- struct task_struct *me = current;
+ unsigned int nid;
- next = next_node_in(me->il_prev, policy->nodes);
- if (next < MAX_NUMNODES)
- me->il_prev = next;
- return next;
+ nid = next_node_in(current->il_prev, policy->nodes);
+ if (nid < MAX_NUMNODES)
+ current->il_prev = nid;
+ return nid;
}
/*
}
/*
- * Do static interleaving for a VMA with known offset @n. Returns the n'th
- * node in pol->nodes (starting from n=0), wrapping around if n exceeds the
- * number of present nodes.
+ * Do static interleaving for interleave index @ilx. Returns the ilx'th
+ * node in pol->nodes (starting from ilx=0), wrapping around if ilx
+ * exceeds the number of present nodes.
*/
- static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
+ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
{
nodemask_t nodemask = pol->nodes;
unsigned int target, nnodes;
nnodes = nodes_weight(nodemask);
if (!nnodes)
return numa_node_id();
- target = (unsigned int)n % nnodes;
+ target = ilx % nnodes;
nid = first_node(nodemask);
for (i = 0; i < target; i++)
nid = next_node(nid, nodemask);
return nid;
}
- /* Determine a node number for interleave */
- static inline unsigned interleave_nid(struct mempolicy *pol,
- struct vm_area_struct *vma, unsigned long addr, int shift)
+ /*
+ * Return a nodemask representing a mempolicy for filtering nodes for
+ * page allocation, together with preferred node id (or the input node id).
+ */
+ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+ pgoff_t ilx, int *nid)
{
- if (vma) {
- unsigned long off;
+ nodemask_t *nodemask = NULL;
+ switch (pol->mode) {
+ case MPOL_PREFERRED:
+ /* Override input node id */
+ *nid = first_node(pol->nodes);
+ break;
+ case MPOL_PREFERRED_MANY:
+ nodemask = &pol->nodes;
+ if (pol->home_node != NUMA_NO_NODE)
+ *nid = pol->home_node;
+ break;
+ case MPOL_BIND:
+ /* Restrict to nodemask (but not on lower zones) */
+ if (apply_policy_zone(pol, gfp_zone(gfp)) &&
+ cpuset_nodemask_valid_mems_allowed(&pol->nodes))
+ nodemask = &pol->nodes;
+ if (pol->home_node != NUMA_NO_NODE)
+ *nid = pol->home_node;
/*
- * for small pages, there is no difference between
- * shift and PAGE_SHIFT, so the bit-shift is safe.
- * for huge pages, since vm_pgoff is in units of small
- * pages, we need to shift off the always 0 bits to get
- * a useful offset.
+ * __GFP_THISNODE shouldn't even be used with the bind policy
+ * because we might easily break the expectation to stay on the
+ * requested node and not break the policy.
*/
- BUG_ON(shift < PAGE_SHIFT);
- off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
- off += (addr - vma->vm_start) >> shift;
- return offset_il_node(pol, off);
- } else
- return interleave_nodes(pol);
+ WARN_ON_ONCE(gfp & __GFP_THISNODE);
+ break;
+ case MPOL_INTERLEAVE:
+ /* Override input node id */
+ *nid = (ilx == NO_INTERLEAVE_INDEX) ?
+ interleave_nodes(pol) : interleave_nid(pol, ilx);
+ break;
+ }
+
+ return nodemask;
}
#ifdef CONFIG_HUGETLBFS
* to the struct mempolicy for conditional unref after allocation.
* If the effective policy is 'bind' or 'prefer-many', returns a pointer
* to the mempolicy's @nodemask for filtering the zonelist.
- *
- * Must be protected by read_mems_allowed_begin()
*/
int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
- struct mempolicy **mpol, nodemask_t **nodemask)
+ struct mempolicy **mpol, nodemask_t **nodemask)
{
+ pgoff_t ilx;
int nid;
- int mode;
-
- *mpol = get_vma_policy(vma, addr);
- *nodemask = NULL;
- mode = (*mpol)->mode;
- if (unlikely(mode == MPOL_INTERLEAVE)) {
- nid = interleave_nid(*mpol, vma, addr,
- huge_page_shift(hstate_vma(vma)));
- } else {
- nid = policy_node(gfp_flags, *mpol, numa_node_id());
- if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
- *nodemask = &(*mpol)->nodes;
- }
+ nid = numa_node_id();
+ *mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
+ *nodemask = policy_nodemask(gfp_flags, *mpol, ilx, &nid);
return nid;
}
return ret;
}
- /* Allocate a page in interleaved policy.
- Own path because it needs to do special accounting. */
- static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
- unsigned nid)
- {
- struct page *page;
-
- page = __alloc_pages(gfp, order, nid, NULL);
- /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
- if (!static_branch_likely(&vm_numa_stat_key))
- return page;
- if (page && page_to_nid(page) == nid) {
- preempt_disable();
- __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
- preempt_enable();
- }
- return page;
- }
-
static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
- int nid, struct mempolicy *pol)
+ int nid, nodemask_t *nodemask)
{
struct page *page;
gfp_t preferred_gfp;
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
+ page = __alloc_pages(preferred_gfp, order, nid, nodemask);
if (!page)
page = __alloc_pages(gfp, order, nid, NULL);
}
/**
- * vma_alloc_folio - Allocate a folio for a VMA.
+ * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
* @gfp: GFP flags.
- * @order: Order of the folio.
- * @vma: Pointer to VMA or NULL if not available.
- * @addr: Virtual address of the allocation. Must be inside @vma.
- * @hugepage: For hugepages try only the preferred node if possible.
+ * @order: Order of the page allocation.
+ * @pol: Pointer to the NUMA mempolicy.
+ * @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
+ * @nid: Preferred node (usually numa_node_id() but @mpol may override it).
*
- * Allocate a folio for a specific address in @vma, using the appropriate
- * NUMA policy. When @vma is not NULL the caller must hold the mmap_lock
- * of the mm_struct of the VMA to prevent it from going away. Should be
- * used for all allocations for folios that will be mapped into user space.
- *
- * Return: The folio on success or NULL if allocation fails.
+ * Return: The page on success or NULL if allocation fails.
*/
- struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
- unsigned long addr, bool hugepage)
+ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+ struct mempolicy *pol, pgoff_t ilx, int nid)
{
- struct mempolicy *pol;
- int node = numa_node_id();
- struct folio *folio;
- int preferred_nid;
- nodemask_t *nmask;
-
- pol = get_vma_policy(vma, addr);
-
- if (pol->mode == MPOL_INTERLEAVE) {
- struct page *page;
- unsigned nid;
-
- nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
- mpol_cond_put(pol);
- gfp |= __GFP_COMP;
- page = alloc_page_interleave(gfp, order, nid);
- folio = (struct folio *)page;
- if (folio && order > 1)
- folio_prep_large_rmappable(folio);
- goto out;
- }
-
- if (pol->mode == MPOL_PREFERRED_MANY) {
- struct page *page;
+ nodemask_t *nodemask;
+ struct page *page;
- node = policy_node(gfp, pol, node);
- gfp |= __GFP_COMP;
- page = alloc_pages_preferred_many(gfp, order, node, pol);
- mpol_cond_put(pol);
- folio = (struct folio *)page;
- if (folio && order > 1)
- folio_prep_large_rmappable(folio);
- goto out;
- }
+ nodemask = policy_nodemask(gfp, pol, ilx, &nid);
- if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
- int hpage_node = node;
+ if (pol->mode == MPOL_PREFERRED_MANY)
+ return alloc_pages_preferred_many(gfp, order, nid, nodemask);
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ /* filter "hugepage" allocation, unless from alloc_pages() */
+ order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) {
/*
* For hugepage allocation and non-interleave policy which
* allows the current node (or other explicitly preferred
* If the policy is interleave or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
- if (pol->mode == MPOL_PREFERRED)
- hpage_node = first_node(pol->nodes);
-
- nmask = policy_nodemask(gfp, pol);
- if (!nmask || node_isset(hpage_node, *nmask)) {
- mpol_cond_put(pol);
+ if (pol->mode != MPOL_INTERLEAVE &&
+ (!nodemask || node_isset(nid, *nodemask))) {
/*
* First, try to allocate THP only on local node, but
* don't reclaim unnecessarily, just compact.
*/
- folio = __folio_alloc_node(gfp | __GFP_THISNODE |
- __GFP_NORETRY, order, hpage_node);
-
+ page = __alloc_pages_node(nid,
+ gfp | __GFP_THISNODE | __GFP_NORETRY, order);
+ if (page || !(gfp & __GFP_DIRECT_RECLAIM))
+ return page;
/*
* If hugepage allocations are configured to always
* synchronous compact or the vma has been madvised
* to prefer hugepage backing, retry allowing remote
* memory with both reclaim and compact as well.
*/
- if (!folio && (gfp & __GFP_DIRECT_RECLAIM))
- folio = __folio_alloc(gfp, order, hpage_node,
- nmask);
+ }
+ }
- goto out;
+ page = __alloc_pages(gfp, order, nid, nodemask);
+
+ if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
+ /* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
+ if (static_branch_likely(&vm_numa_stat_key) &&
+ page_to_nid(page) == nid) {
+ preempt_disable();
+ __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
+ preempt_enable();
}
}
- nmask = policy_nodemask(gfp, pol);
- preferred_nid = policy_node(gfp, pol, node);
- folio = __folio_alloc(gfp, order, preferred_nid, nmask);
+ return page;
+ }
+
+ /**
+ * vma_alloc_folio - Allocate a folio for a VMA.
+ * @gfp: GFP flags.
+ * @order: Order of the folio.
+ * @vma: Pointer to VMA.
+ * @addr: Virtual address of the allocation. Must be inside @vma.
+ * @hugepage: Unused (was: For hugepages try only preferred node if possible).
+ *
+ * Allocate a folio for a specific address in @vma, using the appropriate
+ * NUMA policy. The caller must hold the mmap_lock of the mm_struct of the
+ * VMA to prevent it from going away. Should be used for all allocations
+ * for folios that will be mapped into user space, excepting hugetlbfs, and
+ * excepting where direct use of alloc_pages_mpol() is more appropriate.
+ *
+ * Return: The folio on success or NULL if allocation fails.
+ */
+ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+ unsigned long addr, bool hugepage)
+ {
+ struct mempolicy *pol;
+ pgoff_t ilx;
+ struct page *page;
+
+ pol = get_vma_policy(vma, addr, order, &ilx);
+ page = alloc_pages_mpol(gfp | __GFP_COMP, order,
+ pol, ilx, numa_node_id());
mpol_cond_put(pol);
- out:
- return folio;
+ return page_rmappable_folio(page);
}
EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
- struct page *alloc_pages(gfp_t gfp, unsigned order)
+ struct page *alloc_pages(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;
- struct page *page;
-
- if (!in_interrupt() && !(gfp & __GFP_THISNODE))
- pol = get_task_policy(current);
/*
* No reference counting needed for current->mempolicy
* nor system default_policy
*/
- if (pol->mode == MPOL_INTERLEAVE)
- page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
- else if (pol->mode == MPOL_PREFERRED_MANY)
- page = alloc_pages_preferred_many(gfp, order,
- policy_node(gfp, pol, numa_node_id()), pol);
- else
- page = __alloc_pages(gfp, order,
- policy_node(gfp, pol, numa_node_id()),
- policy_nodemask(gfp, pol));
+ if (!in_interrupt() && !(gfp & __GFP_THISNODE))
+ pol = get_task_policy(current);
- return page;
+ return alloc_pages_mpol(gfp, order,
+ pol, NO_INTERLEAVE_INDEX, numa_node_id());
}
EXPORT_SYMBOL(alloc_pages);
- struct folio *folio_alloc(gfp_t gfp, unsigned order)
+ struct folio *folio_alloc(gfp_t gfp, unsigned int order)
{
- struct page *page = alloc_pages(gfp | __GFP_COMP, order);
- struct folio *folio = (struct folio *)page;
-
- if (folio && order > 1)
- folio_prep_large_rmappable(folio);
- return folio;
+ return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order));
}
EXPORT_SYMBOL(folio_alloc);
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
+ nodemask_t *nodemask;
+ int nid;
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);
- return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
- policy_nodemask(gfp, pol), nr_pages, NULL,
- page_array);
+ nid = numa_node_id();
+ nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
+ return __alloc_pages_bulk(gfp, nid, nodemask,
+ nr_pages, NULL, page_array);
}
int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
{
- struct mempolicy *pol = mpol_dup(vma_policy(src));
+ struct mempolicy *pol = mpol_dup(src->vm_policy);
if (IS_ERR(pol))
return PTR_ERR(pol);
* lookup first element intersecting start-end. Caller holds sp->lock for
* reading or for writing
*/
- static struct sp_node *
- sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
+ static struct sp_node *sp_lookup(struct shared_policy *sp,
+ pgoff_t start, pgoff_t end)
{
struct rb_node *n = sp->root.rb_node;
}
rb_link_node(&new->nd, parent, p);
rb_insert_color(&new->nd, &sp->root);
- pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
- new->policy ? new->policy->mode : 0);
}
/* Find shared policy intersecting idx */
- struct mempolicy *
- mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
+ pgoff_t idx)
{
struct mempolicy *pol = NULL;
struct sp_node *sn;
}
/**
- * mpol_misplaced - check whether current page node is valid in policy
+ * mpol_misplaced - check whether current folio node is valid in policy
*
- * @page: page to be checked
- * @vma: vm area where page mapped
- * @addr: virtual address where page mapped
+ * @folio: folio to be checked
+ * @vma: vm area where folio mapped
+ * @addr: virtual address in @vma for shared policy lookup and interleave policy
*
- * Lookup current policy node id for vma,addr and "compare to" page's
+ * Lookup current policy node id for vma,addr and "compare to" folio's
* node id. Policy determination "mimics" alloc_page_vma().
* Called from fault path where we know the vma and faulting address.
*
* Return: NUMA_NO_NODE if the page is in a node that is valid for this
- * policy, or a suitable node ID to allocate a replacement page from.
+ * policy, or a suitable node ID to allocate a replacement folio from.
*/
- int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long addr)
{
struct mempolicy *pol;
+ pgoff_t ilx;
struct zoneref *z;
- int curnid = page_to_nid(page);
- unsigned long pgoff;
+ int curnid = folio_nid(folio);
int thiscpu = raw_smp_processor_id();
int thisnid = cpu_to_node(thiscpu);
int polnid = NUMA_NO_NODE;
int ret = NUMA_NO_NODE;
- pol = get_vma_policy(vma, addr);
+ pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
if (!(pol->flags & MPOL_F_MOF))
goto out;
switch (pol->mode) {
case MPOL_INTERLEAVE:
- pgoff = vma->vm_pgoff;
- pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
- polnid = offset_il_node(pol, pgoff);
+ polnid = interleave_nid(pol, ilx);
break;
case MPOL_PREFERRED:
BUG();
}
- /* Migrate the page towards the node whose CPU is referencing it */
+ /* Migrate the folio towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
polnid = thisnid;
- if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
+ if (!should_numa_migrate_memory(current, folio, curnid,
+ thiscpu))
goto out;
}
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
- pr_debug("deleting %lx-l%lx\n", n->start, n->end);
rb_erase(&n->nd, &sp->root);
sp_free(n);
}
}
/* Replace a policy range. */
- static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
- unsigned long end, struct sp_node *new)
+ static int shared_policy_replace(struct shared_policy *sp, pgoff_t start,
+ pgoff_t end, struct sp_node *new)
{
struct sp_node *n;
struct sp_node *n_new = NULL;
rwlock_init(&sp->lock);
if (mpol) {
- struct vm_area_struct pvma;
- struct mempolicy *new;
+ struct sp_node *sn;
+ struct mempolicy *npol;
NODEMASK_SCRATCH(scratch);
if (!scratch)
goto put_mpol;
- /* contextualize the tmpfs mount point mempolicy */
- new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
- if (IS_ERR(new))
+
+ /* contextualize the tmpfs mount point mempolicy to this file */
+ npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
+ if (IS_ERR(npol))
goto free_scratch; /* no valid nodemask intersection */
task_lock(current);
- ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
+ ret = mpol_set_nodemask(npol, &mpol->w.user_nodemask, scratch);
task_unlock(current);
if (ret)
- goto put_new;
-
- /* Create pseudo-vma that contains just the policy */
- vma_init(&pvma, NULL);
- pvma.vm_end = TASK_SIZE; /* policy covers entire file */
- mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
-
- put_new:
- mpol_put(new); /* drop initial ref */
+ goto put_npol;
+
+ /* alloc node covering entire file; adds ref to file's npol */
+ sn = sp_alloc(0, MAX_LFS_FILESIZE >> PAGE_SHIFT, npol);
+ if (sn)
+ sp_insert(sp, sn);
+ put_npol:
+ mpol_put(npol); /* drop initial ref on file's npol */
free_scratch:
NODEMASK_SCRATCH_FREE(scratch);
put_mpol:
}
}
- int mpol_set_shared_policy(struct shared_policy *info,
- struct vm_area_struct *vma, struct mempolicy *npol)
+ int mpol_set_shared_policy(struct shared_policy *sp,
+ struct vm_area_struct *vma, struct mempolicy *pol)
{
int err;
struct sp_node *new = NULL;
unsigned long sz = vma_pages(vma);
- pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n",
- vma->vm_pgoff,
- sz, npol ? npol->mode : -1,
- npol ? npol->flags : -1,
- npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE);
-
- if (npol) {
- new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+ if (pol) {
+ new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, pol);
if (!new)
return -ENOMEM;
}
- err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+ err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff + sz, new);
if (err && new)
sp_free(new);
return err;
}
/* Free a backing policy store on inode delete. */
- void mpol_free_shared_policy(struct shared_policy *p)
+ void mpol_free_shared_policy(struct shared_policy *sp)
{
struct sp_node *n;
struct rb_node *next;
- if (!p->root.rb_node)
+ if (!sp->root.rb_node)
return;
- write_lock(&p->lock);
- next = rb_first(&p->root);
+ write_lock(&sp->lock);
+ next = rb_first(&sp->root);
while (next) {
n = rb_entry(next, struct sp_node, nd);
next = rb_next(&n->nd);
- sp_delete(p, n);
+ sp_delete(sp, n);
}
- write_unlock(&p->lock);
+ write_unlock(&sp->lock);
}
#ifdef CONFIG_NUMA_BALANCING
}
#endif /* CONFIG_NUMA_BALANCING */
- /* assumes fs == KERNEL_DS */
void __init numa_policy_init(void)
{
nodemask_t interleave_nodes;
/*
* Parse and format mempolicy from/to strings
*/
-
static const char * const policy_modes[] =
{
[MPOL_DEFAULT] = "default",
[MPOL_PREFERRED_MANY] = "prefer (many)",
};
-
#ifdef CONFIG_TMPFS
/**
* mpol_parse_str - parse string to mempolicy, for tmpfs mpol mount option.
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
{
- if (vma->vm_flags & VM_SHARED)
+ if (vma_is_shared_maywrite(vma))
mapping_unmap_writable(mapping);
flush_dcache_mmap_lock(mapping);
static void __vma_link_file(struct vm_area_struct *vma,
struct address_space *mapping)
{
- if (vma->vm_flags & VM_SHARED)
+ if (vma_is_shared_maywrite(vma))
mapping_allow_writable(mapping);
flush_dcache_mmap_lock(mapping);
* **** is not represented - it will be merged and the vma containing the
* area is returned, or the function will return NULL
*/
- struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
- struct vm_area_struct *prev, unsigned long addr,
- unsigned long end, unsigned long vm_flags,
- struct anon_vma *anon_vma, struct file *file,
- pgoff_t pgoff, struct mempolicy *policy,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
- struct anon_vma_name *anon_name)
+ static struct vm_area_struct
+ *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
+ struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+ unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file,
+ pgoff_t pgoff, struct mempolicy *policy,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ struct anon_vma_name *anon_name)
{
struct vm_area_struct *curr, *next, *res;
struct vm_area_struct *vma, *adjust, *remove, *remove2;
vma_start_write(curr);
remove = curr;
remove2 = next;
+ /*
+ * Note that the dup_anon_vma below cannot overwrite err
+ * since the first caller would do nothing unless next
+ * has an anon_vma.
+ */
if (!next->anon_vma)
err = dup_anon_vma(prev, curr, &anon_dup);
}
* Does the application expect PROT_READ to imply PROT_EXEC?
*
* (the exception is when the underlying filesystem is noexec
- * mounted, in which case we dont add PROT_EXEC.)
+ * mounted, in which case we don't add PROT_EXEC.)
*/
if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
if (!(file && path_noexec(&file->f_path)))
return 0;
}
-#if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
+#if defined(CONFIG_STACK_GROWSUP)
/*
- * PA-RISC uses this for its stack; IA64 for its Register Backing Store.
+ * PA-RISC uses this for its stack.
* vma is the last one with address > vma->vm_end. Have to extend vma.
*/
static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
validate_mm(mm);
return error;
}
-#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
+#endif /* CONFIG_STACK_GROWSUP */
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
#else
int expand_stack_locked(struct vm_area_struct *vma, unsigned long address)
{
- if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
- return -EINVAL;
return expand_downwards(vma, address);
}
* has already been checked or doesn't make sense to fail.
* VMA Iterator will point to the end VMA.
*/
- int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long addr, int new_below)
+ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long addr, int new_below)
{
struct vma_prepare vp;
struct vm_area_struct *new;
* Split a vma into two pieces at address 'addr', a new vma is allocated
* either for the first part or the tail.
*/
- int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long addr, int new_below)
+ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long addr, int new_below)
{
if (vma->vm_mm->map_count >= sysctl_max_map_count)
return -ENOMEM;
return __split_vma(vmi, vma, addr, new_below);
}
+ /*
+ * We are about to modify one or multiple of a VMA's flags, policy, userfaultfd
+ * context and anonymous VMA name within the range [start, end).
+ *
+ * As a result, we might be able to merge the newly modified VMA range with an
+ * adjacent VMA with identical properties.
+ *
+ * If no merge is possible and the range does not span the entirety of the VMA,
+ * we then need to split the VMA to accommodate the change.
+ *
+ * The function returns either the merged VMA, the original VMA if a split was
+ * required instead, or an error if the split failed.
+ */
+ struct vm_area_struct *vma_modify(struct vma_iterator *vmi,
+ struct vm_area_struct *prev,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long vm_flags,
+ struct mempolicy *policy,
+ struct vm_userfaultfd_ctx uffd_ctx,
+ struct anon_vma_name *anon_name)
+ {
+ pgoff_t pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+ struct vm_area_struct *merged;
+
+ merged = vma_merge(vmi, vma->vm_mm, prev, start, end, vm_flags,
+ vma->anon_vma, vma->vm_file, pgoff, policy,
+ uffd_ctx, anon_name);
+ if (merged)
+ return merged;
+
+ if (vma->vm_start < start) {
+ int err = split_vma(vmi, vma, start, 1);
+
+ if (err)
+ return ERR_PTR(err);
+ }
+
+ if (vma->vm_end > end) {
+ int err = split_vma(vmi, vma, end, 0);
+
+ if (err)
+ return ERR_PTR(err);
+ }
+
+ return vma;
+ }
+
+ /*
+ * Attempt to merge a newly mapped VMA with those adjacent to it. The caller
+ * must ensure that [start, end) does not overlap any existing VMA.
+ */
+ static struct vm_area_struct
+ *vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgoff_t pgoff)
+ {
+ return vma_merge(vmi, vma->vm_mm, prev, start, end, vma->vm_flags,
+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+ }
+
+ /*
+ * Expand vma by delta bytes, potentially merging with an immediately adjacent
+ * VMA with identical properties.
+ */
+ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
+ struct vm_area_struct *vma,
+ unsigned long delta)
+ {
+ pgoff_t pgoff = vma->vm_pgoff + vma_pages(vma);
+
+ /* vma is specified as prev, so case 1 or 2 will apply. */
+ return vma_merge(vmi, vma->vm_mm, vma, vma->vm_end, vma->vm_end + delta,
+ vma->vm_flags, vma->anon_vma, vma->vm_file, pgoff,
+ vma_policy(vma), vma->vm_userfaultfd_ctx,
+ anon_vma_name(vma));
+ }
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
unsigned long charged = 0;
unsigned long end = addr + len;
unsigned long merge_start = addr, merge_end = end;
+ bool writable_file_mapping = false;
pgoff_t vm_pgoff;
int error;
VMA_ITERATOR(vmi, mm, addr);
vma->vm_pgoff = pgoff;
if (file) {
- if (vm_flags & VM_SHARED) {
- error = mapping_map_writable(file->f_mapping);
- if (error)
- goto free_vma;
- }
-
vma->vm_file = get_file(file);
error = call_mmap(file, vma);
if (error)
goto unmap_and_free_vma;
+ if (vma_is_shared_maywrite(vma)) {
+ error = mapping_map_writable(file->f_mapping);
+ if (error)
+ goto close_and_free_vma;
+
+ writable_file_mapping = true;
+ }
+
/*
* Expansion is handled above, merging is handled below.
* Drivers should not alter the address of the VMA.
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && prev)) {
- merge = vma_merge(&vmi, mm, prev, vma->vm_start,
- vma->vm_end, vma->vm_flags, NULL,
- vma->vm_file, vma->vm_pgoff, NULL,
- NULL_VM_UFFD_CTX, NULL);
+ merge = vma_merge_new_vma(&vmi, prev, vma,
+ vma->vm_start, vma->vm_end,
+ vma->vm_pgoff);
if (merge) {
/*
* ->mmap() can change vma->vm_file and fput
mm->map_count++;
if (vma->vm_file) {
i_mmap_lock_write(vma->vm_file->f_mapping);
- if (vma->vm_flags & VM_SHARED)
+ if (vma_is_shared_maywrite(vma))
mapping_allow_writable(vma->vm_file->f_mapping);
flush_dcache_mmap_lock(vma->vm_file->f_mapping);
/* Once vma denies write, undo our temporary denial count */
unmap_writable:
- if (file && vm_flags & VM_SHARED)
+ if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
file = vma->vm_file;
ksm_add_vma(vma);
unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
vma->vm_end, vma->vm_end, true);
}
- if (file && (vm_flags & VM_SHARED))
+ if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
free_vma:
vm_area_free(vma);
}
EXPORT_SYMBOL(vm_brk_flags);
-int vm_brk(unsigned long addr, unsigned long len)
-{
- return vm_brk_flags(addr, len, 0);
-}
-EXPORT_SYMBOL(vm_brk);
-
/* Release all mmaps. */
void exit_mmap(struct mm_struct *mm)
{
}
if (vma_link(mm, vma)) {
- vm_unacct_memory(charged);
+ if (vma->vm_flags & VM_ACCOUNT)
+ vm_unacct_memory(charged);
return -ENOMEM;
}
if (new_vma && new_vma->vm_start < addr + len)
return NULL; /* should never get here */
- new_vma = vma_merge(&vmi, mm, prev, addr, addr + len, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+ new_vma = vma_merge_new_vma(&vmi, prev, vma, addr, addr + len, pgoff);
if (new_vma) {
/*
* Source vma may have been merged into new_vma
* split a vma into two pieces at address 'addr', a new vma is allocated either
* for the first part or the tail.
*/
- int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long addr, int new_below)
+ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long addr, int new_below)
{
struct vm_area_struct *new;
struct vm_region *region;
mmap_write_unlock(mm);
}
-int vm_brk(unsigned long addr, unsigned long len)
-{
- return -ENOMEM;
-}
-
/*
* expand (or shrink) an existing mapping, potentially moving it at the same
* time (controlled by the MREMAP_MAYMOVE flag and available VM space)
}
EXPORT_SYMBOL(filemap_map_pages);
- int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
- int len, unsigned int gup_flags)
+ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags)
{
struct vm_area_struct *vma;
int write = gup_flags & FOLL_WRITE;
if (!memcg_kmem_online() || !(gfp & __GFP_ACCOUNT))
return true;
- objcg = get_obj_cgroup_from_current();
+ objcg = current_obj_cgroup();
if (!objcg)
return true;
- if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size))) {
- obj_cgroup_put(objcg);
+ if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
return false;
- }
*objcgp = objcg;
return true;
return;
if (likely(chunk && chunk->obj_cgroups)) {
+ obj_cgroup_get(objcg);
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
rcu_read_lock();
rcu_read_unlock();
} else {
obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
- obj_cgroup_put(objcg);
}
}
mutex_unlock(&pcpu_alloc_mutex);
}
+/**
+ * pcpu_alloc_size - the size of the dynamic percpu area
+ * @ptr: pointer to the dynamic percpu area
+ *
+ * Returns the size of the @ptr allocation. This is undefined for statically
+ * defined percpu variables as there is no corresponding chunk->bound_map.
+ *
+ * RETURNS:
+ * The size of the dynamic percpu area.
+ *
+ * CONTEXT:
+ * Can be called from atomic context.
+ */
+size_t pcpu_alloc_size(void __percpu *ptr)
+{
+ struct pcpu_chunk *chunk;
+ unsigned long bit_off, end;
+ void *addr;
+
+ if (!ptr)
+ return 0;
+
+ addr = __pcpu_ptr_to_addr(ptr);
+ /* No pcpu_lock here: ptr has not been freed, so chunk is still alive */
+ chunk = pcpu_chunk_addr_search(addr);
+ bit_off = (addr - chunk->base_addr) / PCPU_MIN_ALLOC_SIZE;
+ end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
+ bit_off + 1);
+ return (end - bit_off) * PCPU_MIN_ALLOC_SIZE;
+}
+
/**
* free_percpu - free percpu area
* @ptr: pointer to area to free
kmemleak_free_percpu(ptr);
addr = __pcpu_ptr_to_addr(ptr);
-
- spin_lock_irqsave(&pcpu_lock, flags);
-
chunk = pcpu_chunk_addr_search(addr);
off = addr - chunk->base_addr;
+ spin_lock_irqsave(&pcpu_lock, flags);
size = pcpu_free_area(chunk, off);
pcpu_memcg_free_hook(chunk, off, size);
#endif
static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
- struct folio **foliop, enum sgp_type sgp,
- gfp_t gfp, struct vm_area_struct *vma,
- vm_fault_t *fault_type);
+ struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
+ struct mm_struct *fault_mm, vm_fault_t *fault_type);
static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
{
/*
* ... whereas tmpfs objects are accounted incrementally as
* pages are allocated, in order to allow large sparse files.
- * shmem_get_folio reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
+ * shmem_get_folio reports shmem_acct_blocks failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
- static inline int shmem_acct_block(unsigned long flags, long pages)
+ static inline int shmem_acct_blocks(unsigned long flags, long pages)
{
if (!(flags & VM_NORESERVE))
return 0;
vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
}
- static int shmem_inode_acct_block(struct inode *inode, long pages)
+ static int shmem_inode_acct_blocks(struct inode *inode, long pages)
{
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
int err = -ENOSPC;
- if (shmem_acct_block(info->flags, pages))
+ if (shmem_acct_blocks(info->flags, pages))
return err;
might_sleep(); /* when quotas */
if (sbinfo->max_blocks) {
- if (percpu_counter_compare(&sbinfo->used_blocks,
- sbinfo->max_blocks - pages) > 0)
+ if (!percpu_counter_limited_add(&sbinfo->used_blocks,
+ sbinfo->max_blocks, pages))
goto unacct;
err = dquot_alloc_block_nodirty(inode, pages);
- if (err)
+ if (err) {
+ percpu_counter_sub(&sbinfo->used_blocks, pages);
goto unacct;
-
- percpu_counter_add(&sbinfo->used_blocks, pages);
+ }
} else {
err = dquot_alloc_block_nodirty(inode, pages);
if (err)
{
struct address_space *mapping = inode->i_mapping;
- if (shmem_inode_acct_block(inode, pages))
+ if (shmem_inode_acct_blocks(inode, pages))
return false;
/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
- * Like filemap_add_folio, but error if expected item has gone.
+ * Somewhat like filemap_add_folio, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct folio *folio,
struct address_space *mapping,
- pgoff_t index, void *expected, gfp_t gfp,
- struct mm_struct *charge_mm)
+ pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
long nr = folio_nr_pages(folio);
- int error;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
folio->mapping = mapping;
folio->index = index;
- if (!folio_test_swapcache(folio)) {
- error = mem_cgroup_charge(folio, charge_mm, gfp);
- if (error) {
- if (folio_test_pmd_mappable(folio)) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
- }
- goto error;
- }
- }
+ gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
do {
xas_store(&xas, folio);
if (xas_error(&xas))
goto unlock;
- if (folio_test_pmd_mappable(folio)) {
- count_vm_event(THP_FILE_ALLOC);
+ if (folio_test_pmd_mappable(folio))
__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr);
- }
- mapping->nrpages += nr;
__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
__lruvec_stat_mod_folio(folio, NR_SHMEM, nr);
+ mapping->nrpages += nr;
unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp));
if (xas_error(&xas)) {
- error = xas_error(&xas);
- goto error;
+ folio->mapping = NULL;
+ folio_ref_sub(folio, nr);
+ return xas_error(&xas);
}
return 0;
- error:
- folio->mapping = NULL;
- folio_ref_sub(folio, nr);
- return error;
}
/*
- * Like delete_from_page_cache, but substitutes swap for @folio.
+ * Somewhat like filemap_remove_folio, but substitutes swap for @folio.
*/
static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
{
cond_resched_rcu();
}
}
-
rcu_read_unlock();
return swapped << PAGE_SHIFT;
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
shmem_undo_range(inode, lstart, lend, false);
- inode->i_mtime = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
inode_inc_iversion(inode);
}
EXPORT_SYMBOL_GPL(shmem_truncate_range);
if (i_uid_needs_update(idmap, attr, inode) ||
i_gid_needs_update(idmap, attr, inode)) {
error = dquot_transfer(idmap, inode, attr);
-
if (error)
return error;
}
if (!error && update_ctime) {
inode_set_ctime_current(inode);
if (update_mtime)
- inode->i_mtime = inode_get_ctime(inode);
+ inode_set_mtime_to_ts(inode, inode_get_ctime(inode));
inode_inc_iversion(inode);
}
return error;
if (!xa_is_value(folio))
continue;
- error = shmem_swapin_folio(inode, indices[i],
- &folio, SGP_CACHE,
- mapping_gfp_mask(mapping),
- NULL, NULL);
+ error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE,
+ mapping_gfp_mask(mapping), NULL, NULL);
if (error == 0) {
folio_unlock(folio);
folio_put(folio);
return NULL;
}
#endif /* CONFIG_NUMA && CONFIG_TMPFS */
- #ifndef CONFIG_NUMA
- #define vm_policy vm_private_data
- #endif
- static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
- struct shmem_inode_info *info, pgoff_t index)
- {
- /* Create a pseudo vma that just contains the policy */
- vma_init(vma, NULL);
- /* Bias interleave by inode number to distribute better across nodes */
- vma->vm_pgoff = index + info->vfs_inode.i_ino;
- vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
- }
-
- static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
- {
- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(vma->vm_policy);
- }
+ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
+ pgoff_t index, unsigned int order, pgoff_t *ilx);
- static struct folio *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
- struct vm_area_struct pvma;
+ struct mempolicy *mpol;
+ pgoff_t ilx;
struct page *page;
- struct vm_fault vmf = {
- .vma = &pvma,
- };
- shmem_pseudo_vma_init(&pvma, info, index);
- page = swap_cluster_readahead(swap, gfp, &vmf);
- shmem_pseudo_vma_destroy(&pvma);
+ mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+ page = swap_cluster_readahead(swap, gfp, mpol, ilx);
+ mpol_cond_put(mpol);
if (!page)
return NULL;
static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
- struct vm_area_struct pvma;
- struct address_space *mapping = info->vfs_inode.i_mapping;
- pgoff_t hindex;
- struct folio *folio;
+ struct mempolicy *mpol;
+ pgoff_t ilx;
+ struct page *page;
- hindex = round_down(index, HPAGE_PMD_NR);
- if (xa_find(&mapping->i_pages, &hindex, hindex + HPAGE_PMD_NR - 1,
- XA_PRESENT))
- return NULL;
+ mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx);
+ page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id());
+ mpol_cond_put(mpol);
- shmem_pseudo_vma_init(&pvma, info, hindex);
- folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, &pvma, 0, true);
- shmem_pseudo_vma_destroy(&pvma);
- if (!folio)
- count_vm_event(THP_FILE_FALLBACK);
- return folio;
+ return page_rmappable_folio(page);
}
static struct folio *shmem_alloc_folio(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+ struct shmem_inode_info *info, pgoff_t index)
{
- struct vm_area_struct pvma;
- struct folio *folio;
+ struct mempolicy *mpol;
+ pgoff_t ilx;
+ struct page *page;
- shmem_pseudo_vma_init(&pvma, info, index);
- folio = vma_alloc_folio(gfp, 0, &pvma, 0, false);
- shmem_pseudo_vma_destroy(&pvma);
+ mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+ page = alloc_pages_mpol(gfp, 0, mpol, ilx, numa_node_id());
+ mpol_cond_put(mpol);
- return folio;
+ return (struct folio *)page;
}
- static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode,
- pgoff_t index, bool huge)
+ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
+ struct inode *inode, pgoff_t index,
+ struct mm_struct *fault_mm, bool huge)
{
+ struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
struct folio *folio;
- int nr;
- int err;
+ long pages;
+ int error;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
huge = false;
- nr = huge ? HPAGE_PMD_NR : 1;
- err = shmem_inode_acct_block(inode, nr);
- if (err)
- goto failed;
+ if (huge) {
+ pages = HPAGE_PMD_NR;
+ index = round_down(index, HPAGE_PMD_NR);
+
+ /*
+ * Check for conflict before waiting on a huge allocation.
+ * Conflict might be that a huge page has just been allocated
+ * and added to page cache by a racing thread, or that there
+ * is already at least one small page in the huge extent.
+ * Be careful to retry when appropriate, but not forever!
+ * Elsewhere -EEXIST would be the right code, but not here.
+ */
+ if (xa_find(&mapping->i_pages, &index,
+ index + HPAGE_PMD_NR - 1, XA_PRESENT))
+ return ERR_PTR(-E2BIG);
- if (huge)
folio = shmem_alloc_hugefolio(gfp, info, index);
- else
+ if (!folio)
+ count_vm_event(THP_FILE_FALLBACK);
+ } else {
+ pages = 1;
folio = shmem_alloc_folio(gfp, info, index);
- if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
- return folio;
}
+ if (!folio)
+ return ERR_PTR(-ENOMEM);
- err = -ENOMEM;
- shmem_inode_unacct_blocks(inode, nr);
- failed:
- return ERR_PTR(err);
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+
+ gfp &= GFP_RECLAIM_MASK;
+ error = mem_cgroup_charge(folio, fault_mm, gfp);
+ if (error) {
+ if (xa_find(&mapping->i_pages, &index,
+ index + pages - 1, XA_PRESENT)) {
+ error = -EEXIST;
+ } else if (huge) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+ goto unlock;
+ }
+
+ error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
+ if (error)
+ goto unlock;
+
+ error = shmem_inode_acct_blocks(inode, pages);
+ if (error) {
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ long freed;
+ /*
+ * Try to reclaim some space by splitting a few
+ * large folios beyond i_size on the filesystem.
+ */
+ shmem_unused_huge_shrink(sbinfo, NULL, 2);
+ /*
+ * And do a shmem_recalc_inode() to account for freed pages:
+ * except our folio is there in cache, so not quite balanced.
+ */
+ spin_lock(&info->lock);
+ freed = pages + info->alloced - info->swapped -
+ READ_ONCE(mapping->nrpages);
+ if (freed > 0)
+ info->alloced -= freed;
+ spin_unlock(&info->lock);
+ if (freed > 0)
+ shmem_inode_unacct_blocks(inode, freed);
+ error = shmem_inode_acct_blocks(inode, pages);
+ if (error) {
+ filemap_remove_folio(folio);
+ goto unlock;
+ }
+ }
+
+ shmem_recalc_inode(inode, pages, 0);
+ folio_add_lru(folio);
+ return folio;
+
+ unlock:
+ folio_unlock(folio);
+ folio_put(folio);
+ return ERR_PTR(error);
}
/*
*/
static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio **foliop, enum sgp_type sgp,
- gfp_t gfp, struct vm_area_struct *vma,
+ gfp_t gfp, struct mm_struct *fault_mm,
vm_fault_t *fault_type)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
struct swap_info_struct *si;
struct folio *folio = NULL;
swp_entry_t swap;
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(charge_mm, PGMAJFAULT);
+ count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
/* Here we actually start the io */
- folio = shmem_swapin(swap, gfp, info, index);
+ folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
goto failed;
}
error = shmem_add_to_page_cache(folio, mapping, index,
- swp_to_radix_entry(swap), gfp,
- charge_mm);
+ swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
* vm. If we swap it in we mark it dirty since we also free the swap
* entry since a page cannot live in both the swap and page cache.
*
- * vma, vmf, and fault_type are only supplied by shmem_fault:
- * otherwise they are NULL.
+ * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL.
*/
static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
- struct vm_area_struct *vma, struct vm_fault *vmf,
- vm_fault_t *fault_type)
+ struct vm_fault *vmf, vm_fault_t *fault_type)
{
- struct address_space *mapping = inode->i_mapping;
- struct shmem_inode_info *info = SHMEM_I(inode);
- struct shmem_sb_info *sbinfo;
- struct mm_struct *charge_mm;
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ struct mm_struct *fault_mm;
struct folio *folio;
- pgoff_t hindex;
- gfp_t huge_gfp;
int error;
- int once = 0;
- int alloced = 0;
+ bool alloced;
if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
return -EFBIG;
repeat:
if (sgp <= SGP_CACHE &&
- ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
+ ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode))
return -EINVAL;
- }
- sbinfo = SHMEM_SB(inode->i_sb);
- charge_mm = vma ? vma->vm_mm : NULL;
+ alloced = false;
+ fault_mm = vma ? vma->vm_mm : NULL;
- folio = filemap_get_entry(mapping, index);
+ folio = filemap_get_entry(inode->i_mapping, index);
if (folio && vma && userfaultfd_minor(vma)) {
if (!xa_is_value(folio))
folio_put(folio);
if (xa_is_value(folio)) {
error = shmem_swapin_folio(inode, index, &folio,
- sgp, gfp, vma, fault_type);
+ sgp, gfp, fault_mm, fault_type);
if (error == -EEXIST)
goto repeat;
folio_lock(folio);
/* Has the folio been truncated or swapped out? */
- if (unlikely(folio->mapping != mapping)) {
+ if (unlikely(folio->mapping != inode->i_mapping)) {
folio_unlock(folio);
folio_put(folio);
goto repeat;
return 0;
}
- if (!shmem_is_huge(inode, index, false,
- vma ? vma->vm_mm : NULL, vma ? vma->vm_flags : 0))
- goto alloc_nohuge;
+ if (shmem_is_huge(inode, index, false, fault_mm,
+ vma ? vma->vm_flags : 0)) {
+ gfp_t huge_gfp;
- huge_gfp = vma_thp_gfp_mask(vma);
- huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true);
- if (IS_ERR(folio)) {
- alloc_nohuge:
- folio = shmem_alloc_and_acct_folio(gfp, inode, index, false);
+ huge_gfp = vma_thp_gfp_mask(vma);
+ huge_gfp = limit_gfp_mask(huge_gfp, gfp);
+ folio = shmem_alloc_and_add_folio(huge_gfp,
+ inode, index, fault_mm, true);
+ if (!IS_ERR(folio)) {
+ count_vm_event(THP_FILE_ALLOC);
+ goto alloced;
+ }
+ if (PTR_ERR(folio) == -EEXIST)
+ goto repeat;
}
- if (IS_ERR(folio)) {
- int retry = 5;
+ folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+ if (IS_ERR(folio)) {
error = PTR_ERR(folio);
+ if (error == -EEXIST)
+ goto repeat;
folio = NULL;
- if (error != -ENOSPC)
- goto unlock;
- /*
- * Try to reclaim some space by splitting a large folio
- * beyond i_size on the filesystem.
- */
- while (retry--) {
- int ret;
-
- ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
- if (ret == SHRINK_STOP)
- break;
- if (ret)
- goto alloc_nohuge;
- }
goto unlock;
}
- hindex = round_down(index, folio_nr_pages(folio));
-
- if (sgp == SGP_WRITE)
- __folio_set_referenced(folio);
-
- error = shmem_add_to_page_cache(folio, mapping, hindex,
- NULL, gfp & GFP_RECLAIM_MASK,
- charge_mm);
- if (error)
- goto unacct;
-
- folio_add_lru(folio);
- shmem_recalc_inode(inode, folio_nr_pages(folio), 0);
+ alloced:
alloced = true;
-
if (folio_test_pmd_mappable(folio) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
folio_next_index(folio) - 1) {
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ struct shmem_inode_info *info = SHMEM_I(inode);
/*
* Part of the large folio is beyond i_size: subject
* to shrink under memory pressure.
spin_unlock(&sbinfo->shrinklist_lock);
}
+ if (sgp == SGP_WRITE)
+ folio_set_referenced(folio);
/*
* Let SGP_FALLOC use the SGP_WRITE optimization on a new folio.
*/
/* Perhaps the file has been truncated since we checked */
if (sgp <= SGP_CACHE &&
((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
- if (alloced) {
- folio_clear_dirty(folio);
- filemap_remove_folio(folio);
- shmem_recalc_inode(inode, 0, 0);
- }
error = -EINVAL;
goto unlock;
}
/*
* Error recovery.
*/
- unacct:
- shmem_inode_unacct_blocks(inode, folio_nr_pages(folio));
-
- if (folio_test_large(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- goto alloc_nohuge;
- }
unlock:
+ if (alloced)
+ filemap_remove_folio(folio);
+ shmem_recalc_inode(inode, 0, 0);
if (folio) {
folio_unlock(folio);
folio_put(folio);
}
- if (error == -ENOSPC && !once++) {
- shmem_recalc_inode(inode, 0, 0);
- goto repeat;
- }
- if (error == -EEXIST)
- goto repeat;
return error;
}
enum sgp_type sgp)
{
return shmem_get_folio_gfp(inode, index, foliop, sgp,
- mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
+ mapping_gfp_mask(inode->i_mapping), NULL, NULL);
}
/*
* entry unconditionally - even if something else had already woken the
* target.
*/
- static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+ static int synchronous_wake_function(wait_queue_entry_t *wait,
+ unsigned int mode, int sync, void *key)
{
int ret = default_wake_function(wait, mode, sync, key);
list_del_init(&wait->entry);
return ret;
}
+ /*
+ * Trinity finds that probing a hole which tmpfs is punching can
+ * prevent the hole-punch from ever completing: which in turn
+ * locks writers out with its hold on i_rwsem. So refrain from
+ * faulting pages into the hole while it's being punched. Although
+ * shmem_undo_range() does remove the additions, it may be unable to
+ * keep up, as each new page needs its own unmap_mapping_range() call,
+ * and the i_mmap tree grows ever slower to scan if new vmas are added.
+ *
+ * It does not matter if we sometimes reach this check just before the
+ * hole-punch begins, so that one fault then races with the punch:
+ * we just need to make racing faults a rare case.
+ *
+ * The implementation below would be much simpler if we just used a
+ * standard mutex or completion: but we cannot take i_rwsem in fault,
+ * and bloating every shmem inode for this unlikely case would be sad.
+ */
+ static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
+ {
+ struct shmem_falloc *shmem_falloc;
+ struct file *fpin = NULL;
+ vm_fault_t ret = 0;
+
+ spin_lock(&inode->i_lock);
+ shmem_falloc = inode->i_private;
+ if (shmem_falloc &&
+ shmem_falloc->waitq &&
+ vmf->pgoff >= shmem_falloc->start &&
+ vmf->pgoff < shmem_falloc->next) {
+ wait_queue_head_t *shmem_falloc_waitq;
+ DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
+
+ ret = VM_FAULT_NOPAGE;
+ fpin = maybe_unlock_mmap_for_io(vmf, NULL);
+ shmem_falloc_waitq = shmem_falloc->waitq;
+ prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode->i_lock);
+ schedule();
+
+ /*
+ * shmem_falloc_waitq points into the shmem_fallocate()
+ * stack of the hole-punching task: shmem_falloc_waitq
+ * is usually invalid by the time we reach here, but
+ * finish_wait() does not dereference it in that case;
+ * though i_lock needed lest racing with wake_up_all().
+ */
+ spin_lock(&inode->i_lock);
+ finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
+ }
+ spin_unlock(&inode->i_lock);
+ if (fpin) {
+ fput(fpin);
+ ret = VM_FAULT_RETRY;
+ }
+ return ret;
+ }
+
static vm_fault_t shmem_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
- struct inode *inode = file_inode(vma->vm_file);
+ struct inode *inode = file_inode(vmf->vma->vm_file);
gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
struct folio *folio = NULL;
+ vm_fault_t ret = 0;
int err;
- vm_fault_t ret = VM_FAULT_LOCKED;
/*
* Trinity finds that probing a hole which tmpfs is punching can
- * prevent the hole-punch from ever completing: which in turn
- * locks writers out with its hold on i_rwsem. So refrain from
- * faulting pages into the hole while it's being punched. Although
- * shmem_undo_range() does remove the additions, it may be unable to
- * keep up, as each new page needs its own unmap_mapping_range() call,
- * and the i_mmap tree grows ever slower to scan if new vmas are added.
- *
- * It does not matter if we sometimes reach this check just before the
- * hole-punch begins, so that one fault then races with the punch:
- * we just need to make racing faults a rare case.
- *
- * The implementation below would be much simpler if we just used a
- * standard mutex or completion: but we cannot take i_rwsem in fault,
- * and bloating every shmem inode for this unlikely case would be sad.
+ * prevent the hole-punch from ever completing: noted in i_private.
*/
if (unlikely(inode->i_private)) {
- struct shmem_falloc *shmem_falloc;
-
- spin_lock(&inode->i_lock);
- shmem_falloc = inode->i_private;
- if (shmem_falloc &&
- shmem_falloc->waitq &&
- vmf->pgoff >= shmem_falloc->start &&
- vmf->pgoff < shmem_falloc->next) {
- struct file *fpin;
- wait_queue_head_t *shmem_falloc_waitq;
- DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
-
- ret = VM_FAULT_NOPAGE;
- fpin = maybe_unlock_mmap_for_io(vmf, NULL);
- if (fpin)
- ret = VM_FAULT_RETRY;
-
- shmem_falloc_waitq = shmem_falloc->waitq;
- prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
- TASK_UNINTERRUPTIBLE);
- spin_unlock(&inode->i_lock);
- schedule();
-
- /*
- * shmem_falloc_waitq points into the shmem_fallocate()
- * stack of the hole-punching task: shmem_falloc_waitq
- * is usually invalid by the time we reach here, but
- * finish_wait() does not dereference it in that case;
- * though i_lock needed lest racing with wake_up_all().
- */
- spin_lock(&inode->i_lock);
- finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
- spin_unlock(&inode->i_lock);
-
- if (fpin)
- fput(fpin);
+ ret = shmem_falloc_wait(vmf, inode);
+ if (ret)
return ret;
- }
- spin_unlock(&inode->i_lock);
}
+ WARN_ON_ONCE(vmf->page != NULL);
err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
- gfp, vma, vmf, &ret);
+ gfp, vmf, &ret);
if (err)
return vmf_error(err);
- if (folio)
+ if (folio) {
vmf->page = folio_file_page(folio, vmf->pgoff);
+ ret |= VM_FAULT_LOCKED;
+ }
return ret;
}
}
static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, pgoff_t *ilx)
{
struct inode *inode = file_inode(vma->vm_file);
pgoff_t index;
+ /*
+ * Bias interleave by inode number to distribute better across nodes;
+ * but this interface is independent of which page order is used, so
+ * supplies only that bias, letting caller apply the offset (adjusted
+ * by page order, as in shmem_get_pgoff_policy() and get_vma_policy()).
+ */
+ *ilx = inode->i_ino;
index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
}
- #endif
+
+ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
+ pgoff_t index, unsigned int order, pgoff_t *ilx)
+ {
+ struct mempolicy *mpol;
+
+ /* Bias interleave by inode number to distribute better across nodes */
+ *ilx = info->vfs_inode.i_ino + (index >> order);
+
+ mpol = mpol_shared_policy_lookup(&info->policy, index);
+ return mpol ? mpol : get_task_policy(current);
+ }
+ #else
+ static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
+ pgoff_t index, unsigned int order, pgoff_t *ilx)
+ {
+ *ilx = 0;
+ return NULL;
+ }
+ #endif /* CONFIG_NUMA */
int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
{
struct shmem_inode_info *info = SHMEM_I(inode);
int ret;
- ret = seal_check_future_write(info->seals, vma);
+ ret = seal_check_write(info->seals, vma);
if (ret)
return ret;
if (err)
return ERR_PTR(err);
-
inode = new_inode(sb);
if (!inode) {
shmem_free_inode(sb, 0);
inode->i_ino = ino;
inode_init_owner(idmap, inode, dir, mode);
inode->i_blocks = 0;
- inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
+ simple_inode_init_ts(inode);
inode->i_generation = get_random_u32();
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
atomic_set(&info->stop_eviction, 0);
info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
- info->i_crtime = inode->i_mtime;
+ info->i_crtime = inode_get_mtime(inode);
info->fsflags = (dir == NULL) ? 0 :
SHMEM_I(dir)->fsflags & SHMEM_FL_INHERITED;
if (info->fsflags)
shmem_set_inode_flags(inode, info->fsflags);
INIT_LIST_HEAD(&info->shrinklist);
INIT_LIST_HEAD(&info->swaplist);
- INIT_LIST_HEAD(&info->swaplist);
- if (sbinfo->noswap)
- mapping_set_unevictable(inode->i_mapping);
simple_xattrs_init(&info->xattrs);
cache_no_acl(inode);
+ if (sbinfo->noswap)
+ mapping_set_unevictable(inode->i_mapping);
mapping_set_large_folios(inode->i_mapping);
switch (mode & S_IFMT) {
int ret;
pgoff_t max_off;
- if (shmem_inode_acct_block(inode, 1)) {
+ if (shmem_inode_acct_blocks(inode, 1)) {
/*
* We may have got a page, returned -ENOENT triggering a retry,
* and now we find ourselves with -ENOMEM. Release the page, to
if (unlikely(pgoff >= max_off))
goto out_release;
- ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL,
- gfp & GFP_RECLAIM_MASK, dst_vma->vm_mm);
+ ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
+ if (ret)
+ goto out_release;
+ ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
if (ret)
goto out_release;
}
ret = shmem_get_folio(inode, index, &folio, SGP_WRITE);
-
if (ret)
return ret;
error = simple_acl_create(dir, inode);
if (error)
goto out_iput;
- error = security_inode_init_security(inode, dir,
- &dentry->d_name,
+ error = security_inode_init_security(inode, dir, &dentry->d_name,
shmem_initxattrs, NULL);
if (error && error != -EOPNOTSUPP)
goto out_iput;
goto out_iput;
dir->i_size += BOGO_DIRENT_SIZE;
- dir->i_mtime = inode_set_ctime_current(dir);
+ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
inode_inc_iversion(dir);
d_instantiate(dentry, inode);
dget(dentry); /* Extra count - pin the dentry in core */
int error;
inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, 0, VM_NORESERVE);
-
if (IS_ERR(inode)) {
error = PTR_ERR(inode);
goto err_out;
}
-
- error = security_inode_init_security(inode, dir,
- NULL,
+ error = security_inode_init_security(inode, dir, NULL,
shmem_initxattrs, NULL);
if (error && error != -EOPNOTSUPP)
goto out_iput;
/*
* Link a file..
*/
- static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
+ static int shmem_link(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *dentry)
{
struct inode *inode = d_inode(old_dentry);
int ret = 0;
}
dir->i_size += BOGO_DIRENT_SIZE;
- dir->i_mtime = inode_set_ctime_to_ts(dir,
- inode_set_ctime_current(inode));
+ inode_set_mtime_to_ts(dir,
+ inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode)));
inode_inc_iversion(dir);
inc_nlink(inode);
ihold(inode); /* New dentry reference */
- dget(dentry); /* Extra pinning count for the created dentry */
+ dget(dentry); /* Extra pinning count for the created dentry */
d_instantiate(dentry, inode);
out:
return ret;
simple_offset_remove(shmem_get_offset_ctx(dir), dentry);
dir->i_size -= BOGO_DIRENT_SIZE;
- dir->i_mtime = inode_set_ctime_to_ts(dir,
- inode_set_ctime_current(inode));
+ inode_set_mtime_to_ts(dir,
+ inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode)));
inode_inc_iversion(dir);
drop_nlink(inode);
- dput(dentry); /* Undo the count from "create" - this does all the work */
+ dput(dentry); /* Undo the count from "create" - does all the work */
return 0;
}
inode = shmem_get_inode(idmap, dir->i_sb, dir, S_IFLNK | 0777, 0,
VM_NORESERVE);
-
if (IS_ERR(inode))
return PTR_ERR(inode);
folio_put(folio);
}
dir->i_size += BOGO_DIRENT_SIZE;
- dir->i_mtime = inode_set_ctime_current(dir);
+ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
inode_inc_iversion(dir);
d_instantiate(dentry, inode);
dget(dentry);
folio_put(arg);
}
- static const char *shmem_get_link(struct dentry *dentry,
- struct inode *inode,
+ static const char *shmem_get_link(struct dentry *dentry, struct inode *inode,
struct delayed_call *done)
{
struct folio *folio = NULL;
* Callback for security_inode_init_security() for acquiring xattrs.
*/
static int shmem_initxattrs(struct inode *inode,
- const struct xattr *xattr_array,
- void *fs_info)
+ const struct xattr *xattr_array, void *fs_info)
{
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
.set = shmem_xattr_handler_set,
};
-static const struct xattr_handler *shmem_xattr_handlers[] = {
+static const struct xattr_handler * const shmem_xattr_handlers[] = {
&shmem_security_xattr_handler,
&shmem_trusted_xattr_handler,
&shmem_user_xattr_handler,
return alias ?: d_find_any_alias(inode);
}
-
static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
struct fid *fid, int fh_len, int fh_type)
{
}
#endif /* CONFIG_TMPFS_QUOTA */
- inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, S_IFDIR | sbinfo->mode, 0,
- VM_NORESERVE);
+ inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL,
+ S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
if (IS_ERR(inode)) {
error = PTR_ERR(inode);
goto failed;
.parameters = shmem_fs_parameters,
#endif
.kill_sb = kill_litter_super,
- #ifdef CONFIG_SHMEM
.fs_flags = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
- #else
- .fs_flags = FS_USERNS_MOUNT,
- #endif
};
void __init shmem_init(void)
for (i = 0; i < ARRAY_SIZE(values); i++) {
len += sysfs_emit_at(buf, len,
- shmem_huge == values[i] ? "%s[%s]" : "%s%s",
- i ? " " : "",
- shmem_format_huge(values[i]));
+ shmem_huge == values[i] ? "%s[%s]" : "%s%s",
+ i ? " " : "", shmem_format_huge(values[i]));
}
-
len += sysfs_emit_at(buf, len, "\n");
return len;
#define shmem_acct_size(flags, size) 0
#define shmem_unacct_size(flags, size) do {} while (0)
- static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir,
- umode_t mode, dev_t dev, unsigned long flags)
+ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
+ struct super_block *sb, struct inode *dir,
+ umode_t mode, dev_t dev, unsigned long flags)
{
struct inode *inode = ramfs_get_inode(sb, dir, mode, dev);
return inode ? inode : ERR_PTR(-ENOSPC);
/* common code */
- static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, loff_t size,
- unsigned long flags, unsigned int i_flags)
+ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name,
+ loff_t size, unsigned long flags, unsigned int i_flags)
{
struct inode *inode;
struct file *res;
inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL,
S_IFREG | S_IRWXUGO, 0, flags);
-
if (IS_ERR(inode)) {
shmem_unacct_size(flags, size);
return ERR_CAST(inode);
BUG_ON(!shmem_mapping(mapping));
error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE,
- gfp, NULL, NULL, NULL);
+ gfp, NULL, NULL);
if (error)
return ERR_PTR(error);
cond_resched();
}
}
+ EXPORT_SYMBOL(folio_copy);
int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
int sysctl_overcommit_ratio __read_mostly = 50;
{
const char *type;
- if (kmem_valid_obj(object)) {
- kmem_dump_obj(object);
+ if (kmem_dump_obj(object))
return;
- }
if (vmalloc_dump_obj(object))
return;
* @task: controlling RPC task
* @xdr: xdr_stream containing RPC Reply header
*
- * On success, @xdr is updated to point past the verifier and
- * zero is returned. Otherwise, @xdr is in an undefined state
- * and a negative errno is returned.
+ * Return values:
+ * %0: Verifier is valid. @xdr now points past the verifier.
+ * %-EIO: Verifier is corrupted or message ended early.
+ * %-EACCES: Verifier is intact but not valid.
+ * %-EPROTONOSUPPORT: Server does not support the requested auth type.
+ *
+ * When a negative errno is returned, @xdr is left in an undefined
+ * state.
*/
int
rpcauth_checkverf(struct rpc_task *task, struct xdr_stream *xdr)
test_bit(RPCAUTH_CRED_UPTODATE, &cred->cr_flags) != 0;
}
- static struct shrinker rpc_cred_shrinker = {
- .count_objects = rpcauth_cache_shrink_count,
- .scan_objects = rpcauth_cache_shrink_scan,
- .seeks = DEFAULT_SEEKS,
- };
+ static struct shrinker *rpc_cred_shrinker;
int __init rpcauth_init_module(void)
{
err = rpc_init_authunix();
if (err < 0)
goto out1;
- err = register_shrinker(&rpc_cred_shrinker, "sunrpc_cred");
- if (err < 0)
+ rpc_cred_shrinker = shrinker_alloc(0, "sunrpc_cred");
+ if (!rpc_cred_shrinker) {
+ err = -ENOMEM;
goto out2;
+ }
+
+ rpc_cred_shrinker->count_objects = rpcauth_cache_shrink_count;
+ rpc_cred_shrinker->scan_objects = rpcauth_cache_shrink_scan;
+
+ shrinker_register(rpc_cred_shrinker);
+
return 0;
out2:
rpc_destroy_authunix();
void rpcauth_remove_module(void)
{
rpc_destroy_authunix();
- unregister_shrinker(&rpc_cred_shrinker);
+ shrinker_free(rpc_cred_shrinker);
}
#include <inttypes.h>
#include <linux/types.h>
#include <linux/sched.h>
+#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
return 0;
}
-static void test_clone3(uint64_t flags, size_t size, int expected,
- enum test_mode test_mode)
+static bool test_clone3(uint64_t flags, size_t size, int expected,
+ enum test_mode test_mode)
{
int ret;
ret = call_clone3(flags, size, test_mode);
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
getpid(), ret, expected);
- if (ret != expected)
- ksft_test_result_fail(
+ if (ret != expected) {
+ ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
getpid(), ret, expected);
- else
- ksft_test_result_pass(
- "[%d] Result (%d) matches expectation (%d)\n",
- getpid(), ret, expected);
-}
-
-int main(int argc, char *argv[])
-{
- uid_t uid = getuid();
-
- ksft_print_header();
- ksft_set_plan(19);
- test_clone3_supported();
-
- /* Just a simple clone3() should return 0.*/
- test_clone3(0, 0, 0, CLONE3_ARGS_NO_TEST);
-
- /* Do a clone3() in a new PID NS.*/
- if (uid == 0)
- test_clone3(CLONE_NEWPID, 0, 0, CLONE3_ARGS_NO_TEST);
- else
- ksft_test_result_skip("Skipping clone3() with CLONE_NEWPID\n");
+ return false;
+ }
- /* Do a clone3() with CLONE_ARGS_SIZE_VER0. */
- test_clone3(0, CLONE_ARGS_SIZE_VER0, 0, CLONE3_ARGS_NO_TEST);
+ return true;
+}
- /* Do a clone3() with CLONE_ARGS_SIZE_VER0 - 8 */
- test_clone3(0, CLONE_ARGS_SIZE_VER0 - 8, -EINVAL, CLONE3_ARGS_NO_TEST);
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
- /* Do a clone3() with sizeof(struct clone_args) + 8 */
- test_clone3(0, sizeof(struct __clone_args) + 8, 0, CLONE3_ARGS_NO_TEST);
+static bool not_root(void)
+{
+ if (getuid() != 0) {
+ ksft_print_msg("Not running as root\n");
+ return true;
+ }
- /* Do a clone3() with exit_signal having highest 32 bits non-zero */
- test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_BIG);
+ return false;
+}
- /* Do a clone3() with negative 32-bit exit_signal */
- test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG);
++static bool no_timenamespace(void)
++{
++ if (not_root())
++ return true;
+
- /* Do a clone3() with exit_signal not fitting into CSIGNAL mask */
- test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG);
++ if (!access("/proc/self/ns/time", F_OK))
++ return false;
+
- /* Do a clone3() with NSIG < exit_signal < CSIG */
- test_clone3(0, 0, -EINVAL, CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG);
++ ksft_print_msg("Time namespaces are not supported\n");
++ return true;
++}
+
- test_clone3(0, sizeof(struct __clone_args) + 8, 0, CLONE3_ARGS_ALL_0);
+static size_t page_size_plus_8(void)
+{
+ return getpagesize() + 8;
+}
- test_clone3(0, sizeof(struct __clone_args) + 16, -E2BIG,
- CLONE3_ARGS_ALL_0);
+struct test {
+ const char *name;
+ uint64_t flags;
+ size_t size;
+ size_function size_function;
+ int expected;
+ enum test_mode test_mode;
+ filter_function filter;
+};
- test_clone3(0, sizeof(struct __clone_args) * 2, -E2BIG,
- CLONE3_ARGS_ALL_0);
+static const struct test tests[] = {
+ {
+ .name = "simple clone3()",
+ .flags = 0,
+ .size = 0,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "clone3() in a new PID_NS",
+ .flags = CLONE_NEWPID,
+ .size = 0,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ .filter = not_root,
+ },
+ {
+ .name = "CLONE_ARGS_SIZE_VER0",
+ .flags = 0,
+ .size = CLONE_ARGS_SIZE_VER0,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "CLONE_ARGS_SIZE_VER0 - 8",
+ .flags = 0,
+ .size = CLONE_ARGS_SIZE_VER0 - 8,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "sizeof(struct clone_args) + 8",
+ .flags = 0,
+ .size = sizeof(struct __clone_args) + 8,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "exit_signal with highest 32 bits non-zero",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_BIG,
+ },
+ {
+ .name = "negative 32-bit exit_signal",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
+ },
+ {
+ .name = "exit_signal not fitting into CSIGNAL mask",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
+ },
+ {
+ .name = "NSIG < exit_signal < CSIG",
+ .flags = 0,
+ .size = 0,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+ },
+ {
+ .name = "Arguments sizeof(struct clone_args) + 8",
+ .flags = 0,
+ .size = sizeof(struct __clone_args) + 8,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_ALL_0,
+ },
+ {
+ .name = "Arguments sizeof(struct clone_args) + 16",
+ .flags = 0,
+ .size = sizeof(struct __clone_args) + 16,
+ .expected = -E2BIG,
+ .test_mode = CLONE3_ARGS_ALL_0,
+ },
+ {
+ .name = "Arguments sizeof(struct clone_arg) * 2",
+ .flags = 0,
+ .size = sizeof(struct __clone_args) + 16,
+ .expected = -E2BIG,
+ .test_mode = CLONE3_ARGS_ALL_0,
+ },
+ {
+ .name = "Arguments > page size",
+ .flags = 0,
+ .size_function = page_size_plus_8,
+ .expected = -E2BIG,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "CLONE_ARGS_SIZE_VER0 in a new PID NS",
+ .flags = CLONE_NEWPID,
+ .size = CLONE_ARGS_SIZE_VER0,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ .filter = not_root,
+ },
+ {
+ .name = "CLONE_ARGS_SIZE_VER0 - 8 in a new PID NS",
+ .flags = CLONE_NEWPID,
+ .size = CLONE_ARGS_SIZE_VER0 - 8,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "sizeof(struct clone_args) + 8 in a new PID NS",
+ .flags = CLONE_NEWPID,
+ .size = sizeof(struct __clone_args) + 8,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ .filter = not_root,
+ },
+ {
+ .name = "Arguments > page size in a new PID NS",
+ .flags = CLONE_NEWPID,
+ .size_function = page_size_plus_8,
+ .expected = -E2BIG,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+ {
+ .name = "New time NS",
+ .flags = CLONE_NEWTIME,
+ .size = 0,
+ .expected = 0,
+ .test_mode = CLONE3_ARGS_NO_TEST,
++ .filter = no_timenamespace,
+ },
+ {
+ .name = "exit signal (SIGCHLD) in flags",
+ .flags = SIGCHLD,
+ .size = 0,
+ .expected = -EINVAL,
+ .test_mode = CLONE3_ARGS_NO_TEST,
+ },
+};
- /* Do a clone3() with > page size */
- test_clone3(0, getpagesize() + 8, -E2BIG, CLONE3_ARGS_NO_TEST);
+int main(int argc, char *argv[])
+{
+ size_t size;
+ int i;
- /* Do a clone3() with CLONE_ARGS_SIZE_VER0 in a new PID NS. */
- if (uid == 0)
- test_clone3(CLONE_NEWPID, CLONE_ARGS_SIZE_VER0, 0,
- CLONE3_ARGS_NO_TEST);
- else
- ksft_test_result_skip("Skipping clone3() with CLONE_NEWPID\n");
+ ksft_print_header();
+ ksft_set_plan(ARRAY_SIZE(tests));
+ test_clone3_supported();
- /* Do a clone3() with CLONE_ARGS_SIZE_VER0 - 8 in a new PID NS */
- test_clone3(CLONE_NEWPID, CLONE_ARGS_SIZE_VER0 - 8, -EINVAL,
- CLONE3_ARGS_NO_TEST);
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ if (tests[i].filter && tests[i].filter()) {
+ ksft_test_result_skip("%s\n", tests[i].name);
+ continue;
+ }
- /* Do a clone3() with sizeof(struct clone_args) + 8 in a new PID NS */
- if (uid == 0)
- test_clone3(CLONE_NEWPID, sizeof(struct __clone_args) + 8, 0,
- CLONE3_ARGS_NO_TEST);
- else
- ksft_test_result_skip("Skipping clone3() with CLONE_NEWPID\n");
+ if (tests[i].size_function)
+ size = tests[i].size_function();
+ else
+ size = tests[i].size;
- /* Do a clone3() with > page size in a new PID NS */
- test_clone3(CLONE_NEWPID, getpagesize() + 8, -E2BIG,
- CLONE3_ARGS_NO_TEST);
+ ksft_print_msg("Running test '%s'\n", tests[i].name);
- /* Do a clone3() in a new time namespace */
- if (access("/proc/self/ns/time", F_OK) == 0) {
- test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
- } else {
- ksft_print_msg("Time namespaces are not supported\n");
- ksft_test_result_skip("Skipping clone3() with CLONE_NEWTIME\n");
+ ksft_test_result(test_clone3(tests[i].flags, size,
+ tests[i].expected,
+ tests[i].test_mode),
+ "%s\n", tests[i].name);
}
- /* Do a clone3() with exit signal (SIGCHLD) in flags */
- test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST);
-
ksft_finished();
}
ensure_dir "$scheme_dir" "exist"
ensure_file "$scheme_dir/action" "exist" "600"
test_access_pattern "$scheme_dir/access_pattern"
+ ensure_file "$scheme_dir/apply_interval_us" "exist" "600"
test_quotas "$scheme_dir/quotas"
test_watermarks "$scheme_dir/watermarks"
test_filters "$scheme_dir/filters"
#define VALIDATION_NO_THRESHOLD 0 /* Verify the entire region */
#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
+ #define SIZE_MB(m) ((size_t)m * (1024 * 1024))
+ #define SIZE_KB(k) ((size_t)k * 1024)
struct config {
unsigned long long src_alignment;
unsigned long long dest_alignment;
unsigned long long region_size;
int overlapping;
+ int dest_preamble_size;
};
struct test {
_1MB = 1ULL << 20,
_2MB = 2ULL << 20,
_4MB = 4ULL << 20,
+ _5MB = 5ULL << 20,
_1GB = 1ULL << 30,
_2GB = 2ULL << 30,
PMD = _2MB,
return success;
}
+ /*
+ * Returns the start address of the mapping on success, else returns
+ * NULL on failure.
+ */
+ static void *get_source_mapping(struct config c)
+ {
+ unsigned long long addr = 0ULL;
+ void *src_addr = NULL;
+ unsigned long long mmap_min_addr;
+
+ mmap_min_addr = get_mmap_min_addr();
+ /*
+ * For some tests, we need to not have any mappings below the
+ * source mapping. Add some headroom to mmap_min_addr for this.
+ */
+ mmap_min_addr += 10 * _4MB;
+
+ retry:
+ addr += c.src_alignment;
+ if (addr < mmap_min_addr)
+ goto retry;
+
+ src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
+ MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+ -1, 0);
+ if (src_addr == MAP_FAILED) {
+ if (errno == EPERM || errno == EEXIST)
+ goto retry;
+ goto error;
+ }
+ /*
+ * Check that the address is aligned to the specified alignment.
+ * Addresses which have alignments that are multiples of that
+ * specified are not considered valid. For instance, 1GB address is
+ * 2MB-aligned, however it will not be considered valid for a
+ * requested alignment of 2MB. This is done to reduce coincidental
+ * alignment in the tests.
+ */
+ if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
+ !((unsigned long long) src_addr & c.src_alignment)) {
+ munmap(src_addr, c.region_size);
+ goto retry;
+ }
+
+ if (!src_addr)
+ goto error;
+
+ return src_addr;
+ error:
+ ksft_print_msg("Failed to map source region: %s\n",
+ strerror(errno));
+ return NULL;
+ }
+
/*
* This test validates that merge is called when expanding a mapping.
* Mapping containing three pages is created, middle page is unmapped
}
/*
- * Returns the start address of the mapping on success, else returns
- * NULL on failure.
+ * Verify that an mremap within a range does not cause corruption
+ * of unrelated part of range.
+ *
+ * Consider the following range which is 2MB aligned and is
+ * a part of a larger 20MB range which is not shown. Each
+ * character is 256KB below making the source and destination
+ * 2MB each. The lower case letters are moved (s to d) and the
+ * upper case letters are not moved. The below test verifies
+ * that the upper case S letters are not corrupted by the
+ * adjacent mremap.
+ *
+ * |DDDDddddSSSSssss|
*/
- static void *get_source_mapping(struct config c)
+ static void mremap_move_within_range(char pattern_seed)
{
- unsigned long long addr = 0ULL;
- void *src_addr = NULL;
- unsigned long long mmap_min_addr;
+ char *test_name = "mremap mremap move within range";
+ void *src, *dest;
+ int i, success = 1;
+
+ size_t size = SIZE_MB(20);
+ void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (ptr == MAP_FAILED) {
+ perror("mmap");
+ success = 0;
+ goto out;
+ }
+ memset(ptr, 0, size);
- mmap_min_addr = get_mmap_min_addr();
+ src = ptr + SIZE_MB(6);
+ src = (void *)((unsigned long)src & ~(SIZE_MB(2) - 1));
- retry:
- addr += c.src_alignment;
- if (addr < mmap_min_addr)
- goto retry;
+ /* Set byte pattern for source block. */
+ srand(pattern_seed);
+ for (i = 0; i < SIZE_MB(2); i++) {
+ ((char *)src)[i] = (char) rand();
+ }
- src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
- MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
- -1, 0);
- if (src_addr == MAP_FAILED) {
- if (errno == EPERM || errno == EEXIST)
- goto retry;
- goto error;
+ dest = src - SIZE_MB(2);
+
+ void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+ MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+ if (new_ptr == MAP_FAILED) {
+ perror("mremap");
+ success = 0;
+ goto out;
}
- /*
- * Check that the address is aligned to the specified alignment.
- * Addresses which have alignments that are multiples of that
- * specified are not considered valid. For instance, 1GB address is
- * 2MB-aligned, however it will not be considered valid for a
- * requested alignment of 2MB. This is done to reduce coincidental
- * alignment in the tests.
- */
- if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
- !((unsigned long long) src_addr & c.src_alignment)) {
- munmap(src_addr, c.region_size);
- goto retry;
+
+ /* Verify byte pattern after remapping */
+ srand(pattern_seed);
+ for (i = 0; i < SIZE_MB(1); i++) {
+ char c = (char) rand();
+
+ if (((char *)src)[i] != c) {
+ ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+ i);
+ ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+ ((char *) src)[i] & 0xff);
+ success = 0;
+ }
}
- if (!src_addr)
- goto error;
+ out:
+ if (munmap(ptr, size) == -1)
+ perror("munmap");
- return src_addr;
- error:
- ksft_print_msg("Failed to map source region: %s\n",
- strerror(errno));
- return NULL;
+ if (success)
+ ksft_test_result_pass("%s\n", test_name);
+ else
+ ksft_test_result_fail("%s\n", test_name);
}
/* Returns the time taken for the remap on success else returns -1. */
static long long remap_region(struct config c, unsigned int threshold_mb,
char pattern_seed)
{
- void *addr, *src_addr, *dest_addr;
+ void *addr, *src_addr, *dest_addr, *dest_preamble_addr;
unsigned long long i;
struct timespec t_start = {0, 0}, t_end = {0, 0};
long long start_ns, end_ns, align_mask, ret, offset;
goto out;
}
- /* Set byte pattern */
+ /* Set byte pattern for source block. */
srand(pattern_seed);
for (i = 0; i < threshold; i++)
memset((char *) src_addr + i, (char) rand(), 1);
addr = (void *) (((unsigned long long) src_addr + c.region_size
+ offset) & align_mask);
+ /* Remap after the destination block preamble. */
+ addr += c.dest_preamble_size;
+
/* See comment in get_source_mapping() */
if (!((unsigned long long) addr & c.dest_alignment))
addr = (void *) ((unsigned long long) addr | c.dest_alignment);
if (addr + c.dest_alignment < addr) {
ksft_print_msg("Couldn't find a valid region to remap to\n");
ret = -1;
- goto out;
+ goto clean_up_src;
}
addr += c.dest_alignment;
}
+ if (c.dest_preamble_size) {
+ dest_preamble_addr = mmap((void *) addr - c.dest_preamble_size, c.dest_preamble_size,
+ PROT_READ | PROT_WRITE,
+ MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+ -1, 0);
+ if (dest_preamble_addr == MAP_FAILED) {
+ ksft_print_msg("Failed to map dest preamble region: %s\n",
+ strerror(errno));
+ ret = -1;
+ goto clean_up_src;
+ }
+
+ /* Set byte pattern for the dest preamble block. */
+ srand(pattern_seed);
+ for (i = 0; i < c.dest_preamble_size; i++)
+ memset((char *) dest_preamble_addr + i, (char) rand(), 1);
+ }
+
clock_gettime(CLOCK_MONOTONIC, &t_start);
dest_addr = mremap(src_addr, c.region_size, c.region_size,
MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr);
if (dest_addr == MAP_FAILED) {
ksft_print_msg("mremap failed: %s\n", strerror(errno));
ret = -1;
- goto clean_up_src;
+ goto clean_up_dest_preamble;
}
/* Verify byte pattern after remapping */
char c = (char) rand();
if (((char *) dest_addr)[i] != c) {
- ksft_print_msg("Data after remap doesn't match at offset %d\n",
+ ksft_print_msg("Data after remap doesn't match at offset %llu\n",
i);
ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
((char *) dest_addr)[i] & 0xff);
}
}
+ /* Verify the dest preamble byte pattern after remapping */
+ if (c.dest_preamble_size) {
+ srand(pattern_seed);
+ for (i = 0; i < c.dest_preamble_size; i++) {
+ char c = (char) rand();
+
+ if (((char *) dest_preamble_addr)[i] != c) {
+ ksft_print_msg("Preamble data after remap doesn't match at offset %d\n",
+ i);
+ ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+ ((char *) dest_preamble_addr)[i] & 0xff);
+ ret = -1;
+ goto clean_up_dest;
+ }
+ }
+ }
+
start_ns = t_start.tv_sec * NS_PER_SEC + t_start.tv_nsec;
end_ns = t_end.tv_sec * NS_PER_SEC + t_end.tv_nsec;
ret = end_ns - start_ns;
*/
clean_up_dest:
munmap(dest_addr, c.region_size);
+ clean_up_dest_preamble:
+ if (c.dest_preamble_size && dest_preamble_addr)
+ munmap(dest_preamble_addr, c.dest_preamble_size);
clean_up_src:
munmap(src_addr, c.region_size);
out:
return ret;
}
+ /*
+ * Verify that an mremap aligning down does not destroy
+ * the beginning of the mapping just because the aligned
+ * down address landed on a mapping that maybe does not exist.
+ */
+ static void mremap_move_1mb_from_start(char pattern_seed)
+ {
+ char *test_name = "mremap move 1mb from start at 1MB+256KB aligned src";
+ void *src = NULL, *dest = NULL;
+ int i, success = 1;
+
+ /* Config to reuse get_source_mapping() to do an aligned mmap. */
+ struct config c = {
+ .src_alignment = SIZE_MB(1) + SIZE_KB(256),
+ .region_size = SIZE_MB(6)
+ };
+
+ src = get_source_mapping(c);
+ if (!src) {
+ success = 0;
+ goto out;
+ }
+
+ c.src_alignment = SIZE_MB(1) + SIZE_KB(256);
+ dest = get_source_mapping(c);
+ if (!dest) {
+ success = 0;
+ goto out;
+ }
+
+ /* Set byte pattern for source block. */
+ srand(pattern_seed);
+ for (i = 0; i < SIZE_MB(2); i++) {
+ ((char *)src)[i] = (char) rand();
+ }
+
+ /*
+ * Unmap the beginning of dest so that the aligned address
+ * falls on no mapping.
+ */
+ munmap(dest, SIZE_MB(1));
+
+ void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+ MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+ if (new_ptr == MAP_FAILED) {
+ perror("mremap");
+ success = 0;
+ goto out;
+ }
+
+ /* Verify byte pattern after remapping */
+ srand(pattern_seed);
+ for (i = 0; i < SIZE_MB(1); i++) {
+ char c = (char) rand();
+
+ if (((char *)src)[i] != c) {
+ ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+ i);
+ ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+ ((char *) src)[i] & 0xff);
+ success = 0;
+ }
+ }
+
+ out:
+ if (src && munmap(src, c.region_size) == -1)
+ perror("munmap src");
+
+ if (dest && munmap(dest, c.region_size) == -1)
+ perror("munmap dest");
+
+ if (success)
+ ksft_test_result_pass("%s\n", test_name);
+ else
+ ksft_test_result_fail("%s\n", test_name);
+ }
+
static void run_mremap_test_case(struct test test_case, int *failures,
unsigned int threshold_mb,
unsigned int pattern_seed)
return 0;
}
- #define MAX_TEST 13
+ #define MAX_TEST 15
#define MAX_PERF_TEST 3
int main(int argc, char **argv)
{
unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
unsigned int pattern_seed;
int num_expand_tests = 2;
- struct test test_cases[MAX_TEST];
+ int num_misc_tests = 2;
+ struct test test_cases[MAX_TEST] = {};
struct test perf_test_cases[MAX_PERF_TEST];
int page_size;
time_t t;
test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
"2GB mremap - Source PUD-aligned, Destination PUD-aligned");
+ /* Src and Dest addr 1MB aligned. 5MB mremap. */
+ test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
+ /* Src and Dest addr 1MB aligned. 5MB mremap. */
+ test_cases[14] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble");
+ test_cases[14].config.dest_preamble_size = 10 * _4MB;
+
perf_test_cases[0] = MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
/*
(threshold_mb * _1MB >= _1GB);
ksft_set_plan(ARRAY_SIZE(test_cases) + (run_perf_tests ?
- ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests);
+ ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests + num_misc_tests);
for (i = 0; i < ARRAY_SIZE(test_cases); i++)
run_mremap_test_case(test_cases[i], &failures, threshold_mb,
fclose(maps_fp);
+ mremap_move_within_range(pattern_seed);
+ mremap_move_1mb_from_start(pattern_seed);
+
if (run_perf_tests) {
ksft_print_msg("\n%s\n",
"mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:");