A read-write single value file which exists on non-root
cgroups. The default is "100".
- The weight in the range [1, 10000].
+ For non idle groups (cpu.idle = 0), the weight is in the
+ range [1, 10000].
+
+ If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
+ then the weight will show as a 0.
cpu.weight.nice
A read-write single value file which exists on non-root
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.idle
+ A read-write single value file which exists on non-root cgroups.
+ The default is 0.
+
+ This is the cgroup analog of the per-task SCHED_IDLE sched policy.
+ Setting this value to a 1 will make the scheduling policy of the
+ cgroup SCHED_IDLE. The threads inside the cgroup will retain their
+ own relative priorities, but the cgroup itself will be treated as
+ very low priority relative to its peers.
+
Memory
limit, it will refuse to take any more stores before existing
entries fault back in or are written out to disk.
+ memory.zswap.writeback
+ A read-write single value file. The default value is "1". The
+ initial value of the root cgroup is 1, and when a new cgroup is
+ created, it inherits the current value of its parent.
+
+ When this is set to 0, all swapping attempts to swapping devices
+ are disabled. This included both zswap writebacks, and swapping due
+ to zswap store failures. If the zswap store failures are recurring
+ (for e.g if the pages are incompressible), users can observe
+ reclaim inefficiency after disabling writeback (because the same
+ pages might be rejected again and again).
+
+ Note that this is subtly different from setting memory.swap.max to
+ 0, as it still allows for pages to be written to the zswap pool.
+
memory.pressure
A read-only nested-keyed file.
treated to have an implicit value of "cpuset.cpus" in the
formation of local partition.
+ cpuset.cpus.isolated
+ A read-only and root cgroup only multiple values file.
+
+ This file shows the set of all isolated CPUs used in existing
+ isolated partitions. It will be empty if no isolated partition
+ is created.
+
cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
partition or scheduling domain. The set of exclusive CPUs is
determined by the value of its "cpuset.cpus.exclusive.effective".
- When set to "isolated", the CPUs in that partition will
- be in an isolated state without any load balancing from the
- scheduler. Tasks placed in such a partition with multiple
- CPUs should be carefully distributed and bound to each of the
- individual CPUs for optimal performance.
+ When set to "isolated", the CPUs in that partition will be in
+ an isolated state without any load balancing from the scheduler
+ and excluded from the unbound workqueues. Tasks placed in such
+ a partition with multiple CPUs should be carefully distributed
+ and bound to each of the individual CPUs for optimal performance.
A partition root ("root" or "isolated") can be in one of the
two possible states - valid or invalid. An invalid partition
T: git git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux.git
F: arch/arm/boot/dts/nxp/imx/
F: arch/arm/boot/dts/nxp/mxs/
+F: arch/arm64/boot/dts/freescale/
X: arch/arm64/boot/dts/freescale/fsl-*
X: arch/arm64/boot/dts/freescale/qoriq-*
X: drivers/media/i2c/
F: drivers/*/*wpcm*
ARM/NXP S32G ARCHITECTURE
-M: Chester Lin <clin@suse.com>
+M: Chester Lin <chester62515@gmail.com>
S: Supported
F: drivers/net/wireless/broadcom/brcm80211/
S: Maintained
F: mm/memcontrol.c
F: mm/swap_cgroup.c
+ F: samples/cgroup/*
F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
F: tools/testing/selftests/cgroup/test_kmem.c
S: Maintained
-W: http://sources.redhat.com/dm
Q: http://patchwork.kernel.org/project/dm-devel/list/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git
-T: quilt http://people.redhat.com/agk/patches/linux/editing/
F: Documentation/admin-guide/device-mapper/
F: drivers/md/Kconfig
F: drivers/md/Makefile
F: drivers/gpu/drm/vboxvideo/
DRM DRIVER FOR VMWARE VIRTUAL GPU
-M: Zack Rusin <zackr@vmware.com>
-R: VMware Graphics Reviewers <linux-graphics-maintainer@vmware.com>
+M: Zack Rusin <zack.rusin@broadcom.com>
+R: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
S: Supported
T: git git://anongit.freedesktop.org/drm/drm-misc
FILESYSTEMS (VFS and infrastructure)
S: Maintained
F: fs/*
F: fs/fhandle.c
F: include/linux/exportfs.h
+FILESYSTEMS [IDMAPPED MOUNTS]
+S: Maintained
+F: Documentation/filesystems/idmappings.rst
+F: fs/mnt_idmapping.c
+F: include/linux/mnt_idmapping.*
+F: tools/testing/selftests/mount_setattr/
+
FILESYSTEMS [IOMAP]
F: fs/iomap/
F: include/linux/iomap.h
+FILESYSTEMS [STACKABLE]
+S: Maintained
+F: fs/backing-file.c
+F: include/linux/backing-file.h
+
FINTEK F75375S HARDWARE MONITOR AND FAN CONTROLLER DRIVER
GPIO SUBSYSTEM
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux.git
-F: Documentation/ABI/obsolete/sysfs-gpio
-F: Documentation/ABI/testing/gpio-cdev
F: Documentation/admin-guide/gpio/
F: Documentation/devicetree/bindings/gpio/
F: Documentation/driver-api/gpio/
F: include/linux/gpio.h
F: include/linux/gpio/
F: include/linux/of_gpio.h
+
+GPIO UAPI
+S: Maintained
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux.git
+F: Documentation/ABI/obsolete/sysfs-gpio
+F: Documentation/ABI/testing/gpio-cdev
+F: drivers/gpio/gpiolib-cdev.c
F: include/uapi/linux/gpio.h
F: tools/gpio/
HISILICON NETWORK SUBSYSTEM 3 DRIVER (HNS3)
S: Maintained
W: http://www.hisilicon.com
F: include/linux/hisi_acc_qm.h
HISILICON ROCE DRIVER
S: Maintained
F: drivers/net/ethernet/huawei/hinic/
HUGETLB SUBSYSTEM
S: Maintained
F: drivers/media/platform/st/sti/hva
HWPOISON MEMORY FAILURE HANDLING
-M: Naoya Horiguchi <naoya.horiguchi@nec.com>
-R: Miaohe Lin <linmiaohe@huawei.com>
+M: Miaohe Lin <linmiaohe@huawei.com>
+R: Naoya Horiguchi <naoya.horiguchi@nec.com>
S: Maintained
F: mm/hwpoison-inject.c
W: https://github.com/o2genum/ideapad-slidebar
F: drivers/input/misc/ideapad_slidebar.c
-IDMAPPED MOUNTS
-S: Maintained
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping.git
-F: Documentation/filesystems/idmappings.rst
-F: include/linux/mnt_idmapping.*
-F: tools/testing/selftests/mount_setattr/
-
IDT VersaClock 5 CLOCK DRIVER
S: Maintained
F: drivers/gpio/gpio-sch.c
F: drivers/gpio/gpio-sodaville.c
F: drivers/gpio/gpio-tangier.c
+F: drivers/gpio/gpio-tangier.h
INTEL GVT-g DRIVERS (Intel GPU Virtualization)
F: scripts/Kbuild*
F: scripts/Makefile*
F: scripts/basic/
+F: scripts/clang-tools/
F: scripts/dummy-tools/
F: scripts/mk*
F: scripts/mod/
S: Supported
W: https://github.com/linuxppc/wiki/wiki
F: arch/powerpc/platforms/40x/
F: arch/powerpc/platforms/44x/
-LINUX FOR POWERPC EMBEDDED PPC83XX AND PPC85XX
+LINUX FOR POWERPC EMBEDDED PPC85XX
S: Odd fixes
T: git git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git
F: Documentation/devicetree/bindings/cache/freescale-l2cache.txt
F: Documentation/devicetree/bindings/powerpc/fsl/
-F: arch/powerpc/platforms/83xx/
F: arch/powerpc/platforms/85xx/
-LINUX FOR POWERPC EMBEDDED PPC8XX
+LINUX FOR POWERPC EMBEDDED PPC8XX AND PPC83XX
S: Maintained
F: arch/powerpc/platforms/8xx/
+F: arch/powerpc/platforms/83xx/
LINUX KERNEL DUMP TEST MODULE (LKDTM)
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
F: Documentation/locking/
F: arch/*/include/asm/spinlock*.h
-F: include/linux/lockdep.h
+F: include/linux/lockdep*.h
F: include/linux/mutex*.h
F: include/linux/rwlock*.h
F: include/linux/rwsem*.h
F: drivers/net/ethernet/marvell/mvneta.*
MARVELL MVPP2 ETHERNET DRIVER
-M: Marcin Wojtas <mw@semihalf.com>
+M: Marcin Wojtas <marcin.s.wojtas@gmail.com>
S: Maintained
F: net/
F: tools/net/
F: tools/testing/selftests/net/
+X: net/9p/
X: net/bluetooth/
NETWORKING [IPSEC]
NETWORKING [MPTCP]
S: Maintained
F: drivers/bluetooth/btnxpuart.c
NXP C45 TJA11XX PHY DRIVER
S: Maintained
F: drivers/net/phy/nxp-c45-tja11xx.c
F: drivers/pci/controller/dwc/pcie-armada8k.c
PCI DRIVER FOR CADENCE PCIE IP
-S: Maintained
+S: Orphan
F: Documentation/devicetree/bindings/pci/cdns,*
-F: drivers/pci/controller/cadence/
+F: drivers/pci/controller/cadence/*cadence*
PCI DRIVER FOR FREESCALE LAYERSCAPE
F: drivers/misc/sgi-xp/
SHARED MEMORY COMMUNICATIONS (SMC) SOCKETS
S: Maintained
F: drivers/mmc/host/dw_mmc*
+SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+S: Supported
+F: Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F: drivers/perf/dwc_pcie_pmu.c
+
SYNOPSYS HSDK RESET CONTROLLER DRIVER
S: Supported
F: include/linux/vmw_vmci*
VMWARE VMMOUSE SUBDRIVER
S: Supported
F: drivers/input/mouse/vmmouse.c
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PTE_DEVMAP
select ARCH_HAS_PTE_SPECIAL
+ select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_SETUP_DMA_OPS
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY
select HAVE_MOVE_PUD
select HAVE_PCI
select HAVE_ACPI_APEI if (ACPI && EFI)
- select HAVE_ALIGNED_STRUCT_PAGE if SLUB
+ select HAVE_ALIGNED_STRUCT_PAGE
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_BITREVERSE
select HAVE_ARCH_COMPILER_H
# include/linux/mmzone.h requires the following to be true:
#
- # MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
+ # MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
#
- # so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
+ # so the maximum value of MAX_PAGE_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
#
- # | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER |
- # ----+-------------------+--------------+-----------------+--------------------+
- # 4K | 27 | 12 | 15 | 10 |
- # 16K | 27 | 14 | 13 | 11 |
- # 64K | 29 | 16 | 13 | 13 |
+ # | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_PAGE_ORDER | default MAX_PAGE_ORDER |
+ # ----+-------------------+--------------+----------------------+-------------------------+
+ # 4K | 27 | 12 | 15 | 10 |
+ # 16K | 27 | 14 | 13 | 11 |
+ # 64K | 29 | 16 | 13 | 13 |
config ARCH_FORCE_MAX_ORDER
int
default "13" if ARM64_64K_PAGES
default "10"
help
The kernel page allocator limits the size of maximal physically
- contiguous allocations. The limit is called MAX_ORDER and it
+ contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
The maximal size of allocation cannot exceed the size of the
- section, so the value of MAX_ORDER should satisfy
+ section, so the value of MAX_PAGE_ORDER should satisfy
- MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
+ MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
Don't change if unsure.
config UNMAP_KERNEL_AT_EL0
- bool "Unmap kernel when running in userspace (aka \"KAISER\")" if EXPERT
+ bool "Unmap kernel when running in userspace (KPTI)" if EXPERT
default y
help
Speculation attacks against some high-performance processors can
#define KERNEL_END _end
/*
- * Generic and tag-based KASAN require 1/8th and 1/16th of the kernel virtual
- * address space for the shadow region respectively. They can bloat the stack
- * significantly, so double the (minimum) stack size when they are in use.
+ * Generic and Software Tag-Based KASAN modes require 1/8th and 1/16th of the
+ * kernel virtual address space for storing the shadow memory respectively.
+ *
+ * The mapping between a virtual memory address and its corresponding shadow
+ * memory address is defined based on the formula:
+ *
+ * shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
+ *
+ * where KASAN_SHADOW_SCALE_SHIFT is the order of the number of bits that map
+ * to a single shadow byte and KASAN_SHADOW_OFFSET is a constant that offsets
+ * the mapping. Note that KASAN_SHADOW_OFFSET does not point to the start of
+ * the shadow memory region.
+ *
+ * Based on this mapping, we define two constants:
+ *
+ * KASAN_SHADOW_START: the start of the shadow memory region;
+ * KASAN_SHADOW_END: the end of the shadow memory region.
+ *
+ * KASAN_SHADOW_END is defined first as the shadow address that corresponds to
+ * the upper bound of possible virtual kernel memory addresses UL(1) << 64
+ * according to the mapping formula.
+ *
+ * KASAN_SHADOW_START is defined second based on KASAN_SHADOW_END. The shadow
+ * memory start must map to the lowest possible kernel virtual memory address
+ * and thus it depends on the actual bitness of the address space.
+ *
+ * As KASAN inserts redzones between stack variables, this increases the stack
+ * memory usage significantly. Thus, we double the (minimum) stack size.
*/
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
- #define KASAN_SHADOW_END ((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) \
- + KASAN_SHADOW_OFFSET)
- #define PAGE_END (KASAN_SHADOW_END - (1UL << (vabits_actual - KASAN_SHADOW_SCALE_SHIFT)))
+ #define KASAN_SHADOW_END ((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) + KASAN_SHADOW_OFFSET)
+ #define _KASAN_SHADOW_START(va) (KASAN_SHADOW_END - (UL(1) << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
+ #define KASAN_SHADOW_START _KASAN_SHADOW_START(vabits_actual)
+ #define PAGE_END KASAN_SHADOW_START
#define KASAN_THREAD_SHIFT 1
#else
#define KASAN_THREAD_SHIFT 0
#include <linux/types.h>
#include <asm/boot.h>
#include <asm/bug.h>
+#include <asm/sections.h>
#if VA_BITS > 48
extern u64 vabits_actual;
/* PHYS_OFFSET - the physical address of the start of memory. */
#define PHYS_OFFSET ({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })
-/* the virtual base of the kernel image */
-extern u64 kimage_vaddr;
-
/* the offset between the kernel virtual and physical mappings */
extern u64 kimage_voffset;
static inline unsigned long kaslr_offset(void)
{
- return kimage_vaddr - KIMAGE_VADDR;
+ return (u64)&_text - KIMAGE_VADDR;
}
#ifdef CONFIG_RANDOMIZE_BASE
#define INIT_MEMBLOCK_MEMORY_REGIONS (INIT_MEMBLOCK_REGIONS * 8)
#endif
-#include <asm-generic/memory_model.h>
#endif /* __ASM_MEMORY_H */
select EDAC_ATOMIC_SCRUB
select EDAC_SUPPORT
select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY if ARCH_USING_PATCHABLE_FUNCTION_ENTRY
+ select FUNCTION_ALIGNMENT_4B
select GENERIC_ATOMIC64 if PPC32
select GENERIC_CLOCKEVENTS_BROADCAST if SMP
select GENERIC_CMOS_UPDATE
default "10"
help
The kernel page allocator limits the size of maximal physically
- contiguous allocations. The limit is called MAX_ORDER and it
+ contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
+ select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_COPY_MC if X86_64
select HAS_IOPORT
select HAVE_ACPI_APEI if ACPI
select HAVE_ACPI_APEI_NMI if ACPI
- select HAVE_ALIGNED_STRUCT_PAGE if SLUB
+ select HAVE_ALIGNED_STRUCT_PAGE
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
select HAVE_ARCH_HUGE_VMALLOC if X86_64
def_bool y
depends on INTEL_IOMMU && ACPI
-config X86_32_SMP
- def_bool y
- depends on X86_32 && SMP
-
config X86_64_SMP
def_bool y
depends on X86_64 && SMP
config HIGHMEM64G
bool "64GB"
- depends on !M486SX && !M486 && !M586 && !M586TSC && !M586MMX && !MGEODE_LX && !MGEODEGX1 && !MCYRIXIII && !MELAN && !MWINCHIPC6 && !MWINCHIP3D && !MK6
+ depends on X86_HAVE_PAE
select X86_PAE
help
Select this if you have a 32-bit processor and more than 4
config X86_PAE
bool "PAE (Physical Address Extension) Support"
- depends on X86_32 && !HIGHMEM4G
+ depends on X86_32 && X86_HAVE_PAE
select PHYS_ADDR_T_64BIT
select SWIOTLB
help
* later
*/
buf_extra = (PAGE_SIZE - size % PAGE_SIZE) % PAGE_SIZE;
- max_order = min(MAX_ORDER - 1, get_order(size));
+ max_order = min(MAX_PAGE_ORDER - 1, get_order(size));
} else {
/* allocate a single page for book keeping */
nr_pages = 1;
struct dma_buf_attachment *attach;
struct drm_gem_object *obj;
struct qaic_bo *bo;
- size_t size;
int ret;
bo = qaic_alloc_init_bo();
goto attach_fail;
}
- size = PAGE_ALIGN(attach->dmabuf->size);
- if (size == 0) {
+ if (!attach->dmabuf->size) {
ret = -EINVAL;
goto size_align_fail;
}
- drm_gem_private_object_init(dev, obj, size);
+ drm_gem_private_object_init(dev, obj, attach->dmabuf->size);
/*
* skipping dma_buf_map_attachment() as we do not know the direction
* just yet. Once the direction is known in the subsequent IOCTL to
config FS_IOMAP
bool
+# Stackable filesystems
+config FS_STACK
+ bool
+
config BUFFER_HEAD
bool
config ARCH_SUPPORTS_HUGETLBFS
def_bool n
- config HUGETLBFS
+ menuconfig HUGETLBFS
bool "HugeTLB file system support"
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
depends on (SYSFS || SYSCTL)
If unsure, say N.
+ if HUGETLBFS
+ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
+ bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
+ default n
+ depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ help
+ The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
+ enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
+ (boot command line) or hugetlb_optimize_vmemmap (sysctl).
+ endif # HUGETLBFS
+
config HUGETLB_PAGE
def_bool HUGETLBFS
select XARRAY_MULTI
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
- config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
- bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
- default n
- depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
- help
- The HugeTLB VmemmapvOptimization (HVO) defaults to off. Say Y here to
- enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
- (boot command line) or hugetlb_optimize_vmemmap (sysctl).
-
config ARCH_HAS_GIGANTIC_PAGE
bool
retry:
bch2_trans_begin(trans);
- ret = bch2_create_trans(trans,
+ ret = bch2_subvol_is_ro_trans(trans, dir->ei_subvol) ?:
+ bch2_create_trans(trans,
inode_inum(dir), &dir_u, &inode_u,
!(flags & BCH_CREATE_TMPFILE)
? &dentry->d_name : NULL,
lockdep_assert_held(&inode->v.i_rwsem);
- ret = __bch2_link(c, inode, dir, dentry);
+ ret = bch2_subvol_is_ro(c, dir->ei_subvol) ?:
+ bch2_subvol_is_ro(c, inode->ei_subvol) ?:
+ __bch2_link(c, inode, dir, dentry);
if (unlikely(ret))
return ret;
static int bch2_unlink(struct inode *vdir, struct dentry *dentry)
{
- return __bch2_unlink(vdir, dentry, false);
+ struct bch_inode_info *dir= to_bch_ei(vdir);
+ struct bch_fs *c = dir->v.i_sb->s_fs_info;
+
+ return bch2_subvol_is_ro(c, dir->ei_subvol) ?:
+ __bch2_unlink(vdir, dentry, false);
}
static int bch2_symlink(struct mnt_idmap *idmap,
src_inode,
dst_inode);
+ ret = bch2_subvol_is_ro_trans(trans, src_dir->ei_subvol) ?:
+ bch2_subvol_is_ro_trans(trans, dst_dir->ei_subvol);
+ if (ret)
+ goto err;
+
if (inode_attr_changing(dst_dir, src_inode, Inode_opt_project)) {
ret = bch2_fs_quota_transfer(c, src_inode,
dst_dir->ei_qid,
struct dentry *dentry, struct iattr *iattr)
{
struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
int ret;
lockdep_assert_held(&inode->v.i_rwsem);
- ret = setattr_prepare(idmap, dentry, iattr);
+ ret = bch2_subvol_is_ro(c, inode->ei_subvol) ?:
+ setattr_prepare(idmap, dentry, iattr);
if (ret)
return ret;
return bch2_err_class(ret);
}
+static int bch2_open(struct inode *vinode, struct file *file)
+{
+ if (file->f_flags & (O_WRONLY|O_RDWR)) {
+ struct bch_inode_info *inode = to_bch_ei(vinode);
+ struct bch_fs *c = inode->v.i_sb->s_fs_info;
+
+ int ret = bch2_subvol_is_ro(c, inode->ei_subvol);
+ if (ret)
+ return ret;
+ }
+
+ return generic_file_open(vinode, file);
+}
+
static const struct file_operations bch_file_operations = {
+ .open = bch2_open,
.llseek = bch2_llseek,
.read_iter = bch2_read_iter,
.write_iter = bch2_write_iter,
.mmap = bch2_mmap,
- .open = generic_file_open,
.fsync = bch2_fsync,
.splice_read = filemap_splice_read,
.splice_write = iter_file_splice_write,
#ifdef CONFIG_MIGRATION
.migrate_folio = filemap_migrate_folio,
#endif
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
};
struct bcachefs_fid {
{
struct bch_inode_info *inode = to_bch_ei(vinode);
struct bch_inode_info *dir = to_bch_ei(vdir);
-
- if (*len < sizeof(struct bcachefs_fid_with_parent) / sizeof(u32))
- return FILEID_INVALID;
+ int min_len;
if (!S_ISDIR(inode->v.i_mode) && dir) {
struct bcachefs_fid_with_parent *fid = (void *) fh;
+ min_len = sizeof(*fid) / sizeof(u32);
+ if (*len < min_len) {
+ *len = min_len;
+ return FILEID_INVALID;
+ }
+
fid->fid = bch2_inode_to_fid(inode);
fid->dir = bch2_inode_to_fid(dir);
- *len = sizeof(*fid) / sizeof(u32);
+ *len = min_len;
return FILEID_BCACHEFS_WITH_PARENT;
} else {
struct bcachefs_fid *fid = (void *) fh;
+ min_len = sizeof(*fid) / sizeof(u32);
+ if (*len < min_len) {
+ *len = min_len;
+ return FILEID_INVALID;
+ }
*fid = bch2_inode_to_fid(inode);
- *len = sizeof(*fid) / sizeof(u32);
+ *len = min_len;
return FILEID_BCACHEFS_WITHOUT_PARENT;
}
}
struct bch_fs *c = sb->s_fs_info;
int ret;
+ if (test_bit(BCH_FS_EMERGENCY_RO, &c->flags))
+ return 0;
+
down_write(&c->state_lock);
ret = bch2_fs_read_write(c);
up_write(&c->state_lock);
* And at reserve time, it's always aligned to page size, so
* just free one page here.
*/
- btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE);
+ btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, NULL);
btrfs_free_path(path);
btrfs_end_transaction(trans);
return ret;
*/
if (state_flags & EXTENT_DELALLOC)
btrfs_qgroup_free_data(BTRFS_I(inode), NULL, start,
- end - start + 1);
+ end - start + 1, NULL);
clear_extent_bit(io_tree, start, end,
EXTENT_CLEAR_ALL_BITS | EXTENT_DO_ACCOUNTING,
* reserved data space.
* Since the IO will never happen for this page.
*/
- btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur);
+ btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur, NULL);
if (!inode_evicting) {
clear_extent_bit(tree, cur, range_end, EXTENT_LOCKED |
EXTENT_DELALLOC | EXTENT_UPTODATE |
struct btrfs_path *path;
u64 start = ins->objectid;
u64 len = ins->offset;
- int qgroup_released;
+ u64 qgroup_released = 0;
int ret;
memset(&stack_fi, 0, sizeof(stack_fi));
btrfs_set_stack_file_extent_compression(&stack_fi, BTRFS_COMPRESS_NONE);
/* Encryption and other encoding is reserved and all 0 */
- qgroup_released = btrfs_qgroup_release_data(inode, file_offset, len);
- if (qgroup_released < 0)
- return ERR_PTR(qgroup_released);
+ ret = btrfs_qgroup_release_data(inode, file_offset, len, &qgroup_released);
+ if (ret < 0)
+ return ERR_PTR(ret);
if (trans) {
ret = insert_reserved_file_extent(trans, inode,
btrfs_delalloc_release_metadata(inode, disk_num_bytes, ret < 0);
out_qgroup_free_data:
if (ret < 0)
- btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes);
+ btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes, NULL);
out_free_data_space:
/*
* If btrfs_reserve_extent() succeeded, then we already decremented
.release_folio = btrfs_release_folio,
.migrate_folio = btrfs_migrate_folio,
.dirty_folio = filemap_dirty_folio,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
.swap_activate = btrfs_swap_activate,
.swap_deactivate = btrfs_swap_deactivate,
};
* Various filesystems appear to want __find_get_block to be non-blocking.
* But it's the page lock which protects the buffers. To get around this,
* we get exclusion from try_to_free_buffers with the blockdev mapping's
- * private_lock.
+ * i_private_lock.
*
- * Hack idea: for the blockdev mapping, private_lock contention
+ * Hack idea: for the blockdev mapping, i_private_lock contention
* may be quite high. This code could TryLock the page, and if that
- * succeeds, there is no need to take private_lock.
+ * succeeds, there is no need to take i_private_lock.
*/
static struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block)
int all_mapped = 1;
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
- index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+ index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
if (IS_ERR(folio))
goto out;
- spin_lock(&bd_mapping->private_lock);
+ spin_lock(&bd_mapping->i_private_lock);
head = folio_buffers(folio);
if (!head)
goto out_unlock;
1 << bd_inode->i_blkbits);
}
out_unlock:
- spin_unlock(&bd_mapping->private_lock);
+ spin_unlock(&bd_mapping->i_private_lock);
folio_put(folio);
out:
return ret;
}
/*
- * Completion handler for block_write_full_page() - pages which are unlocked
- * during I/O, and which have PageWriteback cleared upon I/O completion.
+ * Completion handler for block_write_full_folio() - folios which are unlocked
+ * during I/O, and which have the writeback flag cleared upon I/O completion.
*/
- void end_buffer_async_write(struct buffer_head *bh, int uptodate)
+ static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{
unsigned long flags;
struct buffer_head *first;
spin_unlock_irqrestore(&first->b_uptodate_lock, flags);
return;
}
- EXPORT_SYMBOL(end_buffer_async_write);
/*
* If a page's buffers are under async readin (end_buffer_async_read
*
* The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
* inode_has_buffers() and invalidate_inode_buffers() are provided for the
- * management of a list of dependent buffers at ->i_mapping->private_list.
+ * management of a list of dependent buffers at ->i_mapping->i_private_list.
*
* Locking is a little subtle: try_to_free_buffers() will remove buffers
* from their controlling inode's queue when they are being freed. But
* try_to_free_buffers() will be operating against the *blockdev* mapping
* at the time, not against the S_ISREG file which depends on those buffers.
- * So the locking for private_list is via the private_lock in the address_space
+ * So the locking for i_private_list is via the i_private_lock in the address_space
* which backs the buffers. Which is different from the address_space
* against which the buffers are listed. So for a particular address_space,
- * mapping->private_lock does *not* protect mapping->private_list! In fact,
- * mapping->private_list will always be protected by the backing blockdev's
- * ->private_lock.
+ * mapping->i_private_lock does *not* protect mapping->i_private_list! In fact,
+ * mapping->i_private_list will always be protected by the backing blockdev's
+ * ->i_private_lock.
*
* Which introduces a requirement: all buffers on an address_space's
- * ->private_list must be from the same address_space: the blockdev's.
+ * ->i_private_list must be from the same address_space: the blockdev's.
*
- * address_spaces which do not place buffers at ->private_list via these
- * utility functions are free to use private_lock and private_list for
- * whatever they want. The only requirement is that list_empty(private_list)
+ * address_spaces which do not place buffers at ->i_private_list via these
+ * utility functions are free to use i_private_lock and i_private_list for
+ * whatever they want. The only requirement is that list_empty(i_private_list)
* be true at clear_inode() time.
*
* FIXME: clear_inode should not call invalidate_inode_buffers(). The
*/
/*
- * The buffer's backing address_space's private_lock must be held
+ * The buffer's backing address_space's i_private_lock must be held
*/
static void __remove_assoc_queue(struct buffer_head *bh)
{
int inode_has_buffers(struct inode *inode)
{
- return !list_empty(&inode->i_data.private_list);
+ return !list_empty(&inode->i_data.i_private_list);
}
/*
* sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
* @mapping: the mapping which wants those buffers written
*
- * Starts I/O against the buffers at mapping->private_list, and waits upon
+ * Starts I/O against the buffers at mapping->i_private_list, and waits upon
* that I/O.
*
* Basically, this is a convenience function for fsync().
*/
int sync_mapping_buffers(struct address_space *mapping)
{
- struct address_space *buffer_mapping = mapping->private_data;
+ struct address_space *buffer_mapping = mapping->i_private_data;
- if (buffer_mapping == NULL || list_empty(&mapping->private_list))
+ if (buffer_mapping == NULL || list_empty(&mapping->i_private_list))
return 0;
- return fsync_buffers_list(&buffer_mapping->private_lock,
- &mapping->private_list);
+ return fsync_buffers_list(&buffer_mapping->i_private_lock,
+ &mapping->i_private_list);
}
EXPORT_SYMBOL(sync_mapping_buffers);
struct address_space *buffer_mapping = bh->b_folio->mapping;
mark_buffer_dirty(bh);
- if (!mapping->private_data) {
- mapping->private_data = buffer_mapping;
+ if (!mapping->i_private_data) {
+ mapping->i_private_data = buffer_mapping;
} else {
- BUG_ON(mapping->private_data != buffer_mapping);
+ BUG_ON(mapping->i_private_data != buffer_mapping);
}
if (!bh->b_assoc_map) {
- spin_lock(&buffer_mapping->private_lock);
+ spin_lock(&buffer_mapping->i_private_lock);
list_move_tail(&bh->b_assoc_buffers,
- &mapping->private_list);
+ &mapping->i_private_list);
bh->b_assoc_map = mapping;
- spin_unlock(&buffer_mapping->private_lock);
+ spin_unlock(&buffer_mapping->i_private_lock);
}
}
EXPORT_SYMBOL(mark_buffer_dirty_inode);
* bit, see a bunch of clean buffers and we'd end up with dirty buffers/clean
* page on the dirty page list.
*
- * We use private_lock to lock against try_to_free_buffers while using the
+ * We use i_private_lock to lock against try_to_free_buffers while using the
* page's buffer list. Also use this to protect against clean buffers being
* added to the page after it was set dirty.
*
struct buffer_head *head;
bool newly_dirty;
- spin_lock(&mapping->private_lock);
+ spin_lock(&mapping->i_private_lock);
head = folio_buffers(folio);
if (head) {
struct buffer_head *bh = head;
*/
folio_memcg_lock(folio);
newly_dirty = !folio_test_set_dirty(folio);
- spin_unlock(&mapping->private_lock);
+ spin_unlock(&mapping->i_private_lock);
if (newly_dirty)
__folio_mark_dirty(folio, mapping, 1);
smp_mb();
if (buffer_dirty(bh)) {
list_add(&bh->b_assoc_buffers,
- &mapping->private_list);
+ &mapping->i_private_list);
bh->b_assoc_map = mapping;
}
spin_unlock(lock);
* probably unmounting the fs, but that doesn't mean we have already
* done a sync(). Just drop the buffers from the inode list.
*
- * NOTE: we take the inode's blockdev's mapping's private_lock. Which
+ * NOTE: we take the inode's blockdev's mapping's i_private_lock. Which
* assumes that all the buffers are against the blockdev. Not true
* for reiserfs.
*/
{
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
- struct list_head *list = &mapping->private_list;
- struct address_space *buffer_mapping = mapping->private_data;
+ struct list_head *list = &mapping->i_private_list;
+ struct address_space *buffer_mapping = mapping->i_private_data;
- spin_lock(&buffer_mapping->private_lock);
+ spin_lock(&buffer_mapping->i_private_lock);
while (!list_empty(list))
__remove_assoc_queue(BH_ENTRY(list->next));
- spin_unlock(&buffer_mapping->private_lock);
+ spin_unlock(&buffer_mapping->i_private_lock);
}
}
EXPORT_SYMBOL(invalidate_inode_buffers);
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
- struct list_head *list = &mapping->private_list;
- struct address_space *buffer_mapping = mapping->private_data;
+ struct list_head *list = &mapping->i_private_list;
+ struct address_space *buffer_mapping = mapping->i_private_data;
- spin_lock(&buffer_mapping->private_lock);
+ spin_lock(&buffer_mapping->i_private_lock);
while (!list_empty(list)) {
struct buffer_head *bh = BH_ENTRY(list->next);
if (buffer_dirty(bh)) {
}
__remove_assoc_queue(bh);
}
- spin_unlock(&buffer_mapping->private_lock);
+ spin_unlock(&buffer_mapping->i_private_lock);
}
return ret;
}
* Initialise the state of a blockdev folio's buffers.
*/
static sector_t folio_init_buffers(struct folio *folio,
- struct block_device *bdev, sector_t block, int size)
+ struct block_device *bdev, unsigned size)
{
struct buffer_head *head = folio_buffers(folio);
struct buffer_head *bh = head;
bool uptodate = folio_test_uptodate(folio);
+ sector_t block = div_u64(folio_pos(folio), size);
sector_t end_block = blkdev_max_block(bdev, size);
do {
}
/*
- * Create the page-cache page that contains the requested block.
+ * Create the page-cache folio that contains the requested block.
*
* This is used purely for blockdev mappings.
+ *
+ * Returns false if we have a failure which cannot be cured by retrying
+ * without sleeping. Returns true if we succeeded, or the caller should retry.
*/
- static int
- grow_dev_page(struct block_device *bdev, sector_t block,
- pgoff_t index, int size, int sizebits, gfp_t gfp)
+ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
+ pgoff_t index, unsigned size, gfp_t gfp)
{
struct inode *inode = bdev->bd_inode;
struct folio *folio;
struct buffer_head *bh;
- sector_t end_block;
- int ret = 0;
+ sector_t end_block = 0;
folio = __filemap_get_folio(inode->i_mapping, index,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio))
- return PTR_ERR(folio);
+ return false;
bh = folio_buffers(folio);
if (bh) {
if (bh->b_size == size) {
- end_block = folio_init_buffers(folio, bdev,
- (sector_t)index << sizebits, size);
- goto done;
+ end_block = folio_init_buffers(folio, bdev, size);
+ goto unlock;
+ }
+
+ /*
+ * Retrying may succeed; for example the folio may finish
+ * writeback, or buffers may be cleaned. This should not
+ * happen very often; maybe we have old buffers attached to
+ * this blockdev's page cache and we're trying to change
+ * the block size?
+ */
+ if (!try_to_free_buffers(folio)) {
+ end_block = ~0ULL;
+ goto unlock;
}
- if (!try_to_free_buffers(folio))
- goto failed;
}
- ret = -ENOMEM;
bh = folio_alloc_buffers(folio, size, gfp | __GFP_ACCOUNT);
if (!bh)
- goto failed;
+ goto unlock;
/*
* Link the folio to the buffers and initialise them. Take the
* lock to be atomic wrt __find_get_block(), which does not
* run under the folio lock.
*/
- spin_lock(&inode->i_mapping->private_lock);
+ spin_lock(&inode->i_mapping->i_private_lock);
link_dev_buffers(folio, bh);
- end_block = folio_init_buffers(folio, bdev,
- (sector_t)index << sizebits, size);
+ end_block = folio_init_buffers(folio, bdev, size);
- spin_unlock(&inode->i_mapping->private_lock);
+ spin_unlock(&inode->i_mapping->i_private_lock);
- done:
- ret = (block < end_block) ? 1 : -ENXIO;
- failed:
+ unlock:
folio_unlock(folio);
folio_put(folio);
- return ret;
+ return block < end_block;
}
/*
- * Create buffers for the specified block device block's page. If
- * that page was dirty, the buffers are set dirty also.
+ * Create buffers for the specified block device block's folio. If
+ * that folio was dirty, the buffers are set dirty also. Returns false
+ * if we've hit a permanent error.
*/
- static int
- grow_buffers(struct block_device *bdev, sector_t block, int size, gfp_t gfp)
+ static bool grow_buffers(struct block_device *bdev, sector_t block,
+ unsigned size, gfp_t gfp)
{
- pgoff_t index;
- int sizebits;
-
- sizebits = PAGE_SHIFT - __ffs(size);
- index = block >> sizebits;
+ loff_t pos;
/*
- * Check for a block which wants to lie outside our maximum possible
- * pagecache index. (this comparison is done using sector_t types).
+ * Check for a block which lies outside our maximum possible
+ * pagecache index.
*/
- if (unlikely(index != block >> sizebits)) {
- printk(KERN_ERR "%s: requested out-of-range block %llu for "
- "device %pg\n",
+ if (check_mul_overflow(block, (sector_t)size, &pos) || pos > MAX_LFS_FILESIZE) {
+ printk(KERN_ERR "%s: requested out-of-range block %llu for device %pg\n",
__func__, (unsigned long long)block,
bdev);
- return -EIO;
+ return false;
}
- /* Create a page with the proper size buffers.. */
- return grow_dev_page(bdev, block, index, size, sizebits, gfp);
+ /* Create a folio with the proper size buffers */
+ return grow_dev_folio(bdev, block, pos / PAGE_SIZE, size, gfp);
}
static struct buffer_head *
for (;;) {
struct buffer_head *bh;
- int ret;
bh = __find_get_block(bdev, block, size);
if (bh)
return bh;
- ret = grow_buffers(bdev, block, size, gfp);
- if (ret < 0)
+ if (!grow_buffers(bdev, block, size, gfp))
return NULL;
}
}
* and then attach the address_space's inode to its superblock's dirty
* inode list.
*
- * mark_buffer_dirty() is atomic. It takes bh->b_folio->mapping->private_lock,
+ * mark_buffer_dirty() is atomic. It takes bh->b_folio->mapping->i_private_lock,
* i_pages lock and mapping->host->i_lock.
*/
void mark_buffer_dirty(struct buffer_head *bh)
if (bh->b_assoc_map) {
struct address_space *buffer_mapping = bh->b_folio->mapping;
- spin_lock(&buffer_mapping->private_lock);
+ spin_lock(&buffer_mapping->i_private_lock);
list_del_init(&bh->b_assoc_buffers);
bh->b_assoc_map = NULL;
- spin_unlock(&buffer_mapping->private_lock);
+ spin_unlock(&buffer_mapping->i_private_lock);
}
__brelse(bh);
}
/*
* We attach and possibly dirty the buffers atomically wrt
- * block_dirty_folio() via private_lock. try_to_free_buffers
+ * block_dirty_folio() via i_private_lock. try_to_free_buffers
* is already excluded via the folio lock.
*/
struct buffer_head *create_empty_buffers(struct folio *folio,
} while (bh);
tail->b_this_page = head;
- spin_lock(&folio->mapping->private_lock);
+ spin_lock(&folio->mapping->i_private_lock);
if (folio_test_uptodate(folio) || folio_test_dirty(folio)) {
bh = head;
do {
} while (bh != head);
}
folio_attach_private(folio, head);
- spin_unlock(&folio->mapping->private_lock);
+ spin_unlock(&folio->mapping->i_private_lock);
return head;
}
struct inode *bd_inode = bdev->bd_inode;
struct address_space *bd_mapping = bd_inode->i_mapping;
struct folio_batch fbatch;
- pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+ pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
pgoff_t end;
int i, count;
struct buffer_head *bh;
struct buffer_head *head;
- end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
+ end = ((loff_t)(block + len - 1) << bd_inode->i_blkbits) / PAGE_SIZE;
folio_batch_init(&fbatch);
while (filemap_get_folios(bd_mapping, &index, end, &fbatch)) {
count = folio_batch_count(&fbatch);
if (!folio_buffers(folio))
continue;
/*
- * We use folio lock instead of bd_mapping->private_lock
+ * We use folio lock instead of bd_mapping->i_private_lock
* to pin buffers here since we can afford to sleep and
* it scales better than a global spinlock lock.
*/
}
EXPORT_SYMBOL(clean_bdev_aliases);
- /*
- * Size is a power-of-two in the range 512..PAGE_SIZE,
- * and the case we care about most is PAGE_SIZE.
- *
- * So this *could* possibly be written with those
- * constraints in mind (relevant mostly if some
- * architecture has a slow bit-scan instruction)
- */
- static inline int block_size_bits(unsigned int blocksize)
- {
- return ilog2(blocksize);
- }
-
static struct buffer_head *folio_create_buffers(struct folio *folio,
struct inode *inode,
unsigned int b_state)
*/
/*
- * While block_write_full_page is writing back the dirty buffers under
+ * While block_write_full_folio is writing back the dirty buffers under
* the page lock, whoever dirtied the buffers may decide to clean them
* again at any time. We handle that by only looking at the buffer
* state inside lock_buffer().
*
- * If block_write_full_page() is called for regular writeback
+ * If block_write_full_folio() is called for regular writeback
* (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
* locked buffer. This only can happen if someone has written the buffer
* directly, with submit_bh(). At the address_space level PageWriteback
* prevents this contention from occurring.
*
- * If block_write_full_page() is called with wbc->sync_mode ==
+ * If block_write_full_folio() is called with wbc->sync_mode ==
* WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
* causes the writes to be flagged as synchronous writes.
*/
int __block_write_full_folio(struct inode *inode, struct folio *folio,
- get_block_t *get_block, struct writeback_control *wbc,
- bh_end_io_t *handler)
+ get_block_t *get_block, struct writeback_control *wbc)
{
int err;
sector_t block;
sector_t last_block;
struct buffer_head *bh, *head;
- unsigned int blocksize, bbits;
+ size_t blocksize;
int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc);
bh = head;
blocksize = bh->b_size;
- bbits = block_size_bits(blocksize);
- block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
- last_block = (i_size_read(inode) - 1) >> bbits;
+ block = div_u64(folio_pos(folio), blocksize);
+ last_block = div_u64(i_size_read(inode) - 1, blocksize);
/*
* Get all the dirty buffers mapped to disk addresses and
* truncate in progress.
*/
/*
- * The buffer was zeroed by block_write_full_page()
+ * The buffer was zeroed by block_write_full_folio()
*/
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
continue;
}
if (test_clear_buffer_dirty(bh)) {
- mark_buffer_async_write_endio(bh, handler);
+ mark_buffer_async_write_endio(bh,
+ end_buffer_async_write);
} else {
unlock_buffer(bh);
}
if (buffer_mapped(bh) && buffer_dirty(bh) &&
!buffer_delay(bh)) {
lock_buffer(bh);
- mark_buffer_async_write_endio(bh, handler);
+ mark_buffer_async_write_endio(bh,
+ end_buffer_async_write);
} else {
/*
* The buffer may have been set dirty during
iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
const struct iomap *iomap)
{
- loff_t offset = block << inode->i_blkbits;
+ loff_t offset = (loff_t)block << inode->i_blkbits;
bh->b_bdev = iomap->bdev;
int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
get_block_t *get_block, const struct iomap *iomap)
{
- unsigned from = pos & (PAGE_SIZE - 1);
- unsigned to = from + len;
+ size_t from = offset_in_folio(folio, pos);
+ size_t to = from + len;
struct inode *inode = folio->mapping->host;
- unsigned block_start, block_end;
+ size_t block_start, block_end;
sector_t block;
int err = 0;
- unsigned blocksize, bbits;
+ size_t blocksize;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
BUG_ON(!folio_test_locked(folio));
- BUG_ON(from > PAGE_SIZE);
- BUG_ON(to > PAGE_SIZE);
+ BUG_ON(to > folio_size(folio));
BUG_ON(from > to);
head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size;
- bbits = block_size_bits(blocksize);
-
- block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+ block = div_u64(folio_pos(folio), blocksize);
- for(bh = head, block_start = 0; bh != head || !block_start;
+ for (bh = head, block_start = 0; bh != head || !block_start;
block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
- unsigned int blocksize, bbits;
+ size_t blocksize;
int nr, i;
int fully_mapped = 1;
bool page_error = false;
head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size;
- bbits = block_size_bits(blocksize);
- iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
- lblock = (limit+blocksize-1) >> bbits;
+ iblock = div_u64(folio_pos(folio), blocksize);
+ lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
nr = 0;
i = 0;
return 0;
length = blocksize - length;
- iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);
-
+ iblock = ((loff_t)index * PAGE_SIZE) >> inode->i_blkbits;
+
folio = filemap_grab_folio(mapping, index);
if (IS_ERR(folio))
return PTR_ERR(folio);
/*
* The generic ->writepage function for buffer-backed address_spaces
*/
- int block_write_full_page(struct page *page, get_block_t *get_block,
- struct writeback_control *wbc)
+ int block_write_full_folio(struct folio *folio, struct writeback_control *wbc,
+ void *get_block)
{
- struct folio *folio = page_folio(page);
struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
/* Is the folio fully inside i_size? */
if (folio_pos(folio) + folio_size(folio) <= i_size)
- return __block_write_full_folio(inode, folio, get_block, wbc,
- end_buffer_async_write);
+ return __block_write_full_folio(inode, folio, get_block, wbc);
/* Is the folio fully outside i_size? (truncate in progress) */
if (folio_pos(folio) >= i_size) {
*/
folio_zero_segment(folio, offset_in_folio(folio, i_size),
folio_size(folio));
- return __block_write_full_folio(inode, folio, get_block, wbc,
- end_buffer_async_write);
+ return __block_write_full_folio(inode, folio, get_block, wbc);
}
- EXPORT_SYMBOL(block_write_full_page);
sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
get_block_t *get_block)
* are unused, and releases them if so.
*
* Exclusion against try_to_free_buffers may be obtained by either
- * locking the folio or by holding its mapping's private_lock.
+ * locking the folio or by holding its mapping's i_private_lock.
*
* If the folio is dirty but all the buffers are clean then we need to
* be sure to mark the folio clean as well. This is because the folio
* The same applies to regular filesystem folios: if all the buffers are
* clean then we set the folio clean and proceed. To do that, we require
* total exclusion from block_dirty_folio(). That is obtained with
- * private_lock.
+ * i_private_lock.
*
* try_to_free_buffers() is non-blocking.
*/
goto out;
}
- spin_lock(&mapping->private_lock);
+ spin_lock(&mapping->i_private_lock);
ret = drop_buffers(folio, &buffers_to_free);
/*
* the folio's buffers clean. We discover that here and clean
* the folio also.
*
- * private_lock must be held over this entire operation in order
+ * i_private_lock must be held over this entire operation in order
* to synchronise against block_dirty_folio and prevent the
* dirty bit from being lost.
*/
if (ret)
folio_cancel_dirty(folio);
- spin_unlock(&mapping->private_lock);
+ spin_unlock(&mapping->i_private_lock);
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
* We need to pick up the new inode size which generic_commit_write gave us
* `file' can be NULL - eg, when called from page_symlink().
*
- * ext4 never places buffers on inode->i_mapping->private_list. metadata
+ * ext4 never places buffers on inode->i_mapping->i_private_list. metadata
* buffers are managed internally.
*/
static int ext4_write_end(struct file *file,
}
/* Any metadata buffers to write? */
- if (!list_empty(&inode->i_mapping->private_list))
+ if (!list_empty(&inode->i_mapping->i_private_list))
return true;
return inode->i_state & I_DIRTY_DATASYNC;
}
.direct_IO = noop_direct_IO,
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
.swap_activate = ext4_iomap_swap_activate,
};
.direct_IO = noop_direct_IO,
.migrate_folio = buffer_migrate_folio_norefs,
.is_partially_uptodate = block_is_partially_uptodate,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
.swap_activate = ext4_iomap_swap_activate,
};
.direct_IO = noop_direct_IO,
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
.swap_activate = ext4_iomap_swap_activate,
};
* at inode creation time. If this is a device special inode,
* i_mapping may not point to the original address space.
*/
- resv_map = (struct resv_map *)(&inode->i_data)->private_data;
+ resv_map = (struct resv_map *)(&inode->i_data)->i_private_data;
/* Only regular and link inodes have associated reserve maps */
if (resv_map)
resv_map_release(&resv_map->refs);
&hugetlbfs_i_mmap_rwsem_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
simple_inode_init_ts(inode);
- inode->i_mapping->private_data = resv_map;
+ inode->i_mapping->i_private_data = resv_map;
info->seals = F_SEAL_SEAL;
switch (mode & S_IFMT) {
default:
#define hugetlbfs_migrate_folio NULL
#endif
- static int hugetlbfs_error_remove_page(struct address_space *mapping,
- struct page *page)
+ static int hugetlbfs_error_remove_folio(struct address_space *mapping,
+ struct folio *folio)
{
return 0;
}
.write_end = hugetlbfs_write_end,
.dirty_folio = noop_dirty_folio,
.migrate_folio = hugetlbfs_migrate_folio,
- .error_remove_page = hugetlbfs_error_remove_page,
+ .error_remove_folio = hugetlbfs_error_remove_folio,
};
atomic_set(&mapping->nr_thps, 0);
#endif
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
- mapping->private_data = NULL;
+ mapping->i_private_data = NULL;
mapping->writeback_index = 0;
init_rwsem(&mapping->invalidate_lock);
lockdep_set_class_and_name(&mapping->invalidate_lock,
{
xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
init_rwsem(&mapping->i_mmap_rwsem);
- INIT_LIST_HEAD(&mapping->private_list);
- spin_lock_init(&mapping->private_lock);
+ INIT_LIST_HEAD(&mapping->i_private_list);
+ spin_lock_init(&mapping->i_private_lock);
mapping->i_mmap = RB_ROOT_CACHED;
}
if (!mapping_shrinkable(&inode->i_data))
return;
- if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
+ if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_inc(nr_unused);
else if (rotate)
inode->i_state |= I_REFERENCED;
static void inode_lru_list_del(struct inode *inode)
{
- if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+ if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_dec(nr_unused);
}
* nor even WARN_ON(!mapping_empty).
*/
xa_unlock_irq(&inode->i_data.i_pages);
- BUG_ON(!list_empty(&inode->i_data.private_list));
+ BUG_ON(!list_empty(&inode->i_data.i_private_list));
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
BUG_ON(!list_empty(&inode->i_wb_list));
* earlier than or equal to either the ctime or mtime,
* or if at least a day has passed since the last atime update.
*/
-static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
+static bool relatime_need_update(struct vfsmount *mnt, struct inode *inode,
struct timespec64 now)
{
struct timespec64 atime, mtime, ctime;
if (!(mnt->mnt_flags & MNT_RELATIME))
- return 1;
+ return true;
/*
* Is mtime younger than or equal to atime? If yes, update atime:
*/
atime = inode_get_atime(inode);
mtime = inode_get_mtime(inode);
if (timespec64_compare(&mtime, &atime) >= 0)
- return 1;
+ return true;
/*
* Is ctime younger than or equal to atime? If yes, update atime:
*/
ctime = inode_get_ctime(inode);
if (timespec64_compare(&ctime, &atime) >= 0)
- return 1;
+ return true;
/*
* Is the previous atime value older than a day? If yes,
* update atime:
*/
if ((long)(now.tv_sec - atime.tv_sec) >= 24*60*60)
- return 1;
+ return true;
/*
* Good, we can skip the atime update:
*/
- return 0;
+ return false;
}
/**
* the vfsmount must be passed through @idmap. This function will then take
* care to map the inode according to @idmap before checking permissions.
* On non-idmapped mounts or if permission checking is to be performed on the
- * raw inode simply passs @nop_mnt_idmap.
+ * raw inode simply pass @nop_mnt_idmap.
*/
bool inode_owner_or_capable(struct mnt_idmap *idmap,
const struct inode *inode)
* page cleaned. The VM has already locked the page and marked it clean.
*
* For non-resident attributes, ntfs_writepage() writes the @page by calling
- * the ntfs version of the generic block_write_full_page() function,
+ * the ntfs version of the generic block_write_full_folio() function,
* ntfs_write_block(), which in turn if necessary creates and writes the
* buffers associated with the page asynchronously.
*
* vfs inode dirty code path for the inode the mft record belongs to or via the
* vm page dirty code path for the page the mft record is in.
*
- * Based on ntfs_read_folio() and fs/buffer.c::block_write_full_page().
+ * Based on ntfs_read_folio() and fs/buffer.c::block_write_full_folio().
*
* Return 0 on success and -errno on error.
*/
.bmap = ntfs_bmap,
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
};
/*
#endif /* NTFS_RW */
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
};
/*
#endif /* NTFS_RW */
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
- .error_remove_page = generic_error_remove_page,
+ .error_remove_folio = generic_error_remove_folio,
};
#ifdef NTFS_RW
*
* If the page does not have buffers, we create them and set them uptodate.
* The page may not be locked which is why we need to handle the buffers under
- * the mapping->private_lock. Once the buffers are marked dirty we no longer
+ * the mapping->i_private_lock. Once the buffers are marked dirty we no longer
* need the lock since try_to_free_buffers() does not free dirty buffers.
*/
void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) {
BUG_ON(!PageUptodate(page));
end = ofs + ni->itype.index.block_size;
bh_size = VFS_I(ni)->i_sb->s_blocksize;
- spin_lock(&mapping->private_lock);
+ spin_lock(&mapping->i_private_lock);
if (unlikely(!page_has_buffers(page))) {
- spin_unlock(&mapping->private_lock);
+ spin_unlock(&mapping->i_private_lock);
bh = head = alloc_page_buffers(page, bh_size, true);
- spin_lock(&mapping->private_lock);
+ spin_lock(&mapping->i_private_lock);
if (likely(!page_has_buffers(page))) {
struct buffer_head *tail;
break;
set_buffer_dirty(bh);
} while ((bh = bh->b_this_page) != head);
- spin_unlock(&mapping->private_lock);
+ spin_unlock(&mapping->i_private_lock);
filemap_dirty_folio(mapping, page_folio(page));
if (unlikely(buffers_to_free)) {
do {
const char *name = NULL;
if (file) {
- struct inode *inode = file_inode(vma->vm_file);
+ const struct inode *inode = file_user_inode(vma->vm_file);
+
dev = inode->i_sb->s_dev;
ino = inode->i_ino;
pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
__show_smap(m, &mss, false);
seq_printf(m, "THPeligible: %8u\n",
- hugepage_vma_check(vma, vma->vm_flags, true, false, true));
+ !!thp_vma_allowable_orders(vma, vma->vm_flags, true, false,
+ true, THP_ORDERS_ALL));
if (arch_pkeys_enabled())
seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
#define PM_SCAN_CATEGORIES (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \
PAGE_IS_FILE | PAGE_IS_PRESENT | \
PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
- PAGE_IS_HUGE)
+ PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY)
#define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
struct pagemap_scan_private {
if (is_zero_pfn(pte_pfn(pte)))
categories |= PAGE_IS_PFNZERO;
+ if (pte_soft_dirty(pte))
+ categories |= PAGE_IS_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
swp_entry_t swp;
!PageAnon(pfn_swap_entry_to_page(swp)))
categories |= PAGE_IS_FILE;
}
+ if (pte_swp_soft_dirty(pte))
+ categories |= PAGE_IS_SOFT_DIRTY;
}
return categories;
if (is_zero_pfn(pmd_pfn(pmd)))
categories |= PAGE_IS_PFNZERO;
+ if (pmd_soft_dirty(pmd))
+ categories |= PAGE_IS_SOFT_DIRTY;
} else if (is_swap_pmd(pmd)) {
swp_entry_t swp;
categories |= PAGE_IS_SWAPPED;
if (!pmd_swp_uffd_wp(pmd))
categories |= PAGE_IS_WRITTEN;
+ if (pmd_swp_soft_dirty(pmd))
+ categories |= PAGE_IS_SOFT_DIRTY;
if (p->masks_of_interest & PAGE_IS_FILE) {
swp = pmd_to_swp_entry(pmd);
categories |= PAGE_IS_FILE;
if (is_zero_pfn(pte_pfn(pte)))
categories |= PAGE_IS_PFNZERO;
+ if (pte_soft_dirty(pte))
+ categories |= PAGE_IS_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
categories |= PAGE_IS_SWAPPED;
if (!pte_swp_uffd_wp_any(pte))
categories |= PAGE_IS_WRITTEN;
+ if (pte_swp_soft_dirty(pte))
+ categories |= PAGE_IS_SOFT_DIRTY;
}
return categories;
if (wp_allowed)
vma_category |= PAGE_IS_WPALLOWED;
+ if (vma->vm_flags & VM_SOFTDIRTY)
+ vma_category |= PAGE_IS_SOFT_DIRTY;
+
if (!pagemap_scan_is_interesting_vma(vma_category, p))
return 1;
*/
if (!folio_clear_dirty_for_io(folio))
WARN_ON(1);
- if (folio_start_writeback(folio))
- WARN_ON(1);
+ folio_start_writeback(folio);
*_count -= folio_nr_pages(folio);
folio_unlock(folio);
int rc;
/* The folio should be locked, dirty and not undergoing writeback. */
- if (folio_start_writeback(folio))
- WARN_ON(1);
+ folio_start_writeback(folio);
count -= folio_nr_pages(folio);
len = folio_size(folio);
/* we do not want atime to be less than mtime, it broke some apps */
atime = inode_set_atime_to_ts(inode, current_time(inode));
mtime = inode_get_mtime(inode);
- if (timespec64_compare(&atime, &mtime))
+ if (timespec64_compare(&atime, &mtime) < 0)
inode_set_atime_to_ts(inode, inode_get_mtime(inode));
if (PAGE_SIZE > rc)
CPUHP_PERF_POWER,
CPUHP_PERF_SUPERH,
CPUHP_X86_HPET_DEAD,
- CPUHP_X86_APB_DEAD,
CPUHP_X86_MCE_DEAD,
CPUHP_VIRT_NET_DEAD,
CPUHP_IBMVNIC_DEAD,
CPUHP_SLUB_DEAD,
CPUHP_DEBUG_OBJ_DEAD,
CPUHP_MM_WRITEBACK_DEAD,
- /* Must be after CPUHP_MM_VMSTAT_DEAD */
- CPUHP_MM_DEMOTION_DEAD,
CPUHP_MM_VMSTAT_DEAD,
CPUHP_SOFTIRQ_DEAD,
CPUHP_NET_MVNETA_DEAD,
CPUHP_NET_DEV_DEAD,
CPUHP_PCI_XGENE_DEAD,
CPUHP_IOMMU_IOVA_DEAD,
- CPUHP_LUSTRE_CFS_DEAD,
CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,
CPUHP_PADATA_DEAD,
CPUHP_AP_DTPM_CPU_DEAD,
CPUHP_X2APIC_PREPARE,
CPUHP_SMPCFD_PREPARE,
CPUHP_RELAY_PREPARE,
- CPUHP_SLAB_PREPARE,
CPUHP_MD_RAID5_PREPARE,
CPUHP_RCUTREE_PREP,
CPUHP_CPUIDLE_COUPLED_PREPARE,
CPUHP_XEN_EVTCHN_PREPARE,
CPUHP_ARM_SHMOBILE_SCU_PREPARE,
CPUHP_SH_SH3X_PREPARE,
- CPUHP_NET_FLOW_PREPARE,
CPUHP_TOPOLOGY_PREPARE,
CPUHP_NET_IUCV_PREPARE,
CPUHP_ARM_BL_PREPARE,
CPUHP_TRACE_RB_PREPARE,
CPUHP_MM_ZS_PREPARE,
- CPUHP_MM_ZSWP_MEM_PREPARE,
CPUHP_MM_ZSWP_POOL_PREPARE,
CPUHP_KVM_PPC_BOOK3S_PREPARE,
CPUHP_ZCOMP_PREPARE,
CPUHP_AP_IRQ_ARMADA_XP_STARTING,
CPUHP_AP_IRQ_BCM2836_STARTING,
CPUHP_AP_IRQ_MIPS_GIC_STARTING,
- CPUHP_AP_IRQ_RISCV_STARTING,
CPUHP_AP_IRQ_LOONGARCH_STARTING,
CPUHP_AP_IRQ_SIFIVE_PLIC_STARTING,
CPUHP_AP_ARM_MVEBU_COHERENCY,
- CPUHP_AP_MICROCODE_LOADER,
CPUHP_AP_PERF_X86_AMD_UNCORE_STARTING,
CPUHP_AP_PERF_X86_STARTING,
CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
- CPUHP_AP_PERF_X86_CQM_STARTING,
CPUHP_AP_PERF_X86_CSTATE_STARTING,
CPUHP_AP_PERF_XTENSA_STARTING,
- CPUHP_AP_MIPS_OP_LOONGSON3_STARTING,
CPUHP_AP_ARM_VFP_STARTING,
CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING,
CPUHP_AP_PERF_ARM_HW_BREAKPOINT_STARTING,
CPUHP_AP_QCOM_TIMER_STARTING,
CPUHP_AP_TEGRA_TIMER_STARTING,
CPUHP_AP_ARMADA_TIMER_STARTING,
- CPUHP_AP_MARCO_TIMER_STARTING,
CPUHP_AP_MIPS_GIC_TIMER_STARTING,
CPUHP_AP_ARC_TIMER_STARTING,
CPUHP_AP_RISCV_TIMER_STARTING,
CPUHP_AP_PERF_X86_AMD_UNCORE_ONLINE,
CPUHP_AP_PERF_X86_AMD_POWER_ONLINE,
CPUHP_AP_PERF_X86_RAPL_ONLINE,
- CPUHP_AP_PERF_X86_CQM_ONLINE,
CPUHP_AP_PERF_X86_CSTATE_ONLINE,
- CPUHP_AP_PERF_X86_IDXD_ONLINE,
CPUHP_AP_PERF_S390_CF_ONLINE,
CPUHP_AP_PERF_S390_SF_ONLINE,
CPUHP_AP_PERF_ARM_CCI_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
CPUHP_AP_BASE_CACHEINFO_ONLINE,
CPUHP_AP_ONLINE_DYN,
- CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30,
- /* Must be after CPUHP_AP_ONLINE_DYN for node_states[N_CPU] update */
- CPUHP_AP_MM_DEMOTION_ONLINE,
+ CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 40,
CPUHP_AP_X86_HPET_ONLINE,
CPUHP_AP_X86_KVM_CLK_ONLINE,
CPUHP_AP_ACTIVE,
bool (*is_partially_uptodate) (struct folio *, size_t from,
size_t count);
void (*is_dirty_writeback) (struct folio *, bool *dirty, bool *wb);
- int (*error_remove_page)(struct address_space *, struct page *);
+ int (*error_remove_folio)(struct address_space *, struct folio *);
/* swapfile support */
int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
* @a_ops: Methods.
* @flags: Error bits and flags (AS_*).
* @wb_err: The most recent error which has occurred.
- * @private_lock: For use by the owner of the address_space.
- * @private_list: For use by the owner of the address_space.
- * @private_data: For use by the owner of the address_space.
+ * @i_private_lock: For use by the owner of the address_space.
+ * @i_private_list: For use by the owner of the address_space.
+ * @i_private_data: For use by the owner of the address_space.
*/
struct address_space {
struct inode *host;
unsigned long flags;
struct rw_semaphore i_mmap_rwsem;
errseq_t wb_err;
- spinlock_t private_lock;
- struct list_head private_list;
- void *private_data;
+ spinlock_t i_private_lock;
+ struct list_head i_private_list;
+ void * i_private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
* On most architectures that alignment is already the case; but
*/
struct file {
union {
+ /* fput() uses task work when closing and freeing file (default). */
+ struct callback_head f_task_work;
+ /* fput() must use workqueue (most kernel threads). */
struct llist_node f_llist;
- struct rcu_head f_rcuhead;
unsigned int f_iocb_flags;
};
struct sb_writers {
unsigned short frozen; /* Is sb frozen? */
- unsigned short freeze_holders; /* Who froze fs? */
+ int freeze_kcount; /* How many kernel freeze requests? */
+ int freeze_ucount; /* How many userspace freeze requests? */
struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
};
#define __sb_writers_release(sb, lev) \
percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
+/**
+ * __sb_write_started - check if sb freeze level is held
+ * @sb: the super we write to
+ * @level: the freeze level
+ *
+ * * > 0 - sb freeze level is held
+ * * 0 - sb freeze level is not held
+ * * < 0 - !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN
+ */
+static inline int __sb_write_started(const struct super_block *sb, int level)
+{
+ return lockdep_is_held_type(sb->s_writers.rw_sem + level - 1, 1);
+}
+
+/**
+ * sb_write_started - check if SB_FREEZE_WRITE is held
+ * @sb: the super we write to
+ *
+ * May be false positive with !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN.
+ */
static inline bool sb_write_started(const struct super_block *sb)
{
- return lockdep_is_held_type(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1, 1);
+ return __sb_write_started(sb, SB_FREEZE_WRITE);
+}
+
+/**
+ * sb_write_not_started - check if SB_FREEZE_WRITE is not held
+ * @sb: the super we write to
+ *
+ * May be false positive with !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN.
+ */
+static inline bool sb_write_not_started(const struct super_block *sb)
+{
+ return __sb_write_started(sb, SB_FREEZE_WRITE) <= 0;
+}
+
+/**
+ * file_write_started - check if SB_FREEZE_WRITE is held
+ * @file: the file we write to
+ *
+ * May be false positive with !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN.
+ * May be false positive with !S_ISREG, because file_start_write() has
+ * no effect on !S_ISREG.
+ */
+static inline bool file_write_started(const struct file *file)
+{
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return true;
+ return sb_write_started(file_inode(file)->i_sb);
+}
+
+/**
+ * file_write_not_started - check if SB_FREEZE_WRITE is not held
+ * @file: the file we write to
+ *
+ * May be false positive with !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN.
+ * May be false positive with !S_ISREG, because file_start_write() has
+ * no effect on !S_ISREG.
+ */
+static inline bool file_write_not_started(const struct file *file)
+{
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return true;
+ return sb_write_not_started(file_inode(file)->i_sb);
}
/**
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
-extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags);
int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags,
struct file *dst_file, loff_t dst_pos,
loff_t len, unsigned int remap_flags);
+/**
+ * enum freeze_holder - holder of the freeze
+ * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
+ * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
+ * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ *
+ * Indicate who the owner of the freeze or thaw request is and whether
+ * the freeze needs to be exclusive or can nest.
+ * Without @FREEZE_MAY_NEST, multiple freeze and thaw requests from the
+ * same holder aren't allowed. It is however allowed to hold a single
+ * @FREEZE_HOLDER_USERSPACE and a single @FREEZE_HOLDER_KERNEL freeze at
+ * the same time. This is relied upon by some filesystems during online
+ * repair or similar.
+ */
enum freeze_holder {
FREEZE_HOLDER_KERNEL = (1U << 0),
FREEZE_HOLDER_USERSPACE = (1U << 1),
+ FREEZE_MAY_NEST = (1U << 2),
};
struct super_operations {
const struct cred *creds);
struct file *dentry_create(const struct path *path, int flags, umode_t mode,
const struct cred *cred);
-struct file *backing_file_open(const struct path *user_path, int flags,
- const struct path *real_path,
- const struct cred *cred);
struct path *backing_file_user_path(struct file *f);
/*
- * file_user_path - get the path to display for memory mapped file
- *
* When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
* stored in ->vm_file is a backing file whose f_inode is on the underlying
- * filesystem. When the mapped file path is displayed to user (e.g. via
- * /proc/<pid>/maps), this helper should be used to get the path to display
- * to the user, which is the path of the fd that user has requested to map.
+ * filesystem. When the mapped file path and inode number are displayed to
+ * user (e.g. via /proc/<pid>/maps), these helpers should be used to get the
+ * path and inode number to display to the user, which is the path of the fd
+ * that user has requested to map and the inode number that would be returned
+ * by fstat() on that same fd.
*/
+/* Get the path to display in /proc/<pid>/maps */
static inline const struct path *file_user_path(struct file *f)
{
if (unlikely(f->f_mode & FMODE_BACKING))
return backing_file_user_path(f);
return &f->f_path;
}
+/* Get the inode whose inode number to display in /proc/<pid>/maps */
+static inline const struct inode *file_user_inode(struct file *f)
+{
+ if (unlikely(f->f_mode & FMODE_BACKING))
+ return d_inode(backing_file_user_path(f)->dentry);
+ return file_inode(f);
+}
static inline struct file *file_clone_open(struct file *file)
{
size_t len, unsigned int flags);
extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
struct file *, loff_t *, size_t, unsigned int);
-extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
- loff_t *opos, size_t len, unsigned int flags);
extern void
extern struct file_system_type *get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
extern struct file_system_type *get_fs_type(const char *name);
-extern struct super_block *get_active_super(struct block_device *bdev);
extern void drop_super(struct super_block *sb);
extern void drop_super_exclusive(struct super_block *sb);
extern void iterate_supers(void (*)(struct super_block *, void *), void *);
* @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
* @__page_mapping: Aliases with page->mapping. Unused for page tables.
* @pt_mm: Used for x86 pgds.
- * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only.
+ * @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
* @_pt_pad_2: Padding to ensure proper alignment.
* @ptl: Lock for the page table.
* @__page_type: Same as page->page_type. Unused for page tables.
- * @_refcount: Same as page refcount. Used for s390 page tables.
+ * @__page_refcount: Same as page refcount.
* @pt_memcg_data: Memcg data. Tracked for page tables here.
*
* This struct overlays struct page for now. Do not modify without a good
#endif
};
unsigned int __page_type;
- atomic_t _refcount;
+ atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
unsigned long pt_memcg_data;
#endif
TABLE_MATCH(mapping, __page_mapping);
TABLE_MATCH(rcu_head, pt_rcu_head);
TABLE_MATCH(page_type, __page_type);
- TABLE_MATCH(_refcount, _refcount);
+ TABLE_MATCH(_refcount, __page_refcount);
#ifdef CONFIG_MEMCG
TABLE_MATCH(memcg_data, pt_memcg_data);
#endif
*/
unsigned long pids_active[2];
+ /* MM scan sequence ID when scan first started after VMA creation */
+ int start_scan_seq;
+
/*
* MM scan sequence ID when the VMA was last completely scanned.
* A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
*/
unsigned long ksm_zero_pages;
#endif /* CONFIG_KSM */
- #ifdef CONFIG_LRU_GEN
+ #ifdef CONFIG_LRU_GEN_WALKS_MMU
struct {
/* this mm_struct is on lru_gen_mm_list */
struct list_head list;
struct mem_cgroup *memcg;
#endif
} lru_gen;
- #endif /* CONFIG_LRU_GEN */
+ #endif /* CONFIG_LRU_GEN_WALKS_MMU */
} __randomize_layout;
/*
spinlock_t lock;
};
+ #endif /* CONFIG_LRU_GEN */
+
+ #ifdef CONFIG_LRU_GEN_WALKS_MMU
+
void lru_gen_add_mm(struct mm_struct *mm);
void lru_gen_del_mm(struct mm_struct *mm);
- #ifdef CONFIG_MEMCG
void lru_gen_migrate_mm(struct mm_struct *mm);
- #endif
static inline void lru_gen_init_mm(struct mm_struct *mm)
{
WRITE_ONCE(mm->lru_gen.bitmap, -1);
}
- #else /* !CONFIG_LRU_GEN */
+ #else /* !CONFIG_LRU_GEN_WALKS_MMU */
static inline void lru_gen_add_mm(struct mm_struct *mm)
{
{
}
- #ifdef CONFIG_MEMCG
static inline void lru_gen_migrate_mm(struct mm_struct *mm)
{
}
- #endif
static inline void lru_gen_init_mm(struct mm_struct *mm)
{
{
}
- #endif /* CONFIG_LRU_GEN */
+ #endif /* CONFIG_LRU_GEN_WALKS_MMU */
struct vma_iterator {
struct ma_state mas;
.mas = { \
.tree = &(__mm)->mm_mt, \
.index = __addr, \
- .node = MAS_START, \
+ .node = NULL, \
+ .status = ma_start, \
}, \
}
/*
* Flags to pass to kmem_cache_create().
- * The ones marked DEBUG are only valid if CONFIG_DEBUG_SLAB is set.
+ * The ones marked DEBUG need CONFIG_SLUB_DEBUG enabled, otherwise are no-op
*/
/* DEBUG: Perform (expensive) checks on alloc/free */
#define SLAB_CONSISTENCY_CHECKS ((slab_flags_t __force)0x00000100U)
* Kmalloc array related definitions
*/
-#ifdef CONFIG_SLAB
/*
- * SLAB and SLUB directly allocates requests fitting in to an order-1 page
+ * SLUB directly allocates requests fitting in to an order-1 page
* (PAGE_SIZE*2). Larger requests are passed to the page allocator.
*/
#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
- #define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT)
+ #define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT)
#ifndef KMALLOC_SHIFT_LOW
-#define KMALLOC_SHIFT_LOW 5
-#endif
-#endif
-
-#ifdef CONFIG_SLUB
-#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
-#define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT)
-#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
-#endif
/* Maximum allocatable size */
#define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_MAX)
void __init kmem_cache_init_late(void);
-#if defined(CONFIG_SMP) && defined(CONFIG_SLAB)
-int slab_prepare_cpu(unsigned int cpu);
-int slab_dead_cpu(unsigned int cpu);
-#else
-#define slab_prepare_cpu NULL
-#define slab_dead_cpu NULL
-#endif
-
#endif /* _LINUX_SLAB_H */
(HAVE_ARCH_KASAN_SW_TAGS && CC_HAS_KASAN_SW_TAGS)) && \
CC_HAS_WORKING_NOSANITIZE_ADDRESS) || \
HAVE_ARCH_KASAN_HW_TAGS
- depends on (SLUB && SYSFS && !SLUB_TINY) || (SLAB && !DEBUG_SLAB)
+ depends on SYSFS && !SLUB_TINY
select STACKDEPOT_ALWAYS_INIT
help
Enables KASAN (Kernel Address Sanitizer) - a dynamic memory safety
bool "Generic KASAN"
depends on HAVE_ARCH_KASAN && CC_HAS_KASAN_GENERIC
depends on CC_HAS_WORKING_NOSANITIZE_ADDRESS
- select SLUB_DEBUG if SLUB
+ select SLUB_DEBUG
select CONSTRUCTORS
help
Enables Generic KASAN.
overhead of ~50% for dynamic allocations.
The performance slowdown is ~x3.
- (Incompatible with CONFIG_DEBUG_SLAB: the kernel does not boot.)
-
config KASAN_SW_TAGS
bool "Software Tag-Based KASAN"
depends on HAVE_ARCH_KASAN_SW_TAGS && CC_HAS_KASAN_SW_TAGS
depends on CC_HAS_WORKING_NOSANITIZE_ADDRESS
- select SLUB_DEBUG if SLUB
+ select SLUB_DEBUG
select CONSTRUCTORS
help
Enables Software Tag-Based KASAN.
May potentially introduce problems related to pointer casting and
comparison, as it embeds a tag into the top byte of each pointer.
- (Incompatible with CONFIG_DEBUG_SLAB: the kernel does not boot.)
-
config KASAN_HW_TAGS
bool "Hardware Tag-Based KASAN"
depends on HAVE_ARCH_KASAN_HW_TAGS
- depends on SLUB
help
Enables Hardware Tag-Based KASAN.
choice
prompt "Instrumentation type"
depends on KASAN_GENERIC || KASAN_SW_TAGS
- default KASAN_OUTLINE
+ default KASAN_INLINE if !ARCH_DISABLE_KASAN_INLINE
config KASAN_OUTLINE
bool "Outline instrumentation"
A part of the KASAN test suite that is not integrated with KUnit.
Incompatible with Hardware Tag-Based KASAN.
+ config KASAN_EXTRA_INFO
+ bool "Record and report more information"
+ depends on KASAN
+ help
+ Record and report more information to help us find the cause of the
+ bug and to help us correlate the error with other system events.
+
+ Currently, the CPU number and timestamp are additionally
+ recorded for each heap block at allocation and free time, and
+ 8 bytes will be added to each metadata structure that records
+ allocation or free information.
+
+ In Generic KASAN, each kmalloc-8 and kmalloc-16 object will add
+ 16 bytes of additional memory consumption, and each kmalloc-32
+ object will add 8 bytes of additional memory consumption, not
+ affecting other larger objects.
+
+ In SW_TAGS KASAN and HW_TAGS KASAN, depending on the stack_ring_size
+ boot parameter, it will add 8 * stack_ring_size bytes of additional
+ memory consumption.
+
endif # KASAN
The cost is that if the page was never dirtied and needs to be
swapped out again, it will be re-compressed.
+ config ZSWAP_SHRINKER_DEFAULT_ON
+ bool "Shrink the zswap pool on memory pressure"
+ depends on ZSWAP
+ default n
+ help
+ If selected, the zswap shrinker will be enabled, and the pages
+ stored in the zswap pool will become available for reclaim (i.e
+ written back to the backing swap device) on memory pressure.
+
+ This means that zswap writeback could happen even if the pool is
+ not yet full, or the cgroup zswap limit has not been reached,
+ reducing the chance that cold pages will reside in the zswap pool
+ and consume memory indefinitely.
+
choice
prompt "Default compressor"
depends on ZSWAP
For more information, see zsmalloc documentation.
-menu "SLAB allocator options"
-
-choice
- prompt "Choose SLAB allocator"
- default SLUB
- help
- This option allows to select a slab allocator.
-
-config SLAB_DEPRECATED
- bool "SLAB (DEPRECATED)"
- depends on !PREEMPT_RT
- help
- Deprecated and scheduled for removal in a few cycles. Replaced by
- SLUB.
-
- and the people listed in the SLAB ALLOCATOR section of MAINTAINERS
- file, explaining why.
-
- The regular slab allocator that is established and known to work
- well in all environments. It organizes cache hot objects in
- per cpu and per node queues.
+menu "Slab allocator options"
config SLUB
- bool "SLUB (Unqueued Allocator)"
- help
- SLUB is a slab allocator that minimizes cache line usage
- instead of managing queues of cached objects (SLAB approach).
- Per cpu caching is realized using slabs of objects instead
- of queues of objects. SLUB can use memory efficiently
- and has enhanced diagnostics. SLUB is the default choice for
- a slab allocator.
-
-endchoice
-
-config SLAB
- bool
- default y
- depends on SLAB_DEPRECATED
+ def_bool y
config SLUB_TINY
- bool "Configure SLUB for minimal memory footprint"
- depends on SLUB && EXPERT
+ bool "Configure for minimal memory footprint"
+ depends on EXPERT
select SLAB_MERGE_DEFAULT
help
- Configures the SLUB allocator in a way to achieve minimal memory
+ Configures the slab allocator in a way to achieve minimal memory
footprint, sacrificing scalability, debugging and other features.
This is intended only for the smallest system that had used the
SLOB allocator and is not recommended for systems with more than
config SLAB_MERGE_DEFAULT
bool "Allow slab caches to be merged"
default y
- depends on SLAB || SLUB
help
For reduced kernel memory fragmentation, slab caches can be
merged when they share the same size and other characteristics.
config SLAB_FREELIST_RANDOM
bool "Randomize slab freelist"
- depends on SLAB || (SLUB && !SLUB_TINY)
+ depends on !SLUB_TINY
help
Randomizes the freelist order used on creating new pages. This
security feature reduces the predictability of the kernel slab
config SLAB_FREELIST_HARDENED
bool "Harden slab freelist metadata"
- depends on SLAB || (SLUB && !SLUB_TINY)
+ depends on !SLUB_TINY
help
Many kernel heap attacks try to target slab cache metadata and
other infrastructure. This options makes minor performance
sacrifices to harden the kernel slab allocator against common
- freelist exploit methods. Some slab implementations have more
- sanity-checking than others. This option is most effective with
- CONFIG_SLUB.
+ freelist exploit methods.
config SLUB_STATS
default n
- bool "Enable SLUB performance statistics"
- depends on SLUB && SYSFS && !SLUB_TINY
+ bool "Enable performance statistics"
+ depends on SYSFS && !SLUB_TINY
help
- SLUB statistics are useful to debug SLUBs allocation behavior in
+ The statistics are useful to debug slab allocation behavior in
order find ways to optimize the allocator. This should never be
enabled for production use since keeping statistics slows down
the allocator by a few percentage points. The slabinfo command
config SLUB_CPU_PARTIAL
default y
- depends on SLUB && SMP && !SLUB_TINY
- bool "SLUB per cpu partial cache"
+ depends on SMP && !SLUB_TINY
+ bool "Enable per cpu partial caches"
help
Per cpu partial caches accelerate objects allocation and freeing
that is local to a processor at the price of more indeterminism
config RANDOM_KMALLOC_CACHES
default n
- depends on SLUB && !SLUB_TINY
+ depends on !SLUB_TINY
bool "Randomize slab caches for normal kmalloc"
help
A hardening feature that creates multiple copies of slab caches for
limited degree of memory and CPU overhead that relates to hardware and
system workload.
-endmenu # SLAB allocator options
+endmenu # Slab allocator options
config SHUFFLE_PAGE_ALLOCATOR
bool "Page allocator randomization"
the presence of a memory-side-cache. There are also incidental
security benefits as it reduces the predictability of page
allocations to compliment SLAB_FREELIST_RANDOM, but the
- default granularity of shuffling on the MAX_ORDER i.e, 10th
+ default granularity of shuffling on the MAX_PAGE_ORDER i.e, 10th
order of pages is selected based on cache utilization benefits
on x86.
HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
on a platform.
- Note that the pageblock_order cannot exceed MAX_ORDER and will be
- clamped down to MAX_ORDER.
+ Note that the pageblock_order cannot exceed MAX_PAGE_ORDER and will be
+ clamped down to MAX_PAGE_ORDER.
config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
from userspace allocation. Keeping a user from writing to low pages
can help reduce the impact of kernel NULL pointer bugs.
- For most ia64, ppc64 and x86 users with lots of address space
+ For most ppc64 and x86 users with lots of address space
a value of 65536 is reasonable and should cause no problems.
On arm and other archs it should not be higher than 32768.
Programs which use vm86 functionality or have some need to map
madvise(MADV_HUGEPAGE) but it won't risk to increase the
memory footprint of applications without a guaranteed
benefit.
+
+ config TRANSPARENT_HUGEPAGE_NEVER
+ bool "never"
+ help
+ Disable Transparent Hugepage by default. It can still be
+ enabled at runtime via sysfs.
endchoice
config THP_SWAP
from evicted generations for debugging purpose.
This option has a per-memcg and per-node memory overhead.
+
+ config LRU_GEN_WALKS_MMU
+ def_bool y
+ depends on LRU_GEN && ARCH_HAS_HW_PTE_YOUNG
# }
config ARCH_SUPPORTS_PER_VMA_LOCK
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;
unsigned long huge_zero_pfn __read_mostly = ~0UL;
+ unsigned long huge_anon_orders_always __read_mostly;
+ unsigned long huge_anon_orders_madvise __read_mostly;
+ unsigned long huge_anon_orders_inherit __read_mostly;
+
+ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags, bool smaps,
+ bool in_pf, bool enforce_sysfs,
+ unsigned long orders)
+ {
+ /* Check the intersection of requested and supported orders. */
+ orders &= vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
+ if (!orders)
+ return 0;
- bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
- bool smaps, bool in_pf, bool enforce_sysfs)
- {
if (!vma->vm_mm) /* vdso */
- return false;
+ return 0;
/*
* Explicitly disabled through madvise or prctl, or some
* */
if ((vm_flags & VM_NOHUGEPAGE) ||
test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
- return false;
+ return 0;
/*
* If the hardware/firmware marked hugepage support disabled.
*/
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
- return false;
+ return 0;
/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
if (vma_is_dax(vma))
- return in_pf;
+ return in_pf ? orders : 0;
/*
* khugepaged special VMA and hugetlb VMA.
* VM_MIXEDMAP set.
*/
if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
- return false;
+ return 0;
/*
- * Check alignment for file vma and size for both file and anon vma.
+ * Check alignment for file vma and size for both file and anon vma by
+ * filtering out the unsuitable orders.
*
* Skip the check for page fault. Huge fault does the check in fault
- * handlers. And this check is not suitable for huge PUD fault.
+ * handlers.
*/
- if (!in_pf &&
- !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
- return false;
+ if (!in_pf) {
+ int order = highest_order(orders);
+ unsigned long addr;
+
+ while (orders) {
+ addr = vma->vm_end - (PAGE_SIZE << order);
+ if (thp_vma_suitable_order(vma, addr, order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ if (!orders)
+ return 0;
+ }
/*
* Enabled via shmem mount options or sysfs settings.
*/
if (!in_pf && shmem_file(vma->vm_file))
return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff,
- !enforce_sysfs, vma->vm_mm, vm_flags);
-
- /* Enforce sysfs THP requirements as necessary */
- if (enforce_sysfs &&
- (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
- !hugepage_flags_always())))
- return false;
+ !enforce_sysfs, vma->vm_mm, vm_flags)
+ ? orders : 0;
if (!vma_is_anonymous(vma)) {
+ /*
+ * Enforce sysfs THP requirements as necessary. Anonymous vmas
+ * were already handled in thp_vma_allowable_orders().
+ */
+ if (enforce_sysfs &&
+ (!hugepage_global_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
+ !hugepage_global_always())))
+ return 0;
+
/*
* Trust that ->huge_fault() handlers know what they are doing
* in fault path.
*/
if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
- return true;
+ return orders;
/* Only regular file is valid in collapse path */
if (((!in_pf || smaps)) && file_thp_enabled(vma))
- return true;
- return false;
+ return orders;
+ return 0;
}
if (vma_is_temporary_stack(vma))
- return false;
+ return 0;
/*
* THPeligible bit of smaps should show 1 for proper VMAs even
* the first page fault.
*/
if (!vma->anon_vma)
- return (smaps || in_pf);
+ return (smaps || in_pf) ? orders : 0;
- return true;
+ return orders;
}
static bool get_huge_zero_page(void)
.attrs = hugepage_attr,
};
+ static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
+ static void thpsize_release(struct kobject *kobj);
+ static DEFINE_SPINLOCK(huge_anon_orders_lock);
+ static LIST_HEAD(thpsize_list);
+
+ struct thpsize {
+ struct kobject kobj;
+ struct list_head node;
+ int order;
+ };
+
+ #define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
+ static ssize_t thpsize_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+ {
+ int order = to_thpsize(kobj)->order;
+ const char *output;
+
+ if (test_bit(order, &huge_anon_orders_always))
+ output = "[always] inherit madvise never";
+ else if (test_bit(order, &huge_anon_orders_inherit))
+ output = "always [inherit] madvise never";
+ else if (test_bit(order, &huge_anon_orders_madvise))
+ output = "always inherit [madvise] never";
+ else
+ output = "always inherit madvise [never]";
+
+ return sysfs_emit(buf, "%s\n", output);
+ }
+
+ static ssize_t thpsize_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+ {
+ int order = to_thpsize(kobj)->order;
+ ssize_t ret = count;
+
+ if (sysfs_streq(buf, "always")) {
+ spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ set_bit(order, &huge_anon_orders_always);
+ spin_unlock(&huge_anon_orders_lock);
+ } else if (sysfs_streq(buf, "inherit")) {
+ spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_madvise);
+ set_bit(order, &huge_anon_orders_inherit);
+ spin_unlock(&huge_anon_orders_lock);
+ } else if (sysfs_streq(buf, "madvise")) {
+ spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ set_bit(order, &huge_anon_orders_madvise);
+ spin_unlock(&huge_anon_orders_lock);
+ } else if (sysfs_streq(buf, "never")) {
+ spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ spin_unlock(&huge_anon_orders_lock);
+ } else
+ ret = -EINVAL;
+
+ return ret;
+ }
+
+ static struct kobj_attribute thpsize_enabled_attr =
+ __ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store);
+
+ static struct attribute *thpsize_attrs[] = {
+ &thpsize_enabled_attr.attr,
+ NULL,
+ };
+
+ static const struct attribute_group thpsize_attr_group = {
+ .attrs = thpsize_attrs,
+ };
+
+ static const struct kobj_type thpsize_ktype = {
+ .release = &thpsize_release,
+ .sysfs_ops = &kobj_sysfs_ops,
+ };
+
+ static struct thpsize *thpsize_create(int order, struct kobject *parent)
+ {
+ unsigned long size = (PAGE_SIZE << order) / SZ_1K;
+ struct thpsize *thpsize;
+ int ret;
+
+ thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
+ if (!thpsize)
+ return ERR_PTR(-ENOMEM);
+
+ ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
+ "hugepages-%lukB", size);
+ if (ret) {
+ kfree(thpsize);
+ return ERR_PTR(ret);
+ }
+
+ ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
+ if (ret) {
+ kobject_put(&thpsize->kobj);
+ return ERR_PTR(ret);
+ }
+
+ thpsize->order = order;
+ return thpsize;
+ }
+
+ static void thpsize_release(struct kobject *kobj)
+ {
+ kfree(to_thpsize(kobj));
+ }
+
static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
{
int err;
+ struct thpsize *thpsize;
+ unsigned long orders;
+ int order;
+
+ /*
+ * Default to setting PMD-sized THP to inherit the global setting and
+ * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
+ * constant so we have to do this here.
+ */
+ huge_anon_orders_inherit = BIT(PMD_ORDER);
*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
if (unlikely(!*hugepage_kobj)) {
goto remove_hp_group;
}
+ orders = THP_ORDERS_ALL_ANON;
+ order = highest_order(orders);
+ while (orders) {
+ thpsize = thpsize_create(order, *hugepage_kobj);
+ if (IS_ERR(thpsize)) {
+ pr_err("failed to create thpsize for order %d\n", order);
+ err = PTR_ERR(thpsize);
+ goto remove_all;
+ }
+ list_add(&thpsize->node, &thpsize_list);
+ order = next_order(&orders, order);
+ }
+
return 0;
+ remove_all:
+ hugepage_exit_sysfs(*hugepage_kobj);
+ return err;
remove_hp_group:
sysfs_remove_group(*hugepage_kobj, &hugepage_attr_group);
delete_obj:
static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
{
+ struct thpsize *thpsize, *tmp;
+
+ list_for_each_entry_safe(thpsize, tmp, &thpsize_list, node) {
+ list_del(&thpsize->node);
+ kobject_put(&thpsize->kobj);
+ }
+
sysfs_remove_group(hugepage_kobj, &khugepaged_attr_group);
sysfs_remove_group(hugepage_kobj, &hugepage_attr_group);
kobject_put(hugepage_kobj);
/*
* hugepages can't be allocated by the buddy allocator
*/
- MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_ORDER);
+ MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_PAGE_ORDER);
/*
* we use page->mapping and page->index in second tail page
* as list_head: assuming THP order >= 2
struct folio *folio;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
- if (!transhuge_vma_suitable(vma, haddr))
+ if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
{
spinlock_t *dst_ptl, *src_ptl;
struct page *src_page;
+ struct folio *src_folio;
pmd_t pmd;
pgtable_t pgtable = NULL;
int ret = -ENOMEM;
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ src_folio = page_folio(src_page);
- get_page(src_page);
- if (unlikely(page_try_dup_anon_rmap(src_page, true, src_vma))) {
+ folio_get(src_folio);
+ if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, src_vma))) {
/* Page maybe pinned: split and retry the fault on PTEs. */
- put_page(src_page);
+ folio_put(src_folio);
pte_free(dst_mm, pgtable);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
}
/*
- * TODO: once we support anonymous pages, use page_try_dup_anon_rmap()
- * and split if duplicating fails.
+ * TODO: once we support anonymous pages, use
+ * folio_try_dup_anon_rmap_*() and split if duplicating fails.
*/
pudp_set_wrprotect(src_mm, addr, src_pud);
pud = pud_mkold(pud_wrprotect(pud));
if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
- page_remove_rmap(page, vma, true);
+ folio_remove_rmap_pmd(page_folio(page), page, vma);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
return ret;
}
+ #ifdef CONFIG_USERFAULTFD
+ /*
+ * The PT lock for src_pmd and the mmap_lock for reading are held by
+ * the caller, but it must return after releasing the page_table_lock.
+ * Just move the page from src_pmd to dst_pmd if possible.
+ * Return zero if succeeded in moving the page, -EAGAIN if it needs to be
+ * repeated by the caller, or other errors in case of failure.
+ */
+ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dst_pmdval,
+ struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+ unsigned long dst_addr, unsigned long src_addr)
+ {
+ pmd_t _dst_pmd, src_pmdval;
+ struct page *src_page;
+ struct folio *src_folio;
+ struct anon_vma *src_anon_vma;
+ spinlock_t *src_ptl, *dst_ptl;
+ pgtable_t src_pgtable;
+ struct mmu_notifier_range range;
+ int err = 0;
+
+ src_pmdval = *src_pmd;
+ src_ptl = pmd_lockptr(mm, src_pmd);
+
+ lockdep_assert_held(src_ptl);
+ mmap_assert_locked(mm);
+
+ /* Sanity checks before the operation */
+ if (WARN_ON_ONCE(!pmd_none(dst_pmdval)) || WARN_ON_ONCE(src_addr & ~HPAGE_PMD_MASK) ||
+ WARN_ON_ONCE(dst_addr & ~HPAGE_PMD_MASK)) {
+ spin_unlock(src_ptl);
+ return -EINVAL;
+ }
+
+ if (!pmd_trans_huge(src_pmdval)) {
+ spin_unlock(src_ptl);
+ if (is_pmd_migration_entry(src_pmdval)) {
+ pmd_migration_entry_wait(mm, &src_pmdval);
+ return -EAGAIN;
+ }
+ return -ENOENT;
+ }
+
+ src_page = pmd_page(src_pmdval);
+ if (unlikely(!PageAnonExclusive(src_page))) {
+ spin_unlock(src_ptl);
+ return -EBUSY;
+ }
+
+ src_folio = page_folio(src_page);
+ folio_get(src_folio);
+ spin_unlock(src_ptl);
+
+ flush_cache_range(src_vma, src_addr, src_addr + HPAGE_PMD_SIZE);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ folio_lock(src_folio);
+
+ /*
+ * split_huge_page walks the anon_vma chain without the page
+ * lock. Serialize against it with the anon_vma lock, the page
+ * lock is not enough.
+ */
+ src_anon_vma = folio_get_anon_vma(src_folio);
+ if (!src_anon_vma) {
+ err = -EAGAIN;
+ goto unlock_folio;
+ }
+ anon_vma_lock_write(src_anon_vma);
+
+ dst_ptl = pmd_lockptr(mm, dst_pmd);
+ double_pt_lock(src_ptl, dst_ptl);
+ if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+ !pmd_same(*dst_pmd, dst_pmdval))) {
+ err = -EAGAIN;
+ goto unlock_ptls;
+ }
+ if (folio_maybe_dma_pinned(src_folio) ||
+ !PageAnonExclusive(&src_folio->page)) {
+ err = -EBUSY;
+ goto unlock_ptls;
+ }
+
+ if (WARN_ON_ONCE(!folio_test_head(src_folio)) ||
+ WARN_ON_ONCE(!folio_test_anon(src_folio))) {
+ err = -EBUSY;
+ goto unlock_ptls;
+ }
+
+ folio_move_anon_rmap(src_folio, dst_vma);
+ WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
+
+ src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
+ /* Folio got pinned from under us. Put it back and fail the move. */
+ if (folio_maybe_dma_pinned(src_folio)) {
+ set_pmd_at(mm, src_addr, src_pmd, src_pmdval);
+ err = -EBUSY;
+ goto unlock_ptls;
+ }
+
+ _dst_pmd = mk_huge_pmd(&src_folio->page, dst_vma->vm_page_prot);
+ /* Follow mremap() behavior and treat the entry dirty after the move */
+ _dst_pmd = pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+ set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
+
+ src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+ pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+ unlock_ptls:
+ double_pt_unlock(src_ptl, dst_ptl);
+ anon_vma_unlock_write(src_anon_vma);
+ put_anon_vma(src_anon_vma);
+ unlock_folio:
+ /* unblock rmap walks */
+ folio_unlock(src_folio);
+ mmu_notifier_invalidate_range_end(&range);
+ folio_put(src_folio);
+ return err;
+ }
+ #endif /* CONFIG_USERFAULTFD */
+
/*
* Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
*
unsigned long haddr, bool freeze)
{
struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio;
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
page = pfn_swap_entry_to_page(entry);
} else {
page = pmd_page(old_pmd);
- if (!PageDirty(page) && pmd_dirty(old_pmd))
- set_page_dirty(page);
- if (!PageReferenced(page) && pmd_young(old_pmd))
- SetPageReferenced(page);
- page_remove_rmap(page, vma, true);
- put_page(page);
+ folio = page_folio(page);
+ if (!folio_test_dirty(folio) && pmd_dirty(old_pmd))
+ folio_set_dirty(folio);
+ if (!folio_test_referenced(folio) && pmd_young(old_pmd))
+ folio_set_referenced(folio);
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
}
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
page = pmd_page(old_pmd);
+ folio = page_folio(page);
if (pmd_dirty(old_pmd)) {
dirty = true;
- SetPageDirty(page);
+ folio_set_dirty(folio);
}
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
uffd_wp = pmd_uffd_wp(old_pmd);
- VM_BUG_ON_PAGE(!page_count(page), page);
+ VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
/*
* Without "freeze", we'll simply split the PMD, propagating the
* In case we cannot clear PageAnonExclusive(), split the PMD
* only and let try_to_migrate_one() fail later.
*
- * See page_try_share_anon_rmap(): invalidate PMD first.
+ * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
*/
- anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
- if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
+ anon_exclusive = PageAnonExclusive(page);
+ if (freeze && anon_exclusive &&
+ folio_try_share_anon_rmap_pmd(folio, page))
freeze = false;
- if (!freeze)
- page_ref_add(page, HPAGE_PMD_NR - 1);
+ if (!freeze) {
+ rmap_t rmap_flags = RMAP_NONE;
+
+ folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ if (anon_exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+ folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+ vma, haddr, rmap_flags);
+ }
}
/*
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
if (write)
entry = pte_mkwrite(entry, vma);
- if (anon_exclusive)
- SetPageAnonExclusive(page + i);
if (!young)
entry = pte_mkold(entry);
/* NOTE: this may set soft-dirty too on some archs */
entry = pte_mksoft_dirty(entry);
if (uffd_wp)
entry = pte_mkuffd_wp(entry);
- page_add_anon_rmap(page + i, vma, addr, RMAP_NONE);
}
VM_BUG_ON(!pte_none(ptep_get(pte)));
set_pte_at(mm, addr, pte, entry);
pte_unmap(pte - 1);
if (!pmd_migration)
- page_remove_rmap(page, vma, true);
+ folio_remove_rmap_pmd(folio, page, vma);
if (freeze)
put_page(page);
static void unmap_folio(struct folio *folio)
{
enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
- TTU_SYNC;
+ TTU_SYNC | TTU_BATCH_FLUSH;
VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
try_to_migrate(folio, ttu_flags);
else
try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);
+
+ try_to_unmap_flush();
}
static void remap_page(struct folio *folio, unsigned long nr)
clear_compound_head(page_tail);
/* Finally unfreeze refcount. Additional reference from page cache. */
- page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
- PageSwapCache(head)));
+ page_ref_unfreeze(page_tail, 1 + (!folio_test_anon(folio) ||
+ folio_test_swapcache(folio)));
- if (page_is_young(head))
- set_page_young(page_tail);
- if (page_is_idle(head))
- set_page_idle(page_tail);
+ if (folio_test_young(folio))
+ folio_set_young(new_folio);
+ if (folio_test_idle(folio))
+ folio_set_idle(new_folio);
folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
- list_del(&folio->_deferred_list);
+ list_del_init(&folio->_deferred_list);
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct page *page)
{
+ struct folio *folio = page_folio(page);
struct vm_area_struct *vma = pvmw->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long address = pvmw->address;
flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
- /* See page_try_share_anon_rmap(): invalidate PMD first. */
- anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
- if (anon_exclusive && page_try_share_anon_rmap(page)) {
+ /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+ anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
+ if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
set_pmd_at(mm, address, pvmw->pmd, pmdval);
return -EBUSY;
}
if (pmd_dirty(pmdval))
- set_page_dirty(page);
+ folio_set_dirty(folio);
if (pmd_write(pmdval))
entry = make_writable_migration_entry(page_to_pfn(page));
else if (anon_exclusive)
if (pmd_uffd_wp(pmdval))
pmdswp = pmd_swp_mkuffd_wp(pmdswp);
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
- page_remove_rmap(page, vma, true);
- put_page(page);
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
trace_set_migration_pmd(address, pmd_val(pmdswp));
return 0;
void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
{
+ struct folio *folio = page_folio(new);
struct vm_area_struct *vma = pvmw->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long address = pvmw->address;
return;
entry = pmd_to_swp_entry(*pvmw->pmd);
- get_page(new);
+ folio_get(folio);
pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (!is_migration_entry_young(entry))
pmde = pmd_mkold(pmde);
/* NOTE: this may contain setting soft-dirty on some archs */
- if (PageDirty(new) && is_migration_entry_dirty(entry))
+ if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
pmde = pmd_mkdirty(pmde);
- if (PageAnon(new)) {
- rmap_t rmap_flags = RMAP_COMPOUND;
+ if (folio_test_anon(folio)) {
+ rmap_t rmap_flags = RMAP_NONE;
if (!is_readable_migration_entry(entry))
rmap_flags |= RMAP_EXCLUSIVE;
- page_add_anon_rmap(new, vma, haddr, rmap_flags);
+ folio_add_anon_rmap_pmd(folio, new, vma, haddr, rmap_flags);
} else {
- page_add_file_rmap(new, vma, true);
+ folio_add_file_rmap_pmd(folio, new, vma);
}
- VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new));
+ VM_BUG_ON(pmd_write(pmde) && folio_test_anon(folio) && !PageAnonExclusive(new));
set_pmd_at(mm, haddr, pvmw->pmd, pmde);
/* No need to invalidate - it was non-present before */
* The VERY common case is inode->mapping == &inode->i_data but,
* this may not be true for device special inodes.
*/
- return (struct resv_map *)(&inode->i_data)->private_data;
+ return (struct resv_map *)(&inode->i_data)->i_private_data;
}
static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
/*
* Put bootmem huge pages into the standard lists after mem_map is up.
- * Note: This only applies to gigantic (order > MAX_ORDER) pages.
+ * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
*/
static void __init gather_bootmem_prealloc(void)
{
* The number of default huge pages (for this size) could have been
* specified as the first hugetlb parameter: hugepages=X. If so,
* then default_hstate_max_huge_pages is set. If the default huge
- * page size is gigantic (> MAX_ORDER), then the pages must be
+ * page size is gigantic (> MAX_PAGE_ORDER), then the pages must be
* allocated here from bootmem allocator.
*/
if (default_hstate_max_huge_pages) {
pte_t newpte = make_huge_pte(vma, &new_folio->page, 1);
__folio_mark_uptodate(new_folio);
- hugepage_add_new_anon_rmap(new_folio, vma, addr);
+ hugetlb_add_new_anon_rmap(new_folio, vma, addr);
if (userfaultfd_wp(vma) && huge_pte_uffd_wp(old))
newpte = huge_pte_mkuffd_wp(newpte);
set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz);
* sleep during the process.
*/
if (!folio_test_anon(pte_folio)) {
- page_dup_file_rmap(&pte_folio->page, true);
- } else if (page_try_dup_anon_rmap(&pte_folio->page,
- true, src_vma)) {
+ hugetlb_add_file_rmap(pte_folio);
+ } else if (hugetlb_try_dup_anon_rmap(pte_folio, src_vma)) {
pte_t src_pte_old = entry;
struct folio *new_folio;
make_pte_marker(PTE_MARKER_UFFD_WP),
sz);
hugetlb_count_sub(pages_per_huge_page(h), mm);
- page_remove_rmap(page, vma, true);
+ hugetlb_remove_rmap(page_folio(page));
spin_unlock(ptl);
tlb_remove_page_size(tlb, page, huge_page_size(h));
/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
- page_remove_rmap(&old_folio->page, vma, true);
- hugepage_add_new_anon_rmap(new_folio, vma, haddr);
+ hugetlb_remove_rmap(old_folio);
+ hugetlb_add_new_anon_rmap(new_folio, vma, haddr);
if (huge_pte_uffd_wp(pte))
newpte = huge_pte_mkuffd_wp(newpte);
set_huge_pte_at(mm, haddr, ptep, newpte, huge_page_size(h));
goto backout;
if (anon_rmap)
- hugepage_add_new_anon_rmap(folio, vma, haddr);
+ hugetlb_add_new_anon_rmap(folio, vma, haddr);
else
- page_dup_file_rmap(&folio->page, true);
+ hugetlb_add_file_rmap(folio);
new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
/*
goto out_release_unlock;
if (folio_in_pagecache)
- page_dup_file_rmap(&folio->page, true);
+ hugetlb_add_file_rmap(folio);
else
- hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);
+ hugetlb_add_new_anon_rmap(folio, dst_vma, dst_addr);
/*
* For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/sched.h>
+ #include <linux/sched/clock.h>
#include <linux/sched/task_stack.h>
#include <linux/slab.h>
+ #include <linux/stackdepot.h>
#include <linux/stacktrace.h>
#include <linux/string.h>
#include <linux/types.h>
return NULL;
}
- depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
+ depot_stack_handle_t kasan_save_stack(gfp_t flags, depot_flags_t depot_flags)
{
unsigned long entries[KASAN_STACK_DEPTH];
unsigned int nr_entries;
nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
- return __stack_depot_save(entries, nr_entries, flags, can_alloc);
+ return stack_depot_save_flags(entries, nr_entries, flags, depot_flags);
}
- void kasan_set_track(struct kasan_track *track, gfp_t flags)
+ void kasan_set_track(struct kasan_track *track, depot_stack_handle_t stack)
{
+ #ifdef CONFIG_KASAN_EXTRA_INFO
+ u32 cpu = raw_smp_processor_id();
+ u64 ts_nsec = local_clock();
+
+ track->cpu = cpu;
+ track->timestamp = ts_nsec >> 3;
+ #endif /* CONFIG_KASAN_EXTRA_INFO */
track->pid = current->pid;
- track->stack = kasan_save_stack(flags, true);
+ track->stack = stack;
+ }
+
+ void kasan_save_track(struct kasan_track *track, gfp_t flags)
+ {
+ depot_stack_handle_t stack;
+
+ stack = kasan_save_stack(flags,
+ STACK_DEPOT_FLAG_CAN_ALLOC | STACK_DEPOT_FLAG_GET);
+ kasan_set_track(track, stack);
}
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
void __kasan_unpoison_range(const void *address, size_t size)
{
+ if (is_kfence_address(address))
+ return;
+
kasan_unpoison(address, size, false);
}
KASAN_SLAB_REDZONE, false);
}
- void __kasan_unpoison_object_data(struct kmem_cache *cache, void *object)
+ void __kasan_unpoison_new_object(struct kmem_cache *cache, void *object)
{
kasan_unpoison(object, cache->object_size, false);
}
- void __kasan_poison_object_data(struct kmem_cache *cache, void *object)
+ void __kasan_poison_new_object(struct kmem_cache *cache, void *object)
{
kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE),
KASAN_SLAB_REDZONE, false);
* 2. A cache might be SLAB_TYPESAFE_BY_RCU, which means objects can be
* accessed after being freed. We preassign tags for objects in these
* caches as well.
- * 3. For SLAB allocator we can't preassign tags randomly since the freelist
- * is stored as an array of indexes instead of a linked list. Assign tags
- * based on objects indexes, so that objects that are next to each other
- * get different tags.
*/
static inline u8 assign_tag(struct kmem_cache *cache,
const void *object, bool init)
if (!cache->ctor && !(cache->flags & SLAB_TYPESAFE_BY_RCU))
return init ? KASAN_TAG_KERNEL : kasan_random_tag();
- /* For caches that either have a constructor or SLAB_TYPESAFE_BY_RCU: */
-#ifdef CONFIG_SLAB
- /* For SLAB assign tags based on the object index in the freelist. */
- return (u8)obj_to_index(cache, virt_to_slab(object), (void *)object);
-#else
/*
- * For SLUB assign a random tag during slab creation, otherwise reuse
+ * For caches that either have a constructor or SLAB_TYPESAFE_BY_RCU,
+ * assign a random tag during slab creation, otherwise reuse
* the already assigned tag.
*/
return init ? kasan_random_tag() : get_tag(object);
-#endif
}
void * __must_check __kasan_init_slab_obj(struct kmem_cache *cache,
return (void *)object;
}
- static inline bool ____kasan_slab_free(struct kmem_cache *cache, void *object,
- unsigned long ip, bool quarantine, bool init)
+ static inline bool poison_slab_object(struct kmem_cache *cache, void *object,
+ unsigned long ip, bool init)
{
void *tagged_object;
tagged_object = object;
object = kasan_reset_tag(object);
- if (is_kfence_address(object))
- return false;
-
- if (unlikely(nearest_obj(cache, virt_to_slab(object), object) !=
- object)) {
+ if (unlikely(nearest_obj(cache, virt_to_slab(object), object) != object)) {
kasan_report_invalid_free(tagged_object, ip, KASAN_REPORT_INVALID_FREE);
return true;
}
- /* RCU slabs could be legally used after free within the RCU period */
+ /* RCU slabs could be legally used after free within the RCU period. */
if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return false;
kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE),
KASAN_SLAB_FREE, init);
- if ((IS_ENABLED(CONFIG_KASAN_GENERIC) && !quarantine))
- return false;
-
if (kasan_stack_collection_enabled())
kasan_save_free_info(cache, tagged_object);
- return kasan_quarantine_put(cache, object);
+ return false;
}
bool __kasan_slab_free(struct kmem_cache *cache, void *object,
unsigned long ip, bool init)
{
- return ____kasan_slab_free(cache, object, ip, true, init);
+ if (is_kfence_address(object))
+ return false;
+
+ /*
+ * If the object is buggy, do not let slab put the object onto the
+ * freelist. The object will thus never be allocated again and its
+ * metadata will never get released.
+ */
+ if (poison_slab_object(cache, object, ip, init))
+ return true;
+
+ /*
+ * If the object is put into quarantine, do not let slab put the object
+ * onto the freelist for now. The object's metadata is kept until the
+ * object gets evicted from quarantine.
+ */
+ if (kasan_quarantine_put(cache, object))
+ return true;
+
+ /*
+ * If the object is not put into quarantine, it will likely be quickly
+ * reallocated. Thus, release its metadata now.
+ */
+ kasan_release_object_meta(cache, object);
+
+ /* Let slab put the object onto the freelist. */
+ return false;
}
- static inline bool ____kasan_kfree_large(void *ptr, unsigned long ip)
+ static inline bool check_page_allocation(void *ptr, unsigned long ip)
{
if (!kasan_arch_is_ready())
return false;
return true;
}
- /*
- * The object will be poisoned by kasan_poison_pages() or
- * kasan_slab_free_mempool().
- */
-
return false;
}
void __kasan_kfree_large(void *ptr, unsigned long ip)
{
- ____kasan_kfree_large(ptr, ip);
+ check_page_allocation(ptr, ip);
+
+ /* The object will be poisoned by kasan_poison_pages(). */
}
- void __kasan_slab_free_mempool(void *ptr, unsigned long ip)
+ static inline void unpoison_slab_object(struct kmem_cache *cache, void *object,
+ gfp_t flags, bool init)
{
- struct folio *folio;
-
- folio = virt_to_folio(ptr);
-
/*
- * Even though this function is only called for kmem_cache_alloc and
- * kmalloc backed mempool allocations, those allocations can still be
- * !PageSlab() when the size provided to kmalloc is larger than
- * KMALLOC_MAX_SIZE, and kmalloc falls back onto page_alloc.
+ * Unpoison the whole object. For kmalloc() allocations,
+ * poison_kmalloc_redzone() will do precise poisoning.
*/
- if (unlikely(!folio_test_slab(folio))) {
- if (____kasan_kfree_large(ptr, ip))
- return;
- kasan_poison(ptr, folio_size(folio), KASAN_PAGE_FREE, false);
- } else {
- struct slab *slab = folio_slab(folio);
+ kasan_unpoison(object, cache->object_size, init);
- ____kasan_slab_free(slab->slab_cache, ptr, ip, false, false);
- }
+ /* Save alloc info (if possible) for non-kmalloc() allocations. */
+ if (kasan_stack_collection_enabled() && !is_kmalloc_cache(cache))
+ kasan_save_alloc_info(cache, object, flags);
}
void * __must_check __kasan_slab_alloc(struct kmem_cache *cache,
tag = assign_tag(cache, object, false);
tagged_object = set_tag(object, tag);
- /*
- * Unpoison the whole object.
- * For kmalloc() allocations, kasan_kmalloc() will do precise poisoning.
- */
- kasan_unpoison(tagged_object, cache->object_size, init);
-
- /* Save alloc info (if possible) for non-kmalloc() allocations. */
- if (kasan_stack_collection_enabled() && !is_kmalloc_cache(cache))
- kasan_save_alloc_info(cache, tagged_object, flags);
+ /* Unpoison the object and save alloc info for non-kmalloc() allocations. */
+ unpoison_slab_object(cache, tagged_object, flags, init);
return tagged_object;
}
- static inline void *____kasan_kmalloc(struct kmem_cache *cache,
+ static inline void poison_kmalloc_redzone(struct kmem_cache *cache,
const void *object, size_t size, gfp_t flags)
{
unsigned long redzone_start;
unsigned long redzone_end;
- if (gfpflags_allow_blocking(flags))
- kasan_quarantine_reduce();
-
- if (unlikely(object == NULL))
- return NULL;
-
- if (is_kfence_address(kasan_reset_tag(object)))
- return (void *)object;
-
- /*
- * The object has already been unpoisoned by kasan_slab_alloc() for
- * kmalloc() or by kasan_krealloc() for krealloc().
- */
-
/*
* The redzone has byte-level precision for the generic mode.
* Partially poison the last object granule to cover the unaligned
if (kasan_stack_collection_enabled() && is_kmalloc_cache(cache))
kasan_save_alloc_info(cache, (void *)object, flags);
- /* Keep the tag that was set by kasan_slab_alloc(). */
- return (void *)object;
}
void * __must_check __kasan_kmalloc(struct kmem_cache *cache, const void *object,
size_t size, gfp_t flags)
{
- return ____kasan_kmalloc(cache, object, size, flags);
+ if (gfpflags_allow_blocking(flags))
+ kasan_quarantine_reduce();
+
+ if (unlikely(object == NULL))
+ return NULL;
+
+ if (is_kfence_address(object))
+ return (void *)object;
+
+ /* The object has already been unpoisoned by kasan_slab_alloc(). */
+ poison_kmalloc_redzone(cache, object, size, flags);
+
+ /* Keep the tag that was set by kasan_slab_alloc(). */
+ return (void *)object;
}
EXPORT_SYMBOL(__kasan_kmalloc);
- void * __must_check __kasan_kmalloc_large(const void *ptr, size_t size,
+ static inline void poison_kmalloc_large_redzone(const void *ptr, size_t size,
gfp_t flags)
{
unsigned long redzone_start;
unsigned long redzone_end;
- if (gfpflags_allow_blocking(flags))
- kasan_quarantine_reduce();
-
- if (unlikely(ptr == NULL))
- return NULL;
-
- /*
- * The object has already been unpoisoned by kasan_unpoison_pages() for
- * alloc_pages() or by kasan_krealloc() for krealloc().
- */
-
/*
* The redzone has byte-level precision for the generic mode.
* Partially poison the last object granule to cover the unaligned
kasan_poison_last_granule(ptr, size);
/* Poison the aligned part of the redzone. */
- redzone_start = round_up((unsigned long)(ptr + size),
- KASAN_GRANULE_SIZE);
+ redzone_start = round_up((unsigned long)(ptr + size), KASAN_GRANULE_SIZE);
redzone_end = (unsigned long)ptr + page_size(virt_to_page(ptr));
kasan_poison((void *)redzone_start, redzone_end - redzone_start,
KASAN_PAGE_REDZONE, false);
+ }
+
+ void * __must_check __kasan_kmalloc_large(const void *ptr, size_t size,
+ gfp_t flags)
+ {
+ if (gfpflags_allow_blocking(flags))
+ kasan_quarantine_reduce();
+
+ if (unlikely(ptr == NULL))
+ return NULL;
+ /* The object has already been unpoisoned by kasan_unpoison_pages(). */
+ poison_kmalloc_large_redzone(ptr, size, flags);
+
+ /* Keep the tag that was set by alloc_pages(). */
return (void *)ptr;
}
{
struct slab *slab;
+ if (gfpflags_allow_blocking(flags))
+ kasan_quarantine_reduce();
+
if (unlikely(object == ZERO_SIZE_PTR))
return (void *)object;
+ if (is_kfence_address(object))
+ return (void *)object;
+
/*
* Unpoison the object's data.
* Part of it might already have been unpoisoned, but it's unknown
/* Piggy-back on kmalloc() instrumentation to poison the redzone. */
if (unlikely(!slab))
- return __kasan_kmalloc_large(object, size, flags);
+ poison_kmalloc_large_redzone(object, size, flags);
else
- return ____kasan_kmalloc(slab->slab_cache, object, size, flags);
+ poison_kmalloc_redzone(slab->slab_cache, object, size, flags);
+
+ return (void *)object;
+ }
+
+ bool __kasan_mempool_poison_pages(struct page *page, unsigned int order,
+ unsigned long ip)
+ {
+ unsigned long *ptr;
+
+ if (unlikely(PageHighMem(page)))
+ return true;
+
+ /* Bail out if allocation was excluded due to sampling. */
+ if (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
+ page_kasan_tag(page) == KASAN_TAG_KERNEL)
+ return true;
+
+ ptr = page_address(page);
+
+ if (check_page_allocation(ptr, ip))
+ return false;
+
+ kasan_poison(ptr, PAGE_SIZE << order, KASAN_PAGE_FREE, false);
+
+ return true;
+ }
+
+ void __kasan_mempool_unpoison_pages(struct page *page, unsigned int order,
+ unsigned long ip)
+ {
+ __kasan_unpoison_pages(page, order, false);
+ }
+
+ bool __kasan_mempool_poison_object(void *ptr, unsigned long ip)
+ {
+ struct folio *folio = virt_to_folio(ptr);
+ struct slab *slab;
+
+ /*
+ * This function can be called for large kmalloc allocation that get
+ * their memory from page_alloc. Thus, the folio might not be a slab.
+ */
+ if (unlikely(!folio_test_slab(folio))) {
+ if (check_page_allocation(ptr, ip))
+ return false;
+ kasan_poison(ptr, folio_size(folio), KASAN_PAGE_FREE, false);
+ return true;
+ }
+
+ if (is_kfence_address(ptr))
+ return false;
+
+ slab = folio_slab(folio);
+ return !poison_slab_object(slab->slab_cache, ptr, ip, false);
+ }
+
+ void __kasan_mempool_unpoison_object(void *ptr, size_t size, unsigned long ip)
+ {
+ struct slab *slab;
+ gfp_t flags = 0; /* Might be executing under a lock. */
+
+ slab = virt_to_slab(ptr);
+
+ /*
+ * This function can be called for large kmalloc allocation that get
+ * their memory from page_alloc.
+ */
+ if (unlikely(!slab)) {
+ kasan_unpoison(ptr, size, false);
+ poison_kmalloc_large_redzone(ptr, size, flags);
+ return;
+ }
+
+ if (is_kfence_address(ptr))
+ return;
+
+ /* Unpoison the object and save alloc info for non-kmalloc() allocations. */
+ unpoison_slab_object(slab->slab_cache, ptr, size, flags);
+
+ /* Poison the redzone and save alloc info for kmalloc() allocations. */
+ if (is_kmalloc_cache(slab->slab_cache))
+ poison_kmalloc_redzone(slab->slab_cache, ptr, size, flags);
}
bool __kasan_check_byte(const void *address, unsigned long ip)
#include <linux/kasan.h>
#include <linux/kasan-tags.h>
#include <linux/kfence.h>
+ #include <linux/spinlock.h>
#include <linux/stackdepot.h>
#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
static inline bool kasan_vmalloc_enabled(void)
{
+ /* Static branch is never enabled with CONFIG_KASAN_VMALLOC disabled. */
return static_branch_likely(&kasan_flag_vmalloc);
}
#else /* CONFIG_KASAN_HW_TAGS */
+ static inline bool kasan_vmalloc_enabled(void)
+ {
+ return IS_ENABLED(CONFIG_KASAN_VMALLOC);
+ }
+
static inline bool kasan_async_fault_possible(void)
{
return false;
#ifdef CONFIG_KASAN_GENERIC
- /* Generic KASAN uses per-object metadata to store stack traces. */
+ /*
+ * Generic KASAN uses per-object metadata to store alloc and free stack traces
+ * and the quarantine link.
+ */
static inline bool kasan_requires_meta(void)
{
- /*
- * Technically, Generic KASAN always collects stack traces right now.
- * However, let's use kasan_stack_collection_enabled() in case the
- * kasan.stacktrace command-line argument is changed to affect
- * Generic KASAN.
- */
- return kasan_stack_collection_enabled();
+ return true;
}
#else /* CONFIG_KASAN_GENERIC */
- /* Tag-based KASAN modes do not use per-object metadata. */
+ /*
+ * Tag-based KASAN modes do not use per-object metadata: they use the stack
+ * ring to store alloc and free stack traces and do not use qurantine.
+ */
static inline bool kasan_requires_meta(void)
{
return false;
#ifdef CONFIG_KASAN_GENERIC
- #define KASAN_SLAB_FREETRACK 0xFA /* freed slab object with free track */
+ #define KASAN_SLAB_FREE_META 0xFA /* freed slab object with free meta */
#define KASAN_GLOBAL_REDZONE 0xF9 /* redzone for global variable */
/* Stack redzone shadow values. Compiler ABI, do not change. */
struct kasan_track {
u32 pid;
depot_stack_handle_t stack;
+ #ifdef CONFIG_KASAN_EXTRA_INFO
+ u64 cpu:20;
+ u64 timestamp:44;
+ #endif /* CONFIG_KASAN_EXTRA_INFO */
};
enum kasan_report_type {
#ifdef CONFIG_KASAN_GENERIC
+ /*
+ * Alloc meta contains the allocation-related information about a slab object.
+ * Alloc meta is saved when an object is allocated and is kept until either the
+ * object returns to the slab freelist (leaves quarantine for quarantined
+ * objects or gets freed for the non-quarantined ones) or reallocated via
+ * krealloc or through a mempool.
+ * Alloc meta is stored inside of the object's redzone.
+ * Alloc meta is considered valid whenever it contains non-zero data.
+ */
struct kasan_alloc_meta {
struct kasan_track alloc_track;
/* Free track is stored in kasan_free_meta. */
+ /*
+ * aux_lock protects aux_stack from accesses from concurrent
+ * kasan_record_aux_stack calls. It is a raw spinlock to avoid sleeping
+ * on RT kernels, as kasan_record_aux_stack_noalloc can be called from
+ * non-sleepable contexts.
+ */
+ raw_spinlock_t aux_lock;
depot_stack_handle_t aux_stack[2];
};
#define KASAN_NO_FREE_META INT_MAX
/*
- * Free meta is only used by Generic mode while the object is in quarantine.
- * After that, slab allocator stores the freelist pointer in the object.
+ * Free meta contains the freeing-related information about a slab object.
+ * Free meta is only kept for quarantined objects and for mempool objects until
+ * the object gets allocated again.
+ * Free meta is stored within the object's memory.
+ * Free meta is considered valid whenever the value of the shadow byte that
+ * corresponds to the first 8 bytes of the object is KASAN_SLAB_FREE_META.
*/
struct kasan_free_meta {
struct qlist_node quarantine_link;
struct kasan_stack_ring_entry {
void *ptr;
size_t size;
- u32 pid;
- depot_stack_handle_t stack;
+ struct kasan_track track;
bool is_free;
};
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
+ static __always_inline bool addr_in_shadow(const void *addr)
+ {
+ return addr >= (void *)KASAN_SHADOW_START &&
+ addr < (void *)KASAN_SHADOW_END;
+ }
+
#ifndef kasan_shadow_to_mem
static inline const void *kasan_shadow_to_mem(const void *shadow_addr)
{
struct slab *kasan_addr_to_slab(const void *addr);
#ifdef CONFIG_KASAN_GENERIC
- void kasan_init_cache_meta(struct kmem_cache *cache, unsigned int *size);
- void kasan_init_object_meta(struct kmem_cache *cache, const void *object);
struct kasan_alloc_meta *kasan_get_alloc_meta(struct kmem_cache *cache,
const void *object);
struct kasan_free_meta *kasan_get_free_meta(struct kmem_cache *cache,
const void *object);
+ void kasan_init_object_meta(struct kmem_cache *cache, const void *object);
+ void kasan_release_object_meta(struct kmem_cache *cache, const void *object);
#else
- static inline void kasan_init_cache_meta(struct kmem_cache *cache, unsigned int *size) { }
static inline void kasan_init_object_meta(struct kmem_cache *cache, const void *object) { }
+ static inline void kasan_release_object_meta(struct kmem_cache *cache, const void *object) { }
#endif
- depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc);
- void kasan_set_track(struct kasan_track *track, gfp_t flags);
+ depot_stack_handle_t kasan_save_stack(gfp_t flags, depot_flags_t depot_flags);
+ void kasan_set_track(struct kasan_track *track, depot_stack_handle_t stack);
+ void kasan_save_track(struct kasan_track *track, gfp_t flags);
void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags);
void kasan_save_free_info(struct kmem_cache *cache, void *object);
-#if defined(CONFIG_KASAN_GENERIC) && \
- (defined(CONFIG_SLAB) || defined(CONFIG_SLUB))
+#ifdef CONFIG_KASAN_GENERIC
bool kasan_quarantine_put(struct kmem_cache *cache, void *object);
void kasan_quarantine_reduce(void);
void kasan_quarantine_remove_cache(struct kmem_cache *cache);
static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
{
- addr = kasan_reset_tag(addr);
-
- /* Skip KFENCE memory if called explicitly outside of sl*b. */
- if (is_kfence_address(addr))
- return;
-
if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
return;
if (WARN_ON(size & KASAN_GRANULE_MASK))
return;
- hw_set_mem_tag_range((void *)addr, size, value, init);
+ hw_set_mem_tag_range(kasan_reset_tag(addr), size, value, init);
}
static inline void kasan_unpoison(const void *addr, size_t size, bool init)
{
u8 tag = get_tag(addr);
- addr = kasan_reset_tag(addr);
-
- /* Skip KFENCE memory if called explicitly outside of sl*b. */
- if (is_kfence_address(addr))
- return;
-
if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
return;
size = round_up(size, KASAN_GRANULE_SIZE);
- hw_set_mem_tag_range((void *)addr, size, tag, init);
+ hw_set_mem_tag_range(kasan_reset_tag(addr), size, tag, init);
}
static inline bool kasan_byte_accessible(const void *addr)
* @size - range size, must be aligned to KASAN_GRANULE_SIZE
* @value - value that's written to metadata for the range
* @init - whether to initialize the memory range (only for hardware tag-based)
- *
- * The size gets aligned to KASAN_GRANULE_SIZE before marking the range.
*/
void kasan_poison(const void *addr, size_t size, u8 value, bool init);
static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache)
{
void *object = qlink_to_object(qlink, cache);
- struct kasan_free_meta *meta = kasan_get_free_meta(cache, object);
+ struct kasan_free_meta *free_meta = kasan_get_free_meta(cache, object);
- unsigned long flags;
+
+ kasan_release_object_meta(cache, object);
/*
* If init_on_free is enabled and KASAN's free metadata is stored in
*/
if (slab_want_init_on_free(cache) &&
cache->kasan_info.free_meta_offset == 0)
- memzero_explicit(meta, sizeof(*meta));
-
- /*
- * As the object now gets freed from the quarantine, assume that its
- * free track is no longer valid.
- */
- *(u8 *)kasan_mem_to_shadow(object) = KASAN_SLAB_FREE;
+ memzero_explicit(free_meta, sizeof(*free_meta));
- if (IS_ENABLED(CONFIG_SLAB))
- local_irq_save(flags);
-
___cache_free(cache, object, _THIS_IP_);
-
- if (IS_ENABLED(CONFIG_SLAB))
- local_irq_restore(flags);
}
static void qlist_free_all(struct qlist_head *q, struct kmem_cache *cache)
#include <linux/stacktrace.h>
#include <linux/string.h>
#include <linux/types.h>
+#include <linux/vmalloc.h>
#include <linux/kasan.h>
#include <linux/module.h>
#include <linux/sched/task_stack.h>
static void print_track(struct kasan_track *track, const char *prefix)
{
+ #ifdef CONFIG_KASAN_EXTRA_INFO
+ u64 ts_nsec = track->timestamp;
+ unsigned long rem_usec;
+
+ ts_nsec <<= 3;
+ rem_usec = do_div(ts_nsec, NSEC_PER_SEC) / 1000;
+
+ pr_err("%s by task %u on cpu %d at %lu.%06lus:\n",
+ prefix, track->pid, track->cpu,
+ (unsigned long)ts_nsec, rem_usec);
+ #else
pr_err("%s by task %u:\n", prefix, track->pid);
+ #endif /* CONFIG_KASAN_EXTRA_INFO */
if (track->stack)
stack_depot_print(track->stack);
else
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
/*
- * With CONFIG_KASAN_INLINE, accesses to bogus pointers (outside the high
- * canonical half of the address space) cause out-of-bounds shadow memory reads
- * before the actual access. For addresses in the low canonical half of the
- * address space, as well as most non-canonical addresses, that out-of-bounds
- * shadow memory access lands in the non-canonical part of the address space.
- * Help the user figure out what the original bogus pointer was.
+ * With compiler-based KASAN modes, accesses to bogus pointers (outside of the
+ * mapped kernel address space regions) cause faults when KASAN tries to check
+ * the shadow memory before the actual memory access. This results in cryptic
+ * GPF reports, which are hard for users to interpret. This hook helps users to
+ * figure out what the original bogus pointer was.
*/
void kasan_non_canonical_hook(unsigned long addr)
{
unsigned long orig_addr;
const char *bug_type;
+ /*
+ * All addresses that came as a result of the memory-to-shadow mapping
+ * (even for bogus pointers) must be >= KASAN_SHADOW_OFFSET.
+ */
if (addr < KASAN_SHADOW_OFFSET)
return;
- orig_addr = (addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
+ orig_addr = (unsigned long)kasan_shadow_to_mem((void *)addr);
+
/*
* For faults near the shadow address for NULL, we can be fairly certain
* that this is a KASAN shadow memory access.
- * For faults that correspond to shadow for low canonical addresses, we
- * can still be pretty sure - that shadow region is a fairly narrow
- * chunk of the non-canonical address space.
- * But faults that look like shadow for non-canonical addresses are a
- * really large chunk of the address space. In that case, we still
- * print the decoded address, but make it clear that this is not
- * necessarily what's actually going on.
+ * For faults that correspond to the shadow for low or high canonical
+ * addresses, we can still be pretty sure: these shadow regions are a
+ * fairly narrow chunk of the address space.
+ * But the shadow for non-canonical addresses is a really large chunk
+ * of the address space. For this case, we still print the decoded
+ * address, but make it clear that this is not necessarily what's
+ * actually going on.
*/
if (orig_addr < PAGE_SIZE)
bug_type = "null-ptr-deref";
else if (orig_addr < TASK_SIZE)
bug_type = "probably user-memory-access";
+ else if (addr_in_shadow((void *)addr))
+ bug_type = "probably wild-memory-access";
else
bug_type = "maybe wild-memory-access";
pr_alert("KASAN: %s in range [0x%016lx-0x%016lx]\n", bug_type,
#include <linux/psi.h>
#include <linux/seq_buf.h>
#include <linux/sched/isolation.h>
+#include <linux/kmemleak.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
return mz;
}
- /*
- * memcg and lruvec stats flushing
- *
- * Many codepaths leading to stats update or read are performance sensitive and
- * adding stats flushing in such codepaths is not desirable. So, to optimize the
- * flushing the kernel does:
- *
- * 1) Periodically and asynchronously flush the stats every 2 seconds to not let
- * rstat update tree grow unbounded.
- *
- * 2) Flush the stats synchronously on reader side only when there are more than
- * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
- * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
- * only for 2 seconds due to (1).
- */
- static void flush_memcg_stats_dwork(struct work_struct *w);
- static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
- static DEFINE_PER_CPU(unsigned int, stats_updates);
- static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
- static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
- static u64 flush_next_time;
-
- #define FLUSH_TIME (2UL*HZ)
-
- /*
- * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
- * not rely on this as part of an acquired spinlock_t lock. These functions are
- * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion
- * is sufficient.
- */
- static void memcg_stats_lock(void)
- {
- preempt_disable_nested();
- VM_WARN_ON_IRQS_ENABLED();
- }
-
- static void __memcg_stats_lock(void)
- {
- preempt_disable_nested();
- }
-
- static void memcg_stats_unlock(void)
- {
- preempt_enable_nested();
- }
-
- static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
- {
- unsigned int x;
-
- if (!val)
- return;
-
- cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
-
- x = __this_cpu_add_return(stats_updates, abs(val));
- if (x > MEMCG_CHARGE_BATCH) {
- /*
- * If stats_flush_threshold exceeds the threshold
- * (>num_online_cpus()), cgroup stats update will be triggered
- * in __mem_cgroup_flush_stats(). Increasing this var further
- * is redundant and simply adds overhead in atomic update.
- */
- if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
- atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
- __this_cpu_write(stats_updates, 0);
- }
- }
-
- static void do_flush_stats(void)
- {
- /*
- * We always flush the entire tree, so concurrent flushers can just
- * skip. This avoids a thundering herd problem on the rstat global lock
- * from memcg flushers (e.g. reclaim, refault, etc).
- */
- if (atomic_read(&stats_flush_ongoing) ||
- atomic_xchg(&stats_flush_ongoing, 1))
- return;
-
- WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
-
- cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
-
- atomic_set(&stats_flush_threshold, 0);
- atomic_set(&stats_flush_ongoing, 0);
- }
-
- void mem_cgroup_flush_stats(void)
- {
- if (atomic_read(&stats_flush_threshold) > num_online_cpus())
- do_flush_stats();
- }
-
- void mem_cgroup_flush_stats_ratelimited(void)
- {
- if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
- mem_cgroup_flush_stats();
- }
-
- static void flush_memcg_stats_dwork(struct work_struct *w)
- {
- /*
- * Always flush here so that flushing in latency-sensitive paths is
- * as cheap as possible.
- */
- do_flush_stats();
- queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
- }
-
/* Subset of vm_event_item to report for memcg event stats */
static const unsigned int memcg_vm_event_stat[] = {
PGPGIN,
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
ZSWPIN,
ZSWPOUT,
+ ZSWPWB,
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
THP_FAULT_ALLOC,
/* Cgroup1: threshold notifications & softlimit tree updates */
unsigned long nr_page_events;
unsigned long targets[MEM_CGROUP_NTARGETS];
+
+ /* Stats updates since the last flush */
+ unsigned int stats_updates;
};
struct memcg_vmstats {
/* Pending child counts during tree propagation */
long state_pending[MEMCG_NR_STAT];
unsigned long events_pending[NR_MEMCG_EVENTS];
+
+ /* Stats updates since the last flush */
+ atomic64_t stats_updates;
};
+ /*
+ * memcg and lruvec stats flushing
+ *
+ * Many codepaths leading to stats update or read are performance sensitive and
+ * adding stats flushing in such codepaths is not desirable. So, to optimize the
+ * flushing the kernel does:
+ *
+ * 1) Periodically and asynchronously flush the stats every 2 seconds to not let
+ * rstat update tree grow unbounded.
+ *
+ * 2) Flush the stats synchronously on reader side only when there are more than
+ * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
+ * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
+ * only for 2 seconds due to (1).
+ */
+ static void flush_memcg_stats_dwork(struct work_struct *w);
+ static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
+ static u64 flush_last_time;
+
+ #define FLUSH_TIME (2UL*HZ)
+
+ /*
+ * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
+ * not rely on this as part of an acquired spinlock_t lock. These functions are
+ * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion
+ * is sufficient.
+ */
+ static void memcg_stats_lock(void)
+ {
+ preempt_disable_nested();
+ VM_WARN_ON_IRQS_ENABLED();
+ }
+
+ static void __memcg_stats_lock(void)
+ {
+ preempt_disable_nested();
+ }
+
+ static void memcg_stats_unlock(void)
+ {
+ preempt_enable_nested();
+ }
+
+
+ static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+ {
+ return atomic64_read(&memcg->vmstats->stats_updates) >
+ MEMCG_CHARGE_BATCH * num_online_cpus();
+ }
+
+ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
+ {
+ int cpu = smp_processor_id();
+ unsigned int x;
+
+ if (!val)
+ return;
+
+ cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+ abs(val));
+
+ if (x < MEMCG_CHARGE_BATCH)
+ continue;
+
+ /*
+ * If @memcg is already flush-able, increasing stats_updates is
+ * redundant. Avoid the overhead of the atomic update.
+ */
+ if (!memcg_should_flush_stats(memcg))
+ atomic64_add(x, &memcg->vmstats->stats_updates);
+ __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+ }
+ }
+
+ static void do_flush_stats(struct mem_cgroup *memcg)
+ {
+ if (mem_cgroup_is_root(memcg))
+ WRITE_ONCE(flush_last_time, jiffies_64);
+
+ cgroup_rstat_flush(memcg->css.cgroup);
+ }
+
+ /*
+ * mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
+ * @memcg: root of the subtree to flush
+ *
+ * Flushing is serialized by the underlying global rstat lock. There is also a
+ * minimum amount of work to be done even if there are no stat updates to flush.
+ * Hence, we only flush the stats if the updates delta exceeds a threshold. This
+ * avoids unnecessary work and contention on the underlying lock.
+ */
+ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
+ {
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!memcg)
+ memcg = root_mem_cgroup;
+
+ if (memcg_should_flush_stats(memcg))
+ do_flush_stats(memcg);
+ }
+
+ void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
+ {
+ /* Only flush if the periodic flusher is one full cycle late */
+ if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
+ mem_cgroup_flush_stats(memcg);
+ }
+
+ static void flush_memcg_stats_dwork(struct work_struct *w)
+ {
+ /*
+ * Deliberately ignore memcg_should_flush_stats() here so that flushing
+ * in latency-sensitive paths is as cheap as possible.
+ */
+ do_flush_stats(root_mem_cgroup);
+ queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
+ }
+
unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
long x = READ_ONCE(memcg->vmstats->state[idx]);
__mod_memcg_lruvec_state(lruvec, idx, val);
}
- void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
+ void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
int val)
{
- struct page *head = compound_head(page); /* rmap on tail pages */
struct mem_cgroup *memcg;
- pg_data_t *pgdat = page_pgdat(page);
+ pg_data_t *pgdat = folio_pgdat(folio);
struct lruvec *lruvec;
rcu_read_lock();
- memcg = page_memcg(head);
+ memcg = folio_memcg(folio);
/* Untracked pages have no memcg, no lruvec. Update only the node */
if (!memcg) {
rcu_read_unlock();
__mod_lruvec_state(lruvec, idx, val);
rcu_read_unlock();
}
- EXPORT_SYMBOL(__mod_lruvec_page_state);
+ EXPORT_SYMBOL(__lruvec_stat_mod_folio);
void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
{
*
* Current memory state:
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
u64 size;
int nid;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
seq_printf(m, "%s=%lu", stat->name,
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
unsigned long nr;
* only one element of the array here.
*/
for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
- eventfd_signal(t->entries[i].eventfd, 1);
+ eventfd_signal(t->entries[i].eventfd);
/* i = current_threshold + 1 */
i++;
* only one element of the array here.
*/
for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
- eventfd_signal(t->entries[i].eventfd, 1);
+ eventfd_signal(t->entries[i].eventfd);
/* Update current_threshold */
t->current_threshold = i - 1;
spin_lock(&memcg_oom_lock);
list_for_each_entry(ev, &memcg->oom_notify, list)
- eventfd_signal(ev->eventfd, 1);
+ eventfd_signal(ev->eventfd);
spin_unlock(&memcg_oom_lock);
return 0;
/* already in OOM ? */
if (memcg->under_oom)
- eventfd_signal(eventfd, 1);
+ eventfd_signal(eventfd);
spin_unlock(&memcg_oom_lock);
return 0;
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
event->unregister_event(memcg, event->eventfd);
/* Notify userspace the event is going away. */
- eventfd_signal(event->eventfd, 1);
+ eventfd_signal(event->eventfd);
eventfd_ctx_put(event->eventfd);
kfree(event);
return ret;
}
-#if defined(CONFIG_MEMCG_KMEM) && (defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG))
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
static int mem_cgroup_slab_show(struct seq_file *m, void *p)
{
/*
.write = mem_cgroup_reset,
.read_u64 = mem_cgroup_read_u64,
},
-#if defined(CONFIG_MEMCG_KMEM) && \
- (defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG))
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
{
.name = "kmem.slabinfo",
.seq_show = mem_cgroup_slab_show,
WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
memcg->zswap_max = PAGE_COUNTER_MAX;
+ WRITE_ONCE(memcg->zswap_writeback,
+ !parent || READ_ONCE(parent->zswap_writeback));
#endif
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
if (parent) {
page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
+ zswap_memcg_offline_cleanup(memcg);
+
memcg_offline_kmem(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
}
}
}
+ statc->stats_updates = 0;
+ /* We are in a per-cpu loop here, only do the atomic write once */
+ if (atomic64_read(&memcg->vmstats->stats_updates))
+ atomic64_set(&memcg->vmstats->stats_updates, 0);
}
#ifdef CONFIG_MMU
return nbytes;
}
+ /*
+ * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
+ * if any new events become available.
+ */
static void __memory_events_show(struct seq_file *m, atomic_long_t *events)
{
seq_printf(m, "low %lu\n", atomic_long_read(&events[MEMCG_LOW]));
int i;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
int nid;
/* Transfer the charge and the css ref */
commit_charge(new, memcg);
+ /*
+ * If the old folio is a large folio and is in the split queue, it needs
+ * to be removed from the split queue now, in case getting an incorrect
+ * split queue in destroy_large_folio() after the memcg of the old folio
+ * is cleared.
+ *
+ * In addition, the old folio is about to be freed after migration, so
+ * removing from the split queue a bit earlier seems reasonable.
+ */
+ if (folio_test_large(old) && folio_test_large_rmappable(old))
+ folio_undo_large_rmappable(old);
old->memcg_data = 0;
}
break;
}
- cgroup_rstat_flush(memcg->css.cgroup);
+ /*
+ * mem_cgroup_flush_stats() ignores small changes. Use
+ * do_flush_stats() directly to get accurate stats for charging.
+ */
+ do_flush_stats(memcg);
pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
if (pages < max)
continue;
rcu_read_unlock();
}
+ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
+ {
+ /* if zswap is disabled, do not block pages going to the swapping device */
+ return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback);
+ }
+
static u64 zswap_current_read(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- cgroup_rstat_flush(css->cgroup);
- return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ mem_cgroup_flush_stats(memcg);
+ return memcg_page_state(memcg, MEMCG_ZSWAP_B);
}
static int zswap_max_show(struct seq_file *m, void *v)
return nbytes;
}
+ static int zswap_writeback_show(struct seq_file *m, void *v)
+ {
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
+ return 0;
+ }
+
+ static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+ {
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int zswap_writeback;
+ ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
+
+ if (parse_ret)
+ return parse_ret;
+
+ if (zswap_writeback != 0 && zswap_writeback != 1)
+ return -EINVAL;
+
+ WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
+ return nbytes;
+ }
+
static struct cftype zswap_files[] = {
{
.name = "zswap.current",
.seq_show = zswap_max_show,
.write = zswap_max_write,
},
+ {
+ .name = "zswap.writeback",
+ .seq_show = zswap_writeback_show,
+ .write = zswap_writeback_write,
+ },
{ } /* terminate */
};
#endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
/*
* A number of key systems in x86 including ioremap() rely on the assumption
* that high_memory defines the upper bound on direct map memory, then end
- * of ZONE_NORMAL. Under CONFIG_DISCONTIG this means that max_low_pfn and
- * highstart_pfn must be the same; there must be no gap between ZONE_NORMAL
- * and ZONE_HIGHMEM.
+ * of ZONE_NORMAL.
*/
void *high_memory;
EXPORT_SYMBOL(high_memory);
* be 0. This will underflow and is okay.
*/
next = mas_find(mas, ceiling - 1);
+ if (unlikely(xa_is_zero(next)))
+ next = NULL;
/*
* Hide vma from rmap and truncate_pagecache before freeing
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(mas, ceiling - 1);
+ if (unlikely(xa_is_zero(next)))
+ next = NULL;
if (mm_wr_locked)
vma_start_write(vma);
unlink_anon_vmas(vma);
struct page *page, unsigned long address,
pte_t *ptep)
{
+ struct folio *folio = page_folio(page);
pte_t orig_pte;
pte_t pte;
swp_entry_t entry;
else if (is_writable_device_exclusive_entry(entry))
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page)));
+ VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
+ PageAnonExclusive(page)), folio);
/*
* No need to take a page reference as one was already
* created when the swap entry was made.
*/
- if (PageAnon(page))
- page_add_anon_rmap(page, vma, address, RMAP_NONE);
+ if (folio_test_anon(folio))
+ folio_add_anon_rmap_pte(folio, page, vma, address, RMAP_NONE);
else
/*
* Currently device exclusive access only supports anonymous
unsigned long vm_flags = dst_vma->vm_flags;
pte_t orig_pte = ptep_get(src_pte);
pte_t pte = orig_pte;
+ struct folio *folio;
struct page *page;
swp_entry_t entry = pte_to_swp_entry(orig_pte);
}
} else if (is_device_private_entry(entry)) {
page = pfn_swap_entry_to_page(entry);
+ folio = page_folio(page);
/*
* Update rss count even for unaddressable pages, as
* for unaddressable pages, at some point. But for now
* keep things as they are.
*/
- get_page(page);
+ folio_get(folio);
rss[mm_counter(page)]++;
/* Cannot fail as these pages cannot get pinned. */
- BUG_ON(page_try_dup_anon_rmap(page, false, src_vma));
+ folio_try_dup_anon_rmap_pte(folio, page, src_vma);
/*
* We do not preserve soft-dirty information, because so
* future.
*/
folio_get(folio);
- if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
+ if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
/* Page may be pinned, we have to copy. */
folio_put(folio);
return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
rss[MM_ANONPAGES]++;
} else if (page) {
folio_get(folio);
- page_dup_file_rmap(page, false);
+ folio_dup_file_rmap_pte(folio, page);
rss[mm_counter_file(page)]++;
}
return 0;
}
- static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
- struct vm_area_struct *vma, unsigned long addr)
+ static inline struct folio *folio_prealloc(struct mm_struct *src_mm,
+ struct vm_area_struct *vma, unsigned long addr, bool need_zero)
{
struct folio *new_folio;
- new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
+ if (need_zero)
+ new_folio = vma_alloc_zeroed_movable_folio(vma, addr);
+ else
+ new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
+ addr, false);
+
if (!new_folio)
return NULL;
} else if (ret == -EBUSY) {
goto out;
} else if (ret == -EAGAIN) {
- prealloc = page_copy_prealloc(src_mm, src_vma, addr);
+ prealloc = folio_prealloc(src_mm, src_vma, addr, false);
if (!prealloc)
return -ENOMEM;
} else if (ret) {
arch_enter_lazy_mmu_mode();
do {
pte_t ptent = ptep_get(pte);
+ struct folio *folio;
struct page *page;
if (pte_none(ptent))
continue;
}
+ folio = page_folio(page);
delay_rmap = 0;
- if (!PageAnon(page)) {
+ if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
- set_page_dirty(page);
+ folio_set_dirty(folio);
if (tlb_delay_rmap(tlb)) {
delay_rmap = 1;
force_flush = 1;
}
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
- mark_page_accessed(page);
+ folio_mark_accessed(folio);
}
rss[mm_counter(page)]--;
if (!delay_rmap) {
- page_remove_rmap(page, vma, false);
+ folio_remove_rmap_pte(folio, page, vma);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
}
if (is_device_private_entry(entry) ||
is_device_exclusive_entry(entry)) {
page = pfn_swap_entry_to_page(entry);
+ folio = page_folio(page);
if (unlikely(!should_zap_page(details, page)))
continue;
/*
WARN_ON_ONCE(!vma_is_anonymous(vma));
rss[mm_counter(page)]--;
if (is_device_private_entry(entry))
- page_remove_rmap(page, vma, false);
- put_page(page);
+ folio_remove_rmap_pte(folio, page, vma);
+ folio_put(folio);
} else if (!non_swap_entry(entry)) {
/* Genuine swap entry, hence a private anon page */
if (!should_zap_cows(details))
unmap_single_vma(tlb, vma, start, end, &details,
mm_wr_locked);
hugetlb_zap_end(vma, &details);
- } while ((vma = mas_find(mas, tree_end - 1)) != NULL);
+ vma = mas_find(mas, tree_end - 1);
+ } while (vma && likely(!xa_is_zero(vma)));
mmu_notifier_invalidate_range_end(&range);
}
static int validate_page_before_insert(struct page *page)
{
- if (PageAnon(page) || PageSlab(page) || page_has_type(page))
+ struct folio *folio = page_folio(page);
+
+ if (folio_test_anon(folio) || folio_test_slab(folio) ||
+ page_has_type(page))
return -EINVAL;
- flush_dcache_page(page);
+ flush_dcache_folio(folio);
return 0;
}
static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
+ struct folio *folio = page_folio(page);
+
if (!pte_none(ptep_get(pte)))
return -EBUSY;
/* Ok, finally just insert the thing.. */
- get_page(page);
+ folio_get(folio);
inc_mm_counter(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, vma, false);
+ folio_add_file_rmap_pte(folio, page, vma);
set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
return 0;
}
* just copying from the original user address. If that
* fails, we just zero-fill it. Live with it.
*/
- kaddr = kmap_atomic(dst);
+ kaddr = kmap_local_page(dst);
+ pagefault_disable();
uaddr = (void __user *)(addr & PAGE_MASK);
/*
pte_unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
- kunmap_atomic(kaddr);
+ pagefault_enable();
+ kunmap_local(kaddr);
flush_dcache_page(dst);
return ret;
int page_copied = 0;
struct mmu_notifier_range range;
vm_fault_t ret;
+ bool pfn_is_zero;
delayacct_wpcopy_start();
if (unlikely(ret))
goto out;
- if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
- new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
- if (!new_folio)
- goto oom;
- } else {
+ pfn_is_zero = is_zero_pfn(pte_pfn(vmf->orig_pte));
+ new_folio = folio_prealloc(mm, vma, vmf->address, pfn_is_zero);
+ if (!new_folio)
+ goto oom;
+
+ if (!pfn_is_zero) {
int err;
- new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
- vmf->address, false);
- if (!new_folio)
- goto oom;
err = __wp_page_copy_user(&new_folio->page, vmf->page, vmf);
if (err) {
kmsan_copy_page_meta(&new_folio->page, vmf->page);
}
- if (mem_cgroup_charge(new_folio, mm, GFP_KERNEL))
- goto oom_free_new;
- folio_throttle_swaprate(new_folio, GFP_KERNEL);
-
__folio_mark_uptodate(new_folio);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
* threads.
*
* The critical issue is to order this
- * page_remove_rmap with the ptp_clear_flush above.
- * Those stores are ordered by (if nothing else,)
+ * folio_remove_rmap_pte() with the ptp_clear_flush
+ * above. Those stores are ordered by (if nothing else,)
* the barrier present in the atomic_add_negative
- * in page_remove_rmap.
+ * in folio_remove_rmap_pte();
*
* Then the TLB flush in ptep_clear_flush ensures that
* no process can access the old page before the
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(vmf->page, vma, false);
+ folio_remove_rmap_pte(old_folio, vmf->page, vma);
}
/* Free the old page.. */
delayacct_wpcopy_end();
return 0;
- oom_free_new:
- folio_put(new_folio);
oom:
ret = VM_FAULT_OOM;
out:
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows)
{
- pgoff_t hba = holebegin >> PAGE_SHIFT;
- pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ pgoff_t hba = (pgoff_t)(holebegin) >> PAGE_SHIFT;
+ pgoff_t hlen = ((pgoff_t)(holelen) + PAGE_SIZE - 1) >> PAGE_SHIFT;
/* Check for overflow. */
if (sizeof(holelen) > sizeof(hlen)) {
folio_add_lru(folio);
- /* To provide entry to swap_readpage() */
+ /* To provide entry to swap_read_folio() */
folio->swap = entry;
- swap_readpage(page, true, NULL);
+ swap_read_folio(folio, true, NULL);
folio->private = NULL;
}
} else {
* page->index of !PageKSM() pages would be nonlinear inside the
* anon VMA -- PageKSM() is lost on actual swapout.
*/
- page = ksm_might_need_to_copy(page, vma, vmf->address);
- if (unlikely(!page)) {
+ folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+ if (unlikely(!folio)) {
ret = VM_FAULT_OOM;
+ folio = swapcache;
goto out_page;
- } else if (unlikely(PTR_ERR(page) == -EHWPOISON)) {
+ } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
ret = VM_FAULT_HWPOISON;
+ folio = swapcache;
goto out_page;
}
- folio = page_folio(page);
+ if (folio != swapcache)
+ page = folio_page(folio, 0);
/*
* If we want to map a page that's in the swapcache writable, we
/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- page_add_new_anon_rmap(page, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, vmf->address);
folio_add_lru_vma(folio, vma);
} else {
- page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
+ folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ rmap_flags);
}
VM_BUG_ON(!folio_test_anon(folio) ||
return ret;
}
+ static bool pte_range_none(pte_t *pte, int nr_pages)
+ {
+ int i;
+
+ for (i = 0; i < nr_pages; i++) {
+ if (!pte_none(ptep_get_lockless(pte + i)))
+ return false;
+ }
+
+ return true;
+ }
+
+ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+ {
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long orders;
+ struct folio *folio;
+ unsigned long addr;
+ pte_t *pte;
+ gfp_t gfp;
+ int order;
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (unlikely(userfaultfd_armed(vma)))
+ goto fallback;
+
+ /*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+ BIT(PMD_ORDER) - 1);
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+
+ if (!orders)
+ goto fallback;
+
+ pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+ if (!pte)
+ return ERR_PTR(-EAGAIN);
+
+ /*
+ * Find the highest order where the aligned range is completely
+ * pte_none(). Note that all remaining orders will be completely
+ * pte_none().
+ */
+ order = highest_order(orders);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ if (pte_range_none(pte + pte_index(addr), 1 << order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ pte_unmap(pte);
+
+ /* Try allocating the highest of the remaining orders. */
+ gfp = vma_thp_gfp_mask(vma);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vma, addr, true);
+ if (folio) {
+ clear_huge_page(&folio->page, vmf->address, 1 << order);
+ return folio;
+ }
+ order = next_order(&orders, order);
+ }
+
+ fallback:
+ #endif
+ return vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+ }
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
{
bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
struct vm_area_struct *vma = vmf->vma;
+ unsigned long addr = vmf->address;
struct folio *folio;
vm_fault_t ret = 0;
+ int nr_pages = 1;
pte_t entry;
+ int i;
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+ /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
+ folio = alloc_anon_folio(vmf);
+ if (IS_ERR(folio))
+ return 0;
if (!folio)
goto oom;
+ nr_pages = folio_nr_pages(folio);
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
folio_throttle_swaprate(folio, GFP_KERNEL);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry), vma);
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
if (!vmf->pte)
goto release;
- if (vmf_pte_changed(vmf)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if (nr_pages == 1 && vmf_pte_changed(vmf)) {
+ update_mmu_tlb(vma, addr, vmf->pte);
+ goto release;
+ } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+ for (i = 0; i < nr_pages; i++)
+ update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
goto release;
}
return handle_userfault(vmf, VM_UFFD_MISSING);
}
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ folio_add_new_anon_rmap(folio, vma, addr);
folio_add_lru_vma(folio, vma);
setpte:
if (uffd_wp)
entry = pte_mkuffd_wp(entry);
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+ set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
static vm_fault_t __do_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct folio *folio;
vm_fault_t ret;
/*
VM_FAULT_DONE_COW)))
return ret;
+ folio = page_folio(vmf->page);
if (unlikely(PageHWPoison(vmf->page))) {
- struct page *page = vmf->page;
vm_fault_t poisonret = VM_FAULT_HWPOISON;
if (ret & VM_FAULT_LOCKED) {
- if (page_mapped(page))
- unmap_mapping_pages(page_mapping(page),
- page->index, 1, false);
- /* Retry if a clean page was removed from the cache. */
- if (invalidate_inode_page(page))
+ if (page_mapped(vmf->page))
+ unmap_mapping_folio(folio);
+ /* Retry if a clean folio was removed from the cache. */
+ if (mapping_evict_folio(folio->mapping, folio))
poisonret = VM_FAULT_NOPAGE;
- unlock_page(page);
+ folio_unlock(folio);
}
- put_page(page);
+ folio_put(folio);
vmf->page = NULL;
return poisonret;
}
if (unlikely(!(ret & VM_FAULT_LOCKED)))
- lock_page(vmf->page);
+ folio_lock(folio);
else
- VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+ VM_BUG_ON_PAGE(!folio_test_locked(folio), vmf->page);
return ret;
}
vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
+ struct folio *folio = page_folio(page);
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
pmd_t entry;
vm_fault_t ret = VM_FAULT_FALLBACK;
- if (!transhuge_vma_suitable(vma, haddr))
+ if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
return ret;
- page = compound_head(page);
- if (compound_order(page) != HPAGE_PMD_ORDER)
+ if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
return ret;
/*
* check. This kind of THP just can be PTE mapped. Access to
* the corrupted subpage should trigger SIGBUS as expected.
*/
- if (unlikely(PageHasHWPoisoned(page)))
+ if (unlikely(folio_test_has_hwpoisoned(folio)))
return ret;
/*
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
- page_add_file_rmap(page, vma, true);
+ folio_add_file_rmap_pmd(folio, page, vma);
/*
* deposit and withdraw with pmd lock held
folio_add_lru_vma(folio, vma);
} else {
add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
- folio_add_file_rmap_range(folio, page, nr, vma, false);
+ folio_add_file_rmap_ptes(folio, page, nr, vma);
}
set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
static vm_fault_t do_cow_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct folio *folio;
vm_fault_t ret;
ret = vmf_can_call_fault(vmf);
if (ret)
return ret;
- vmf->cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
- if (!vmf->cow_page)
+ folio = folio_prealloc(vma->vm_mm, vma, vmf->address, false);
+ if (!folio)
return VM_FAULT_OOM;
- if (mem_cgroup_charge(page_folio(vmf->cow_page), vma->vm_mm,
- GFP_KERNEL)) {
- put_page(vmf->cow_page);
- return VM_FAULT_OOM;
- }
- folio_throttle_swaprate(page_folio(vmf->cow_page), GFP_KERNEL);
+ vmf->cow_page = &folio->page;
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
- __SetPageUptodate(vmf->cow_page);
+ __folio_mark_uptodate(folio);
ret |= finish_fault(vmf);
unlock_page(vmf->page);
goto uncharge_out;
return ret;
uncharge_out:
- put_page(vmf->cow_page);
+ folio_put(folio);
return ret;
}
return VM_FAULT_OOM;
retry_pud:
if (pud_none(*vmf.pud) &&
- hugepage_vma_check(vma, vm_flags, false, true, true)) {
+ thp_vma_allowable_order(vma, vm_flags, false, true, true, PUD_ORDER)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
goto retry_pud;
if (pmd_none(*vmf.pmd) &&
- hugepage_vma_check(vma, vm_flags, false, true, true)) {
+ thp_vma_allowable_order(vma, vm_flags, false, true, true, PMD_ORDER)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
if (bytes > PAGE_SIZE-offset)
bytes = PAGE_SIZE-offset;
- maddr = kmap(page);
+ maddr = kmap_local_page(page);
if (write) {
copy_to_user_page(vma, page, addr,
maddr + offset, buf, bytes);
copy_from_user_page(vma, page, addr,
buf, maddr + offset, bytes);
}
- kunmap(page);
- put_page(page);
+ unmap_and_put_page(page, maddr);
}
len -= bytes;
buf += bytes;
#include <linux/writeback.h>
#include "slab.h"
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB_DEBUG_ON)
+#ifdef CONFIG_SLUB_DEBUG_ON
static void poison_error(mempool_t *pool, void *element, size_t size,
size_t byte)
{
static void check_element(mempool_t *pool, void *element)
{
+ /* Skip checking: KASAN might save its metadata in the element. */
+ if (kasan_enabled())
+ return;
+
/* Mempools backed by slab allocator */
if (pool->free == mempool_kfree) {
__check_element(pool, element, (size_t)pool->pool_data);
} else if (pool->free == mempool_free_pages) {
/* Mempools backed by page allocator */
int order = (int)(long)pool->pool_data;
- void *addr = kmap_atomic((struct page *)element);
+ void *addr = kmap_local_page((struct page *)element);
__check_element(pool, addr, 1UL << (PAGE_SHIFT + order));
- kunmap_atomic(addr);
+ kunmap_local(addr);
}
}
static void poison_element(mempool_t *pool, void *element)
{
+ /* Skip poisoning: KASAN might save its metadata in the element. */
+ if (kasan_enabled())
+ return;
+
/* Mempools backed by slab allocator */
if (pool->alloc == mempool_kmalloc) {
__poison_element(element, (size_t)pool->pool_data);
} else if (pool->alloc == mempool_alloc_pages) {
/* Mempools backed by page allocator */
int order = (int)(long)pool->pool_data;
- void *addr = kmap_atomic((struct page *)element);
+ void *addr = kmap_local_page((struct page *)element);
__poison_element(addr, 1UL << (PAGE_SHIFT + order));
- kunmap_atomic(addr);
+ kunmap_local(addr);
}
}
-#else /* CONFIG_DEBUG_SLAB || CONFIG_SLUB_DEBUG_ON */
+#else /* CONFIG_SLUB_DEBUG_ON */
static inline void check_element(mempool_t *pool, void *element)
{
}
static inline void poison_element(mempool_t *pool, void *element)
{
}
-#endif /* CONFIG_DEBUG_SLAB || CONFIG_SLUB_DEBUG_ON */
+#endif /* CONFIG_SLUB_DEBUG_ON */
- static __always_inline void kasan_poison_element(mempool_t *pool, void *element)
+ static __always_inline bool kasan_poison_element(mempool_t *pool, void *element)
{
if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
- kasan_slab_free_mempool(element);
+ return kasan_mempool_poison_object(element);
else if (pool->alloc == mempool_alloc_pages)
- kasan_poison_pages(element, (unsigned long)pool->pool_data,
- false);
+ return kasan_mempool_poison_pages(element,
+ (unsigned long)pool->pool_data);
+ return true;
}
static void kasan_unpoison_element(mempool_t *pool, void *element)
{
if (pool->alloc == mempool_kmalloc)
- kasan_unpoison_range(element, (size_t)pool->pool_data);
+ kasan_mempool_unpoison_object(element, (size_t)pool->pool_data);
else if (pool->alloc == mempool_alloc_slab)
- kasan_unpoison_range(element, kmem_cache_size(pool->pool_data));
+ kasan_mempool_unpoison_object(element,
+ kmem_cache_size(pool->pool_data));
else if (pool->alloc == mempool_alloc_pages)
- kasan_unpoison_pages(element, (unsigned long)pool->pool_data,
- false);
+ kasan_mempool_unpoison_pages(element,
+ (unsigned long)pool->pool_data);
}
static __always_inline void add_element(mempool_t *pool, void *element)
{
BUG_ON(pool->curr_nr >= pool->min_nr);
poison_element(pool, element);
- kasan_poison_element(pool, element);
- pool->elements[pool->curr_nr++] = element;
+ if (kasan_poison_element(pool, element))
+ pool->elements[pool->curr_nr++] = element;
}
static void *remove_element(mempool_t *pool)
}
EXPORT_SYMBOL(mempool_alloc);
+ /**
+ * mempool_alloc_preallocated - allocate an element from preallocated elements
+ * belonging to a specific memory pool
+ * @pool: pointer to the memory pool which was allocated via
+ * mempool_create().
+ *
+ * This function is similar to mempool_alloc, but it only attempts allocating
+ * an element from the preallocated elements. It does not sleep and immediately
+ * returns if no preallocated elements are available.
+ *
+ * Return: pointer to the allocated element or %NULL if no elements are
+ * available.
+ */
+ void *mempool_alloc_preallocated(mempool_t *pool)
+ {
+ void *element;
+ unsigned long flags;
+
+ spin_lock_irqsave(&pool->lock, flags);
+ if (likely(pool->curr_nr)) {
+ element = remove_element(pool);
+ spin_unlock_irqrestore(&pool->lock, flags);
+ /* paired with rmb in mempool_free(), read comment there */
+ smp_wmb();
+ /*
+ * Update the allocation stack trace as this is more useful
+ * for debugging.
+ */
+ kmemleak_update_trace(element);
+ return element;
+ }
+ spin_unlock_irqrestore(&pool->lock, flags);
+
+ return NULL;
+ }
+ EXPORT_SYMBOL(mempool_alloc_preallocated);
+
/**
* mempool_free - return an element to the pool.
* @element: pool element pointer.
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (folio_test_anon(folio))
- hugepage_add_anon_rmap(folio, vma, pvmw.address,
- rmap_flags);
+ hugetlb_add_anon_rmap(folio, vma, pvmw.address,
+ rmap_flags);
else
- page_dup_file_rmap(new, true);
+ hugetlb_add_file_rmap(folio);
set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte,
psize);
} else
#endif
{
if (folio_test_anon(folio))
- page_add_anon_rmap(new, vma, pvmw.address,
- rmap_flags);
+ folio_add_anon_rmap_pte(folio, new, vma,
+ pvmw.address, rmap_flags);
else
- page_add_file_rmap(new, vma, false);
+ folio_add_file_rmap_pte(folio, new, vma);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
if (vma->vm_flags & VM_LOCKED)
recheck_buffers:
busy = false;
- spin_lock(&mapping->private_lock);
+ spin_lock(&mapping->i_private_lock);
bh = head;
do {
if (atomic_read(&bh->b_count)) {
rc = -EAGAIN;
goto unlock_buffers;
}
- spin_unlock(&mapping->private_lock);
+ spin_unlock(&mapping->i_private_lock);
invalidate_bh_lrus();
invalidated = true;
goto recheck_buffers;
rc = MIGRATEPAGE_SUCCESS;
unlock_buffers:
if (check_refs)
- spin_unlock(&mapping->private_lock);
+ spin_unlock(&mapping->i_private_lock);
bh = head;
do {
unlock_buffer(bh);
}
/*
- * To record some information during migration, we use some unused
- * fields (mapping and private) of struct folio of the newly allocated
- * destination folio. This is safe because nobody is using them
- * except us.
+ * To record some information during migration, we use unused private
+ * field of struct folio of the newly allocated destination folio.
+ * This is safe because nobody is using it except us.
*/
- union migration_ptr {
- struct anon_vma *anon_vma;
- struct address_space *mapping;
- };
-
enum {
PAGE_WAS_MAPPED = BIT(0),
PAGE_WAS_MLOCKED = BIT(1),
+ PAGE_OLD_STATES = PAGE_WAS_MAPPED | PAGE_WAS_MLOCKED,
};
static void __migrate_folio_record(struct folio *dst,
- unsigned long old_page_state,
+ int old_page_state,
struct anon_vma *anon_vma)
{
- union migration_ptr ptr = { .anon_vma = anon_vma };
- dst->mapping = ptr.mapping;
- dst->private = (void *)old_page_state;
+ dst->private = (void *)anon_vma + old_page_state;
}
static void __migrate_folio_extract(struct folio *dst,
int *old_page_state,
struct anon_vma **anon_vmap)
{
- union migration_ptr ptr = { .mapping = dst->mapping };
- *anon_vmap = ptr.anon_vma;
- *old_page_state = (unsigned long)dst->private;
- dst->mapping = NULL;
+ unsigned long private = (unsigned long)dst->private;
+
+ *anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
+ *old_page_state = private & PAGE_OLD_STATES;
dst->private = NULL;
}
*/
pgoff = 0;
get_area = shmem_get_unmapped_area;
+ } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ /* Ensures that larger anonymous mappings are THP aligned. */
+ get_area = thp_get_unmapped_area;
}
addr = get_area(file, addr, len, pgoff, flags);
}
#endif
- /*
- * IA64 has some horrid mapping rules: it can expand both up and down,
- * but with various special rules.
- *
- * We'll get rid of this architecture eventually, so the ugliness is
- * temporary.
- */
- #ifdef CONFIG_IA64
- static inline bool vma_expand_ok(struct vm_area_struct *vma, unsigned long addr)
- {
- return REGION_NUMBER(addr) == REGION_NUMBER(vma->vm_start) &&
- REGION_OFFSET(addr) < RGN_MAP_LIMIT;
- }
-
- /*
- * IA64 stacks grow down, but there's a special register backing store
- * that can grow up. Only sequentially, though, so the new address must
- * match vm_end.
- */
- static inline int vma_expand_up(struct vm_area_struct *vma, unsigned long addr)
- {
- if (!vma_expand_ok(vma, addr))
- return -EFAULT;
- if (vma->vm_end != (addr & PAGE_MASK))
- return -EFAULT;
- return expand_upwards(vma, addr);
- }
-
- static inline bool vma_expand_down(struct vm_area_struct *vma, unsigned long addr)
- {
- if (!vma_expand_ok(vma, addr))
- return -EFAULT;
- return expand_downwards(vma, addr);
- }
-
- #elif defined(CONFIG_STACK_GROWSUP)
+ #if defined(CONFIG_STACK_GROWSUP)
#define vma_expand_up(vma,addr) expand_upwards(vma, addr)
#define vma_expand_down(vma, addr) (-EFAULT)
arch_exit_mmap(mm);
vma = mas_find(&mas, ULONG_MAX);
- if (!vma) {
+ if (!vma || unlikely(xa_is_zero(vma))) {
/* Can happen if dup_mmap() received an OOM */
mmap_read_unlock(mm);
- return;
+ mmap_write_lock(mm);
+ goto destroy;
}
lru_add_drain();
remove_vma(vma, true);
count++;
cond_resched();
- } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+ vma = mas_find(&mas, ULONG_MAX);
+ } while (vma && likely(!xa_is_zero(vma)));
BUG_ON(count != mm->map_count);
trace_exit_mmap(mm);
+ destroy:
__mt_destroy(&mm->mm_mt);
mmap_write_unlock(mm);
vm_unacct_memory(nr_accounted);
if (min_ratio > 100 * BDI_RATIO_SCALE)
return -EINVAL;
- min_ratio *= BDI_RATIO_SCALE;
spin_lock_bh(&bdi_lock);
if (min_ratio > bdi->max_ratio) {
ret = -EINVAL;
} else {
bdi->max_ratio = max_ratio;
- bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) / 100;
+ bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) /
+ (100 * BDI_RATIO_SCALE);
}
spin_unlock_bh(&bdi_lock);
return ret;
}
- bool __folio_start_writeback(struct folio *folio, bool keep_write)
+ void __folio_start_writeback(struct folio *folio, bool keep_write)
{
long nr = folio_nr_pages(folio);
struct address_space *mapping = folio_mapping(folio);
- bool ret;
int access_ret;
+ VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
+
folio_memcg_lock(folio);
if (mapping && mapping_use_writeback_tags(mapping)) {
XA_STATE(xas, &mapping->i_pages, folio_index(folio));
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;
+ bool on_wblist;
xas_lock_irqsave(&xas, flags);
xas_load(&xas);
- ret = folio_test_set_writeback(folio);
- if (!ret) {
- bool on_wblist;
+ folio_test_set_writeback(folio);
- on_wblist = mapping_tagged(mapping,
- PAGECACHE_TAG_WRITEBACK);
+ on_wblist = mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
- xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
- if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
- struct bdi_writeback *wb = inode_to_wb(inode);
-
- wb_stat_mod(wb, WB_WRITEBACK, nr);
- if (!on_wblist)
- wb_inode_writeback_start(wb);
- }
+ xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
+ if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+ struct bdi_writeback *wb = inode_to_wb(inode);
- /*
- * We can come through here when swapping
- * anonymous folios, so we don't necessarily
- * have an inode to track for sync.
- */
- if (mapping->host && !on_wblist)
- sb_mark_inode_writeback(mapping->host);
+ wb_stat_mod(wb, WB_WRITEBACK, nr);
+ if (!on_wblist)
+ wb_inode_writeback_start(wb);
}
+
+ /*
+ * We can come through here when swapping anonymous
+ * folios, so we don't necessarily have an inode to
+ * track for sync.
+ */
+ if (mapping->host && !on_wblist)
+ sb_mark_inode_writeback(mapping->host);
if (!folio_test_dirty(folio))
xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY);
if (!keep_write)
xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
xas_unlock_irqrestore(&xas, flags);
} else {
- ret = folio_test_set_writeback(folio);
- }
- if (!ret) {
- lruvec_stat_mod_folio(folio, NR_WRITEBACK, nr);
- zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
+ folio_test_set_writeback(folio);
}
+
+ lruvec_stat_mod_folio(folio, NR_WRITEBACK, nr);
+ zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
folio_memcg_unlock(folio);
+
access_ret = arch_make_folio_accessible(folio);
/*
* If writeback has been triggered on a page that cannot be made
* accessible, it is too late to recover here.
*/
VM_BUG_ON_FOLIO(access_ret != 0, folio);
-
- return ret;
}
EXPORT_SYMBOL(__folio_start_writeback);
#include <linux/memory.h>
#include <linux/math64.h>
#include <linux/fault-inject.h>
+#include <linux/kmemleak.h>
#include <linux/stacktrace.h>
#include <linux/prefetch.h>
#include <linux/memcontrol.h>
*
* Frozen slabs
*
- * If a slab is frozen then it is exempt from list management. It is not
- * on any list except per cpu partial list. The processor that froze the
+ * If a slab is frozen then it is exempt from list management. It is
+ * the cpu slab which is actively allocated from by the processor that
+ * froze it and it is not on any list. The processor that froze the
* slab is the one who can perform list operations on the slab. Other
* processors may put objects onto the freelist but the processor that
* froze the slab is the only one that can retrieve the objects from the
* slab's freelist.
*
+ * CPU partial slabs
+ *
+ * The partially empty slabs cached on the CPU partial list are used
+ * for performance reasons, which speeds up the allocation process.
+ * These slabs are not frozen, but are also exempt from list management,
+ * by clearing the PG_workingset flag when moving out of the node
+ * partial list. Please see __slab_free() for more details.
+ *
+ * To sum up, the current scheme is:
+ * - node partial slab: PG_Workingset && !frozen
+ * - cpu partial slab: !PG_Workingset && !frozen
+ * - cpu slab: !PG_Workingset && frozen
+ * - full slab: !PG_Workingset && !frozen
+ *
* list_lock
*
* The list_lock protects the partial and full list on each node and
/* Structure holding parameters for get_partial() call chain */
struct partial_context {
- struct slab **slab;
gfp_t flags;
unsigned int orig_size;
+ void *object;
};
static inline bool kmem_cache_debug(struct kmem_cache *s)
static inline void debugfs_slab_add(struct kmem_cache *s) { }
#endif
+enum stat_item {
+ ALLOC_FASTPATH, /* Allocation from cpu slab */
+ ALLOC_SLOWPATH, /* Allocation by getting a new cpu slab */
+ FREE_FASTPATH, /* Free to cpu slab */
+ FREE_SLOWPATH, /* Freeing not to cpu slab */
+ FREE_FROZEN, /* Freeing to frozen slab */
+ FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
+ FREE_REMOVE_PARTIAL, /* Freeing removes last object */
+ ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
+ ALLOC_SLAB, /* Cpu slab acquired from page allocator */
+ ALLOC_REFILL, /* Refill cpu slab from slab freelist */
+ ALLOC_NODE_MISMATCH, /* Switching cpu slab */
+ FREE_SLAB, /* Slab freed to the page allocator */
+ CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
+ DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
+ DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
+ DEACTIVATE_TO_HEAD, /* Cpu slab was moved to the head of partials */
+ DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
+ DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
+ DEACTIVATE_BYPASS, /* Implicit deactivation */
+ ORDER_FALLBACK, /* Number of times fallback was necessary */
+ CMPXCHG_DOUBLE_CPU_FAIL,/* Failures of this_cpu_cmpxchg_double */
+ CMPXCHG_DOUBLE_FAIL, /* Failures of slab freelist update */
+ CPU_PARTIAL_ALLOC, /* Used cpu partial on alloc */
+ CPU_PARTIAL_FREE, /* Refill cpu partial on free */
+ CPU_PARTIAL_NODE, /* Refill cpu partial from node partial */
+ CPU_PARTIAL_DRAIN, /* Drain cpu partial to node partial */
+ NR_SLUB_STAT_ITEMS
+};
+
+#ifndef CONFIG_SLUB_TINY
+/*
+ * When changing the layout, make sure freelist and tid are still compatible
+ * with this_cpu_cmpxchg_double() alignment requirements.
+ */
+struct kmem_cache_cpu {
+ union {
+ struct {
+ void **freelist; /* Pointer to next available object */
+ unsigned long tid; /* Globally unique transaction id */
+ };
+ freelist_aba_t freelist_tid;
+ };
+ struct slab *slab; /* The slab from which we are allocating */
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+ struct slab *partial; /* Partially allocated frozen slabs */
+#endif
+ local_lock_t lock; /* Protects the fields above */
+#ifdef CONFIG_SLUB_STATS
+ unsigned int stat[NR_SLUB_STAT_ITEMS];
+#endif
+};
+#endif /* CONFIG_SLUB_TINY */
+
static inline void stat(const struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
#endif
}
+static inline
+void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
+{
+#ifdef CONFIG_SLUB_STATS
+ raw_cpu_add(s->cpu_slab->stat[si], v);
+#endif
+}
+
+/*
+ * The slab lists for all objects.
+ */
+struct kmem_cache_node {
+ spinlock_t list_lock;
+ unsigned long nr_partial;
+ struct list_head partial;
+#ifdef CONFIG_SLUB_DEBUG
+ atomic_long_t nr_slabs;
+ atomic_long_t total_objects;
+ struct list_head full;
+#endif
+};
+
+static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
+{
+ return s->node[node];
+}
+
+/*
+ * Iterator over all nodes. The body will be executed for each node that has
+ * a kmem_cache_node structure allocated (which is true for all online nodes)
+ */
+#define for_each_kmem_cache_node(__s, __node, __n) \
+ for (__node = 0; __node < nr_node_ids; __node++) \
+ if ((__n = get_node(__s, __node)))
+
/*
* Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
* Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
struct page *page = slab_page(slab);
VM_BUG_ON_PAGE(PageTail(page), page);
- __bit_spin_unlock(PG_locked, &page->flags);
+ bit_spin_unlock(PG_locked, &page->flags);
}
static inline bool
void *object, unsigned int orig_size)
{
void *p = kasan_reset_tag(object);
+ unsigned int kasan_meta_size;
if (!slub_debug_orig_size(s))
return;
- #ifdef CONFIG_KASAN_GENERIC
/*
- * KASAN could save its free meta data in object's data area at
- * offset 0, if the size is larger than 'orig_size', it will
- * overlap the data redzone in [orig_size+1, object_size], and
- * the check should be skipped.
+ * KASAN can save its free meta data inside of the object at offset 0.
+ * If this meta data size is larger than 'orig_size', it will overlap
+ * the data redzone in [orig_size+1, object_size]. Thus, we adjust
+ * 'orig_size' to be as at least as big as KASAN's meta data.
*/
- if (kasan_metadata_size(s, true) > orig_size)
- orig_size = s->object_size;
- #endif
+ kasan_meta_size = kasan_metadata_size(s, true);
+ if (kasan_meta_size > orig_size)
+ orig_size = kasan_meta_size;
p += get_info_end(s);
p += sizeof(struct track) * 2;
{
u8 *p = object;
u8 *endobject = object + s->object_size;
- unsigned int orig_size;
+ unsigned int orig_size, kasan_meta_size;
if (s->flags & SLAB_RED_ZONE) {
if (!check_bytes_and_report(s, slab, object, "Left Redzone",
}
if (s->flags & SLAB_POISON) {
- if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
- (!check_bytes_and_report(s, slab, p, "Poison", p,
- POISON_FREE, s->object_size - 1) ||
- !check_bytes_and_report(s, slab, p, "End Poison",
- p + s->object_size - 1, POISON_END, 1)))
- return 0;
+ if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON)) {
+ /*
+ * KASAN can save its free meta data inside of the
+ * object at offset 0. Thus, skip checking the part of
+ * the redzone that overlaps with the meta data.
+ */
+ kasan_meta_size = kasan_metadata_size(s, true);
+ if (kasan_meta_size < s->object_size - 1 &&
+ !check_bytes_and_report(s, slab, p, "Poison",
+ p + kasan_meta_size, POISON_FREE,
+ s->object_size - kasan_meta_size - 1))
+ return 0;
+ if (kasan_meta_size < s->object_size &&
+ !check_bytes_and_report(s, slab, p, "End Poison",
+ p + s->object_size - 1, POISON_END, 1))
+ return 0;
+ }
/*
* check_pad_bytes cleans up on its own.
*/
#endif
#endif /* CONFIG_SLUB_DEBUG */
+static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
+{
+ return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline void memcg_free_slab_cgroups(struct slab *slab)
+{
+ kfree(slab_objcgs(slab));
+ slab->memcg_data = 0;
+}
+
+static inline size_t obj_full_size(struct kmem_cache *s)
+{
+ /*
+ * For each accounted object there is an extra space which is used
+ * to store obj_cgroup membership. Charge it too.
+ */
+ return s->size + sizeof(struct obj_cgroup *);
+}
+
+/*
+ * Returns false if the allocation should fail.
+ */
+static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+ struct list_lru *lru,
+ struct obj_cgroup **objcgp,
+ size_t objects, gfp_t flags)
+{
+ /*
+ * The obtained objcg pointer is safe to use within the current scope,
+ * defined by current task or set_active_memcg() pair.
+ * obj_cgroup_get() is used to get a permanent reference.
+ */
+ struct obj_cgroup *objcg = current_obj_cgroup();
+ if (!objcg)
+ return true;
+
+ if (lru) {
+ int ret;
+ struct mem_cgroup *memcg;
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ ret = memcg_list_lru_alloc(memcg, lru, flags);
+ css_put(&memcg->css);
+
+ if (ret)
+ return false;
+ }
+
+ if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s)))
+ return false;
+
+ *objcgp = objcg;
+ return true;
+}
+
+/*
+ * Returns false if the allocation should fail.
+ */
+static __fastpath_inline
+bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
+ struct obj_cgroup **objcgp, size_t objects,
+ gfp_t flags)
+{
+ if (!memcg_kmem_online())
+ return true;
+
+ if (likely(!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT)))
+ return true;
+
+ return likely(__memcg_slab_pre_alloc_hook(s, lru, objcgp, objects,
+ flags));
+}
+
+static void __memcg_slab_post_alloc_hook(struct kmem_cache *s,
+ struct obj_cgroup *objcg,
+ gfp_t flags, size_t size,
+ void **p)
+{
+ struct slab *slab;
+ unsigned long off;
+ size_t i;
+
+ flags &= gfp_allowed_mask;
+
+ for (i = 0; i < size; i++) {
+ if (likely(p[i])) {
+ slab = virt_to_slab(p[i]);
+
+ if (!slab_objcgs(slab) &&
+ memcg_alloc_slab_cgroups(slab, s, flags, false)) {
+ obj_cgroup_uncharge(objcg, obj_full_size(s));
+ continue;
+ }
+
+ off = obj_to_index(s, slab, p[i]);
+ obj_cgroup_get(objcg);
+ slab_objcgs(slab)[off] = objcg;
+ mod_objcg_state(objcg, slab_pgdat(slab),
+ cache_vmstat_idx(s), obj_full_size(s));
+ } else {
+ obj_cgroup_uncharge(objcg, obj_full_size(s));
+ }
+ }
+}
+
+static __fastpath_inline
+void memcg_slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
+ gfp_t flags, size_t size, void **p)
+{
+ if (likely(!memcg_kmem_online() || !objcg))
+ return;
+
+ return __memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
+}
+
+static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects,
+ struct obj_cgroup **objcgs)
+{
+ for (int i = 0; i < objects; i++) {
+ struct obj_cgroup *objcg;
+ unsigned int off;
+
+ off = obj_to_index(s, slab, p[i]);
+ objcg = objcgs[off];
+ if (!objcg)
+ continue;
+
+ objcgs[off] = NULL;
+ obj_cgroup_uncharge(objcg, obj_full_size(s));
+ mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
+ -obj_full_size(s));
+ obj_cgroup_put(objcg);
+ }
+}
+
+static __fastpath_inline
+void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
+ int objects)
+{
+ struct obj_cgroup **objcgs;
+
+ if (!memcg_kmem_online())
+ return;
+
+ objcgs = slab_objcgs(slab);
+ if (likely(!objcgs))
+ return;
+
+ __memcg_slab_free_hook(s, slab, p, objects, objcgs);
+}
+
+static inline
+void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects,
+ struct obj_cgroup *objcg)
+{
+ if (objcg)
+ obj_cgroup_uncharge(objcg, objects * obj_full_size(s));
+}
+#else /* CONFIG_MEMCG_KMEM */
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
+{
+ return NULL;
+}
+
+static inline void memcg_free_slab_cgroups(struct slab *slab)
+{
+}
+
+static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+ struct list_lru *lru,
+ struct obj_cgroup **objcgp,
+ size_t objects, gfp_t flags)
+{
+ return true;
+}
+
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+ struct obj_cgroup *objcg,
+ gfp_t flags, size_t size,
+ void **p)
+{
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects)
+{
+}
+
+static inline
+void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects,
+ struct obj_cgroup *objcg)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
/*
* Hooks for other subsystems that check memory allocations. In a typical
* production configuration these hooks all should produce no code at all.
+ *
+ * Returns true if freeing of the object can proceed, false if its reuse
+ * was delayed by KASAN quarantine, or it was returned to KFENCE.
*/
-static __always_inline bool slab_free_hook(struct kmem_cache *s,
- void *x, bool init)
+static __always_inline
+bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
{
kmemleak_free_recursive(x, s->flags);
kmsan_slab_free(s, x);
__kcsan_check_access(x, s->object_size,
KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
+ if (kfence_free(x))
+ return false;
+
/*
* As memory initialization might be integrated into KASAN,
* kasan_slab_free and initialization memset's must be
* The initialization memset's clear the object and the metadata,
* but don't touch the SLAB redzone.
*/
- if (init) {
+ if (unlikely(init)) {
int rsize;
if (!kasan_has_integrated_init())
s->size - s->inuse - rsize);
}
/* KASAN might put x into memory quarantine, delaying its reuse. */
- return kasan_slab_free(s, x, init);
+ return !kasan_slab_free(s, x, init);
}
static inline bool slab_free_freelist_hook(struct kmem_cache *s,
void *object;
void *next = *head;
- void *old_tail = *tail ? *tail : *head;
+ void *old_tail = *tail;
+ bool init;
if (is_kfence_address(next)) {
slab_free_hook(s, next, false);
- return true;
+ return false;
}
/* Head and tail of the reconstructed freelist */
*head = NULL;
*tail = NULL;
+ init = slab_want_init_on_free(s);
+
do {
object = next;
next = get_freepointer(s, object);
/* If object's reuse doesn't have to be delayed */
- if (!slab_free_hook(s, object, slab_want_init_on_free(s))) {
+ if (likely(slab_free_hook(s, object, init))) {
/* Move object to the new freelist */
set_freepointer(s, object, *head);
*head = object;
}
} while (object != old_tail);
- if (*head == *tail)
- *tail = NULL;
-
return *head != NULL;
}
setup_object_debug(s, object);
object = kasan_init_slab_obj(s, object);
if (unlikely(s->ctor)) {
- kasan_unpoison_object_data(s, object);
+ kasan_unpoison_new_object(s, object);
s->ctor(object);
- kasan_poison_object_data(s, object);
+ kasan_poison_new_object(s, object);
}
return object;
}
struct slab *slab;
unsigned int order = oo_order(oo);
- if (node == NUMA_NO_NODE)
- folio = (struct folio *)alloc_pages(flags, order);
- else
- folio = (struct folio *)__alloc_pages_node(node, flags, order);
-
+ folio = (struct folio *)alloc_pages_node(node, flags, order);
if (!folio)
return NULL;
}
#endif /* CONFIG_SLAB_FREELIST_RANDOM */
+static __always_inline void account_slab(struct slab *slab, int order,
+ struct kmem_cache *s, gfp_t gfp)
+{
+ if (memcg_kmem_online() && (s->flags & SLAB_ACCOUNT))
+ memcg_alloc_slab_cgroups(slab, s, gfp, true);
+
+ mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
+ PAGE_SIZE << order);
+}
+
+static __always_inline void unaccount_slab(struct slab *slab, int order,
+ struct kmem_cache *s)
+{
+ if (memcg_kmem_online())
+ memcg_free_slab_cgroups(slab);
+
+ mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
+ -(PAGE_SIZE << order));
+}
+
static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
struct slab *slab;
free_slab(s, slab);
}
+/*
+ * SLUB reuses PG_workingset bit to keep track of whether it's on
+ * the per-node partial list.
+ */
+static inline bool slab_test_node_partial(const struct slab *slab)
+{
+ return folio_test_workingset((struct folio *)slab_folio(slab));
+}
+
+static inline void slab_set_node_partial(struct slab *slab)
+{
+ set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+}
+
+static inline void slab_clear_node_partial(struct slab *slab)
+{
+ clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+}
+
/*
* Management of partially allocated slabs.
*/
list_add_tail(&slab->slab_list, &n->partial);
else
list_add(&slab->slab_list, &n->partial);
+ slab_set_node_partial(slab);
}
static inline void add_partial(struct kmem_cache_node *n,
{
lockdep_assert_held(&n->list_lock);
list_del(&slab->slab_list);
+ slab_clear_node_partial(slab);
n->nr_partial--;
}
/*
- * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
+ * Called only for kmem_cache_debug() caches instead of remove_partial(), with a
* slab from the n->partial list. Remove only a single object from the slab, do
* the alloc_debug_processing() checks and leave the slab on the list, or move
* it to full list if it was the last free object.
return object;
}
-/*
- * Remove slab from the partial list, freeze it and
- * return the pointer to the freelist.
- *
- * Returns a list of objects or NULL if it fails.
- */
-static inline void *acquire_slab(struct kmem_cache *s,
- struct kmem_cache_node *n, struct slab *slab,
- int mode)
-{
- void *freelist;
- unsigned long counters;
- struct slab new;
-
- lockdep_assert_held(&n->list_lock);
-
- /*
- * Zap the freelist and set the frozen bit.
- * The old freelist is the list of objects for the
- * per cpu allocation list.
- */
- freelist = slab->freelist;
- counters = slab->counters;
- new.counters = counters;
- if (mode) {
- new.inuse = slab->objects;
- new.freelist = NULL;
- } else {
- new.freelist = freelist;
- }
-
- VM_BUG_ON(new.frozen);
- new.frozen = 1;
-
- if (!__slab_update_freelist(s, slab,
- freelist, counters,
- new.freelist, new.counters,
- "acquire_slab"))
- return NULL;
-
- remove_partial(n, slab);
- WARN_ON(!freelist);
- return freelist;
-}
-
#ifdef CONFIG_SLUB_CPU_PARTIAL
static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
#else
/*
* Try to allocate a partial slab from a specific node.
*/
-static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct partial_context *pc)
+static struct slab *get_partial_node(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
{
- struct slab *slab, *slab2;
- void *object = NULL;
+ struct slab *slab, *slab2, *partial = NULL;
unsigned long flags;
unsigned int partial_slabs = 0;
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
- void *t;
-
if (!pfmemalloc_match(slab, pc->flags))
continue;
if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
- object = alloc_single_from_partial(s, n, slab,
+ void *object = alloc_single_from_partial(s, n, slab,
pc->orig_size);
- if (object)
+ if (object) {
+ partial = slab;
+ pc->object = object;
break;
+ }
continue;
}
- t = acquire_slab(s, n, slab, object == NULL);
- if (!t)
- break;
+ remove_partial(n, slab);
- if (!object) {
- *pc->slab = slab;
+ if (!partial) {
+ partial = slab;
stat(s, ALLOC_FROM_PARTIAL);
- object = t;
} else {
put_cpu_partial(s, slab, 0);
stat(s, CPU_PARTIAL_NODE);
}
spin_unlock_irqrestore(&n->list_lock, flags);
- return object;
+ return partial;
}
/*
* Get a slab from somewhere. Search in increasing NUMA distances.
*/
-static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static struct slab *get_any_partial(struct kmem_cache *s,
+ struct partial_context *pc)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zoneref *z;
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(pc->flags);
- void *object;
+ struct slab *slab;
unsigned int cpuset_mems_cookie;
/*
if (n && cpuset_zone_allowed(zone, pc->flags) &&
n->nr_partial > s->min_partial) {
- object = get_partial_node(s, n, pc);
- if (object) {
+ slab = get_partial_node(s, n, pc);
+ if (slab) {
/*
* Don't check read_mems_allowed_retry()
* here - if mems_allowed was updated in
* between allocation and the cpuset
* update
*/
- return object;
+ return slab;
}
}
}
/*
* Get a partial slab, lock it and return it.
*/
-static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
+static struct slab *get_partial(struct kmem_cache *s, int node,
+ struct partial_context *pc)
{
- void *object;
+ struct slab *slab;
int searchnode = node;
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
- object = get_partial_node(s, get_node(s, searchnode), pc);
- if (object || node != NUMA_NO_NODE)
- return object;
+ slab = get_partial_node(s, get_node(s, searchnode), pc);
+ if (slab || node != NUMA_NO_NODE)
+ return slab;
return get_any_partial(s, pc);
}
static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
void *freelist)
{
- enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
struct kmem_cache_node *n = get_node(s, slab_nid(slab));
int free_delta = 0;
- enum slab_modes mode = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
unsigned long flags = 0;
/*
* Stage two: Unfreeze the slab while splicing the per-cpu
* freelist to the head of slab's freelist.
- *
- * Ensure that the slab is unfrozen while the list presence
- * reflects the actual number of objects during unfreeze.
- *
- * We first perform cmpxchg holding lock and insert to list
- * when it succeed. If there is mismatch then the slab is not
- * unfrozen and number of objects in the slab may have changed.
- * Then release lock and retry cmpxchg again.
*/
-redo:
-
- old.freelist = READ_ONCE(slab->freelist);
- old.counters = READ_ONCE(slab->counters);
- VM_BUG_ON(!old.frozen);
-
- /* Determine target state of the slab */
- new.counters = old.counters;
- if (freelist_tail) {
- new.inuse -= free_delta;
- set_freepointer(s, freelist_tail, old.freelist);
- new.freelist = freelist;
- } else
- new.freelist = old.freelist;
-
- new.frozen = 0;
+ do {
+ old.freelist = READ_ONCE(slab->freelist);
+ old.counters = READ_ONCE(slab->counters);
+ VM_BUG_ON(!old.frozen);
+
+ /* Determine target state of the slab */
+ new.counters = old.counters;
+ new.frozen = 0;
+ if (freelist_tail) {
+ new.inuse -= free_delta;
+ set_freepointer(s, freelist_tail, old.freelist);
+ new.freelist = freelist;
+ } else {
+ new.freelist = old.freelist;
+ }
+ } while (!slab_update_freelist(s, slab,
+ old.freelist, old.counters,
+ new.freelist, new.counters,
+ "unfreezing slab"));
+ /*
+ * Stage three: Manipulate the slab list based on the updated state.
+ */
if (!new.inuse && n->nr_partial >= s->min_partial) {
- mode = M_FREE;
+ stat(s, DEACTIVATE_EMPTY);
+ discard_slab(s, slab);
+ stat(s, FREE_SLAB);
} else if (new.freelist) {
- mode = M_PARTIAL;
- /*
- * Taking the spinlock removes the possibility that
- * acquire_slab() will see a slab that is frozen
- */
spin_lock_irqsave(&n->list_lock, flags);
- } else {
- mode = M_FULL_NOLIST;
- }
-
-
- if (!slab_update_freelist(s, slab,
- old.freelist, old.counters,
- new.freelist, new.counters,
- "unfreezing slab")) {
- if (mode == M_PARTIAL)
- spin_unlock_irqrestore(&n->list_lock, flags);
- goto redo;
- }
-
-
- if (mode == M_PARTIAL) {
add_partial(n, slab, tail);
spin_unlock_irqrestore(&n->list_lock, flags);
stat(s, tail);
- } else if (mode == M_FREE) {
- stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, slab);
- stat(s, FREE_SLAB);
- } else if (mode == M_FULL_NOLIST) {
+ } else {
stat(s, DEACTIVATE_FULL);
}
}
#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
+static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
{
struct kmem_cache_node *n = NULL, *n2 = NULL;
struct slab *slab, *slab_to_discard = NULL;
unsigned long flags = 0;
while (partial_slab) {
- struct slab new;
- struct slab old;
-
slab = partial_slab;
partial_slab = slab->next;
spin_lock_irqsave(&n->list_lock, flags);
}
- do {
-
- old.freelist = slab->freelist;
- old.counters = slab->counters;
- VM_BUG_ON(!old.frozen);
-
- new.counters = old.counters;
- new.freelist = old.freelist;
-
- new.frozen = 0;
-
- } while (!__slab_update_freelist(s, slab,
- old.freelist, old.counters,
- new.freelist, new.counters,
- "unfreezing slab"));
-
- if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
+ if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
slab->next = slab_to_discard;
slab_to_discard = slab;
} else {
}
/*
- * Unfreeze all the cpu partial slabs.
+ * Put all the cpu partial slabs to the node partial list.
*/
-static void unfreeze_partials(struct kmem_cache *s)
+static void put_partials(struct kmem_cache *s)
{
struct slab *partial_slab;
unsigned long flags;
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
if (partial_slab)
- __unfreeze_partials(s, partial_slab);
+ __put_partials(s, partial_slab);
}
-static void unfreeze_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
+static void put_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
{
struct slab *partial_slab;
c->partial = NULL;
if (partial_slab)
- __unfreeze_partials(s, partial_slab);
+ __put_partials(s, partial_slab);
}
/*
- * Put a slab that was just frozen (in __slab_free|get_partial_node) into a
- * partial slab slot if available.
+ * Put a slab into a partial slab slot if available.
*
* If we did not find a slot then simply move all the partials to the
* per node partial list.
static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
{
struct slab *oldslab;
- struct slab *slab_to_unfreeze = NULL;
+ struct slab *slab_to_put = NULL;
unsigned long flags;
int slabs = 0;
* per node partial list. Postpone the actual unfreezing
* outside of the critical section.
*/
- slab_to_unfreeze = oldslab;
+ slab_to_put = oldslab;
oldslab = NULL;
} else {
slabs = oldslab->slabs;
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
- if (slab_to_unfreeze) {
- __unfreeze_partials(s, slab_to_unfreeze);
+ if (slab_to_put) {
+ __put_partials(s, slab_to_put);
stat(s, CPU_PARTIAL_DRAIN);
}
}
#else /* CONFIG_SLUB_CPU_PARTIAL */
-static inline void unfreeze_partials(struct kmem_cache *s) { }
-static inline void unfreeze_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c) { }
+static inline void put_partials(struct kmem_cache *s) { }
+static inline void put_partials_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c) { }
#endif /* CONFIG_SLUB_CPU_PARTIAL */
stat(s, CPUSLAB_FLUSH);
}
- unfreeze_partials_cpu(s, c);
+ put_partials_cpu(s, c);
}
struct slub_flush_work {
if (c->slab)
flush_slab(s, c);
- unfreeze_partials(s);
+ put_partials(s);
}
static bool has_cpu_slab(int cpu, struct kmem_cache *s)
return freelist;
}
+/*
+ * Freeze the partial slab and return the pointer to the freelist.
+ */
+static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
+{
+ struct slab new;
+ unsigned long counters;
+ void *freelist;
+
+ do {
+ freelist = slab->freelist;
+ counters = slab->counters;
+
+ new.counters = counters;
+ VM_BUG_ON(new.frozen);
+
+ new.inuse = slab->objects;
+ new.frozen = 1;
+
+ } while (!slab_update_freelist(s, slab,
+ freelist, counters,
+ NULL, new.counters,
+ "freeze_slab"));
+
+ return freelist;
+}
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
node = NUMA_NO_NODE;
goto new_slab;
}
-redo:
if (unlikely(!node_match(slab, node))) {
/*
new_slab:
- if (slub_percpu_partial(c)) {
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+ while (slub_percpu_partial(c)) {
local_lock_irqsave(&s->cpu_slab->lock, flags);
if (unlikely(c->slab)) {
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto new_objects;
}
- slab = c->slab = slub_percpu_partial(c);
+ slab = slub_percpu_partial(c);
slub_set_percpu_partial(c, slab);
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
stat(s, CPU_PARTIAL_ALLOC);
- goto redo;
+
+ if (unlikely(!node_match(slab, node) ||
+ !pfmemalloc_match(slab, gfpflags))) {
+ slab->next = NULL;
+ __put_partials(s, slab);
+ continue;
+ }
+
+ freelist = freeze_slab(s, slab);
+ goto retry_load_slab;
}
+#endif
new_objects:
pc.flags = gfpflags;
- pc.slab = &slab;
pc.orig_size = orig_size;
- freelist = get_partial(s, node, &pc);
- if (freelist)
- goto check_new_slab;
+ slab = get_partial(s, node, &pc);
+ if (slab) {
+ if (kmem_cache_debug(s)) {
+ freelist = pc.object;
+ /*
+ * For debug caches here we had to go through
+ * alloc_single_from_partial() so just store the
+ * tracking info and return the object.
+ */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, freelist, TRACK_ALLOC, addr);
+
+ return freelist;
+ }
+
+ freelist = freeze_slab(s, slab);
+ goto retry_load_slab;
+ }
slub_put_cpu_ptr(s->cpu_slab);
slab = new_slab(s, gfpflags, node);
inc_slabs_node(s, slab_nid(slab), slab->objects);
-check_new_slab:
-
- if (kmem_cache_debug(s)) {
- /*
- * For debug caches here we had to go through
- * alloc_single_from_partial() so just store the tracking info
- * and return the object
- */
- if (s->flags & SLAB_STORE_USER)
- set_track(s, freelist, TRACK_ALLOC, addr);
-
- return freelist;
- }
-
if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
/*
* For !pfmemalloc_match() case we don't load freelist so that
void *object;
pc.flags = gfpflags;
- pc.slab = &slab;
pc.orig_size = orig_size;
- object = get_partial(s, node, &pc);
+ slab = get_partial(s, node, &pc);
- if (object)
- return object;
+ if (slab)
+ return pc.object;
slab = new_slab(s, gfpflags, node);
if (unlikely(!slab)) {
0, sizeof(void *));
}
+noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
+{
+ if (__should_failslab(s, gfpflags))
+ return -ENOMEM;
+ return 0;
+}
+ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
+
+static __fastpath_inline
+struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
+ struct list_lru *lru,
+ struct obj_cgroup **objcgp,
+ size_t size, gfp_t flags)
+{
+ flags &= gfp_allowed_mask;
+
+ might_alloc(flags);
+
+ if (unlikely(should_failslab(s, flags)))
+ return NULL;
+
+ if (unlikely(!memcg_slab_pre_alloc_hook(s, lru, objcgp, size, flags)))
+ return NULL;
+
+ return s;
+}
+
+static __fastpath_inline
+void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
+ gfp_t flags, size_t size, void **p, bool init,
+ unsigned int orig_size)
+{
+ unsigned int zero_size = s->object_size;
+ bool kasan_init = init;
+ size_t i;
+ gfp_t init_flags = flags & gfp_allowed_mask;
+
+ /*
+ * For kmalloc object, the allocated memory size(object_size) is likely
+ * larger than the requested size(orig_size). If redzone check is
+ * enabled for the extra space, don't zero it, as it will be redzoned
+ * soon. The redzone operation for this extra space could be seen as a
+ * replacement of current poisoning under certain debug option, and
+ * won't break other sanity checks.
+ */
+ if (kmem_cache_debug_flags(s, SLAB_STORE_USER | SLAB_RED_ZONE) &&
+ (s->flags & SLAB_KMALLOC))
+ zero_size = orig_size;
+
+ /*
+ * When slub_debug is enabled, avoid memory initialization integrated
+ * into KASAN and instead zero out the memory via the memset below with
+ * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
+ * cause false-positive reports. This does not lead to a performance
+ * penalty on production builds, as slub_debug is not intended to be
+ * enabled there.
+ */
+ if (__slub_debug_enabled())
+ kasan_init = false;
+
+ /*
+ * As memory initialization might be integrated into KASAN,
+ * kasan_slab_alloc and initialization memset must be
+ * kept together to avoid discrepancies in behavior.
+ *
+ * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
+ */
+ for (i = 0; i < size; i++) {
+ p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
+ if (p[i] && init && (!kasan_init ||
+ !kasan_has_integrated_init()))
+ memset(p[i], 0, zero_size);
+ kmemleak_alloc_recursive(p[i], s->object_size, 1,
+ s->flags, init_flags);
+ kmsan_slab_alloc(s, p[i], init_flags);
+ }
+
+ memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
+}
+
/*
* Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
* have the fastpath folded into their functions. So no function call
bool init = false;
s = slab_pre_alloc_hook(s, lru, &objcg, 1, gfpflags);
- if (!s)
+ if (unlikely(!s))
return NULL;
object = kfence_alloc(s, orig_size, gfpflags);
return object;
}
-static __fastpath_inline void *slab_alloc(struct kmem_cache *s, struct list_lru *lru,
- gfp_t gfpflags, unsigned long addr, size_t orig_size)
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
- return slab_alloc_node(s, lru, gfpflags, NUMA_NO_NODE, addr, orig_size);
+ void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE, _RET_IP_,
+ s->object_size);
+
+ trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
+
+ return ret;
}
+EXPORT_SYMBOL(kmem_cache_alloc);
-static __fastpath_inline
-void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
- gfp_t gfpflags)
+void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+ gfp_t gfpflags)
{
- void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
+ void *ret = slab_alloc_node(s, lru, gfpflags, NUMA_NO_NODE, _RET_IP_,
+ s->object_size);
trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
return ret;
}
+EXPORT_SYMBOL(kmem_cache_alloc_lru);
-void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+/**
+ * kmem_cache_alloc_node - Allocate an object on the specified node
+ * @s: The cache to allocate from.
+ * @gfpflags: See kmalloc().
+ * @node: node number of the target node.
+ *
+ * Identical to kmem_cache_alloc but it will allocate memory on the given
+ * node, which can improve the performance for cpu bound structures.
+ *
+ * Fallback to other node is possible if __GFP_THISNODE is not set.
+ *
+ * Return: pointer to the new object or %NULL in case of error
+ */
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
- return __kmem_cache_alloc_lru(s, NULL, gfpflags);
+ void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size);
+
+ trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, node);
+
+ return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc);
+EXPORT_SYMBOL(kmem_cache_alloc_node);
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
- gfp_t gfpflags)
+/*
+ * To avoid unnecessary overhead, we pass through large allocation requests
+ * directly to the page allocator. We use __GFP_COMP, because we will need to
+ * know the allocation order to free the pages properly in kfree.
+ */
+static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
{
- struct page *page;
- return __kmem_cache_alloc_lru(s, lru, gfpflags);
++ struct folio *folio;
+ void *ptr = NULL;
+ unsigned int order = get_order(size);
+
+ if (unlikely(flags & GFP_SLAB_BUG_MASK))
+ flags = kmalloc_fix_flags(flags);
+
+ flags |= __GFP_COMP;
- page = alloc_pages_node(node, flags, order);
- if (page) {
- ptr = page_address(page);
- mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
++ folio = (struct folio *)alloc_pages_node(node, flags, order);
++ if (folio) {
++ ptr = folio_address(folio);
++ lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
+ PAGE_SIZE << order);
+ }
+
+ ptr = kasan_kmalloc_large(ptr, size, flags);
+ /* As ptr might get tagged, call kmemleak hook after KASAN. */
+ kmemleak_alloc(ptr, size, 1, flags);
+ kmsan_kmalloc_large(ptr, size, flags);
+
+ return ptr;
}
-EXPORT_SYMBOL(kmem_cache_alloc_lru);
-void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
- int node, size_t orig_size,
- unsigned long caller)
+void *kmalloc_large(size_t size, gfp_t flags)
{
- return slab_alloc_node(s, NULL, gfpflags, node,
- caller, orig_size);
+ void *ret = __kmalloc_large_node(size, flags, NUMA_NO_NODE);
+
+ trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << get_order(size),
+ flags, NUMA_NO_NODE);
+ return ret;
}
+EXPORT_SYMBOL(kmalloc_large);
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+void *kmalloc_large_node(size_t size, gfp_t flags, int node)
{
- void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size);
+ void *ret = __kmalloc_large_node(size, flags, node);
- trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, node);
+ trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << get_order(size),
+ flags, node);
+ return ret;
+}
+EXPORT_SYMBOL(kmalloc_large_node);
+static __always_inline
+void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
+ unsigned long caller)
+{
+ struct kmem_cache *s;
+ void *ret;
+
+ if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
+ ret = __kmalloc_large_node(size, flags, node);
+ trace_kmalloc(caller, ret, size,
+ PAGE_SIZE << get_order(size), flags, node);
+ return ret;
+ }
+
+ if (unlikely(!size))
+ return ZERO_SIZE_PTR;
+
+ s = kmalloc_slab(size, flags, caller);
+
+ ret = slab_alloc_node(s, NULL, flags, node, caller, size);
+ ret = kasan_kmalloc(s, ret, size, flags);
+ trace_kmalloc(caller, ret, size, s->size, flags, node);
return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_node);
+
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ return __do_kmalloc_node(size, flags, node, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+ return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
+}
+EXPORT_SYMBOL(__kmalloc);
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
+ int node, unsigned long caller)
+{
+ return __do_kmalloc_node(size, flags, node, caller);
+}
+EXPORT_SYMBOL(__kmalloc_node_track_caller);
+
+void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
+{
+ void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE,
+ _RET_IP_, size);
+
+ trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, NUMA_NO_NODE);
+
+ ret = kasan_kmalloc(s, ret, size, gfpflags);
+ return ret;
+}
+EXPORT_SYMBOL(kmalloc_trace);
+
+void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+ int node, size_t size)
+{
+ void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, size);
+
+ trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, node);
+
+ ret = kasan_kmalloc(s, ret, size, gfpflags);
+ return ret;
+}
+EXPORT_SYMBOL(kmalloc_node_trace);
static noinline void free_to_partial_list(
struct kmem_cache *s, struct slab *slab,
unsigned long counters;
struct kmem_cache_node *n = NULL;
unsigned long flags;
+ bool on_node_partial;
stat(s, FREE_SLOWPATH);
- if (kfence_free(head))
- return;
-
if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
free_to_partial_list(s, slab, head, tail, cnt, addr);
return;
was_frozen = new.frozen;
new.inuse -= cnt;
if ((!new.inuse || !prior) && !was_frozen) {
-
- if (kmem_cache_has_cpu_partial(s) && !prior) {
-
- /*
- * Slab was on no list before and will be
- * partially empty
- * We can defer the list move and instead
- * freeze it.
- */
- new.frozen = 1;
-
- } else { /* Needs to be taken off a list */
+ /* Needs to be taken off a list */
+ if (!kmem_cache_has_cpu_partial(s) || prior) {
n = get_node(s, slab_nid(slab));
/*
*/
spin_lock_irqsave(&n->list_lock, flags);
+ on_node_partial = slab_test_node_partial(slab);
}
}
* activity can be necessary.
*/
stat(s, FREE_FROZEN);
- } else if (new.frozen) {
+ } else if (kmem_cache_has_cpu_partial(s) && !prior) {
/*
- * If we just froze the slab then put it onto the
+ * If we started with a full slab then put it onto the
* per cpu partial list.
*/
put_cpu_partial(s, slab, 1);
return;
}
+ /*
+ * This slab was partially empty but not on the per-node partial list,
+ * in which case we shouldn't manipulate its list, just return.
+ */
+ if (prior && !on_node_partial) {
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return;
+ }
+
if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
goto slab_empty;
struct slab *slab, void *head, void *tail,
int cnt, unsigned long addr)
{
- void *tail_obj = tail ? : head;
struct kmem_cache_cpu *c;
unsigned long tid;
void **freelist;
barrier();
if (unlikely(slab != c->slab)) {
- __slab_free(s, slab, head, tail_obj, cnt, addr);
+ __slab_free(s, slab, head, tail, cnt, addr);
return;
}
if (USE_LOCKLESS_FAST_PATH()) {
freelist = READ_ONCE(c->freelist);
- set_freepointer(s, tail_obj, freelist);
+ set_freepointer(s, tail, freelist);
if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
note_cmpxchg_failure("slab_free", s, tid);
tid = c->tid;
freelist = c->freelist;
- set_freepointer(s, tail_obj, freelist);
+ set_freepointer(s, tail, freelist);
c->freelist = head;
c->tid = next_tid(tid);
local_unlock(&s->cpu_slab->lock);
}
- stat(s, FREE_FASTPATH);
+ stat_add(s, FREE_FASTPATH, cnt);
}
#else /* CONFIG_SLUB_TINY */
static void do_slab_free(struct kmem_cache *s,
struct slab *slab, void *head, void *tail,
int cnt, unsigned long addr)
{
- void *tail_obj = tail ? : head;
-
- __slab_free(s, slab, head, tail_obj, cnt, addr);
+ __slab_free(s, slab, head, tail, cnt, addr);
}
#endif /* CONFIG_SLUB_TINY */
-static __fastpath_inline void slab_free(struct kmem_cache *s, struct slab *slab,
- void *head, void *tail, void **p, int cnt,
- unsigned long addr)
+static __fastpath_inline
+void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
+ unsigned long addr)
+{
+ memcg_slab_free_hook(s, slab, &object, 1);
+
+ if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
+ do_slab_free(s, slab, object, object, 1, addr);
+}
+
+static __fastpath_inline
+void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
+ void *tail, void **p, int cnt, unsigned long addr)
{
memcg_slab_free_hook(s, slab, p, cnt);
/*
* With KASAN enabled slab_free_freelist_hook modifies the freelist
* to remove objects, whose reuse must be delayed.
*/
- if (slab_free_freelist_hook(s, &head, &tail, &cnt))
+ if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt)))
do_slab_free(s, slab, head, tail, cnt, addr);
}
#ifdef CONFIG_KASAN_GENERIC
void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr)
{
- do_slab_free(cache, virt_to_slab(x), x, NULL, 1, addr);
+ do_slab_free(cache, virt_to_slab(x), x, x, 1, addr);
}
#endif
-void __kmem_cache_free(struct kmem_cache *s, void *x, unsigned long caller)
+static inline struct kmem_cache *virt_to_cache(const void *obj)
{
- slab_free(s, virt_to_slab(x), x, NULL, &x, 1, caller);
+ struct slab *slab;
+
+ slab = virt_to_slab(obj);
+ if (WARN_ONCE(!slab, "%s: Object is not a Slab page!\n", __func__))
+ return NULL;
+ return slab->slab_cache;
}
+static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
+{
+ struct kmem_cache *cachep;
+
+ if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
+ !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
+ return s;
+
+ cachep = virt_to_cache(x);
+ if (WARN(cachep && cachep != s,
+ "%s: Wrong slab cache. %s but object is from %s\n",
+ __func__, s->name, cachep->name))
+ print_tracking(cachep, x);
+ return cachep;
+}
+
+/**
+ * kmem_cache_free - Deallocate an object
+ * @s: The cache the allocation was from.
+ * @x: The previously allocated object.
+ *
+ * Free an object which was previously allocated from this
+ * cache.
+ */
void kmem_cache_free(struct kmem_cache *s, void *x)
{
s = cache_from_obj(s, x);
if (!s)
return;
trace_kmem_cache_free(_RET_IP_, x, s);
- slab_free(s, virt_to_slab(x), x, NULL, &x, 1, _RET_IP_);
+ slab_free(s, virt_to_slab(x), x, _RET_IP_);
}
EXPORT_SYMBOL(kmem_cache_free);
- mod_lruvec_page_state(folio_page(folio, 0), NR_SLAB_UNRECLAIMABLE_B,
+static void free_large_kmalloc(struct folio *folio, void *object)
+{
+ unsigned int order = folio_order(folio);
+
+ if (WARN_ON_ONCE(order == 0))
+ pr_warn_once("object pointer: 0x%p\n", object);
+
+ kmemleak_free(object);
+ kasan_kfree_large(object);
+ kmsan_kfree_large(object);
+
- __free_pages(folio_page(folio, 0), order);
++ lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
+ -(PAGE_SIZE << order));
++ folio_put(folio);
+}
+
+/**
+ * kfree - free previously allocated memory
+ * @object: pointer returned by kmalloc() or kmem_cache_alloc()
+ *
+ * If @object is NULL, no operation is performed.
+ */
+void kfree(const void *object)
+{
+ struct folio *folio;
+ struct slab *slab;
+ struct kmem_cache *s;
+ void *x = (void *)object;
+
+ trace_kfree(_RET_IP_, object);
+
+ if (unlikely(ZERO_OR_NULL_PTR(object)))
+ return;
+
+ folio = virt_to_folio(object);
+ if (unlikely(!folio_test_slab(folio))) {
+ free_large_kmalloc(folio, (void *)object);
+ return;
+ }
+
+ slab = folio_slab(folio);
+ s = slab->slab_cache;
+ slab_free(s, slab, x, _RET_IP_);
+}
+EXPORT_SYMBOL(kfree);
+
struct detached_freelist {
struct slab *slab;
void *tail;
return same;
}
+/*
+ * Internal bulk free of objects that were not initialised by the post alloc
+ * hooks and thus should not be processed by the free hooks
+ */
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+ if (!size)
+ return;
+
+ do {
+ struct detached_freelist df;
+
+ size = build_detached_freelist(s, size, p, &df);
+ if (!df.slab)
+ continue;
+
+ do_slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt,
+ _RET_IP_);
+ } while (likely(size));
+}
+
/* Note that interrupts must be enabled when calling this function. */
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
{
if (!df.slab)
continue;
- slab_free(df.s, df.slab, df.freelist, df.tail, &p[size], df.cnt,
- _RET_IP_);
+ slab_free_bulk(df.s, df.slab, df.freelist, df.tail, &p[size],
+ df.cnt, _RET_IP_);
} while (likely(size));
}
EXPORT_SYMBOL(kmem_cache_free_bulk);
#ifndef CONFIG_SLUB_TINY
-static inline int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
- size_t size, void **p, struct obj_cgroup *objcg)
+static inline
+int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+ void **p)
{
struct kmem_cache_cpu *c;
unsigned long irqflags;
c->freelist = get_freepointer(s, object);
p[i] = object;
maybe_wipe_obj_freeptr(s, p[i]);
+ stat(s, ALLOC_FASTPATH);
}
c->tid = next_tid(c->tid);
local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
error:
slub_put_cpu_ptr(s->cpu_slab);
- slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
- kmem_cache_free_bulk(s, i, p);
+ __kmem_cache_free_bulk(s, i, p);
return 0;
}
#else /* CONFIG_SLUB_TINY */
static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
- size_t size, void **p, struct obj_cgroup *objcg)
+ size_t size, void **p)
{
int i;
return i;
error:
- slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
- kmem_cache_free_bulk(s, i, p);
+ __kmem_cache_free_bulk(s, i, p);
return 0;
}
#endif /* CONFIG_SLUB_TINY */
if (unlikely(!s))
return 0;
- i = __kmem_cache_alloc_bulk(s, flags, size, p, objcg);
+ i = __kmem_cache_alloc_bulk(s, flags, size, p);
/*
* memcg and kmem_cache debug support and memory initialization.
* Done outside of the IRQ disabled fastpath loop.
*/
- if (i != 0)
+ if (likely(i != 0)) {
slab_post_alloc_hook(s, objcg, flags, size, p,
slab_want_init_on_alloc(flags, s), s->object_size);
+ } else {
+ memcg_slab_alloc_error_hook(s, size, objcg);
+ }
+
return i;
}
EXPORT_SYMBOL(kmem_cache_alloc_bulk);
* Doh this slab cannot be placed using slub_max_order.
*/
order = get_order(size);
- if (order <= MAX_ORDER)
+ if (order <= MAX_PAGE_ORDER)
return order;
return -ENOSYS;
}
static int __init setup_slub_max_order(char *str)
{
get_option(&str, (int *)&slub_max_order);
- slub_max_order = min_t(unsigned int, slub_max_order, MAX_ORDER);
+ slub_max_order = min_t(unsigned int, slub_max_order, MAX_PAGE_ORDER);
if (slub_min_order > slub_max_order)
slub_min_order = slub_max_order;
if (free == slab->objects) {
list_move(&slab->slab_list, &discard);
+ slab_clear_node_partial(slab);
n->nr_partial--;
dec_slabs_node(s, node, slab->objects);
} else if (free <= SHRINK_PROMOTE_MAX)
{
BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
- BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
- PGSCAN_DIRECT - PGSCAN_KSWAPD);
BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
+ BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
+ PGSCAN_DIRECT - PGSCAN_KSWAPD);
BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
&nr_succeeded);
- __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
+ mod_node_page_state(pgdat, PGDEMOTE_KSWAPD + reclaimer_offset(),
+ nr_succeeded);
return nr_succeeded;
}
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(sc->target_mem_cgroup);
/*
* Determine the scan balance between anon and file LRUs.
key[1] = hash >> BLOOM_FILTER_SHIFT;
}
- static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+ static bool test_bloom_filter(struct lru_gen_mm_state *mm_state, unsigned long seq,
+ void *item)
{
int key[2];
unsigned long *filter;
int gen = filter_gen_from_seq(seq);
- filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ filter = READ_ONCE(mm_state->filters[gen]);
if (!filter)
return true;
return test_bit(key[0], filter) && test_bit(key[1], filter);
}
- static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+ static void update_bloom_filter(struct lru_gen_mm_state *mm_state, unsigned long seq,
+ void *item)
{
int key[2];
unsigned long *filter;
int gen = filter_gen_from_seq(seq);
- filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+ filter = READ_ONCE(mm_state->filters[gen]);
if (!filter)
return;
set_bit(key[1], filter);
}
- static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+ static void reset_bloom_filter(struct lru_gen_mm_state *mm_state, unsigned long seq)
{
unsigned long *filter;
int gen = filter_gen_from_seq(seq);
- filter = lruvec->mm_state.filters[gen];
+ filter = mm_state->filters[gen];
if (filter) {
bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
return;
filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT),
__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
- WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+ WRITE_ONCE(mm_state->filters[gen], filter);
}
/******************************************************************************
* mm_struct list
******************************************************************************/
+ #ifdef CONFIG_LRU_GEN_WALKS_MMU
+
static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
{
static struct lru_gen_mm_list mm_list = {
return &mm_list;
}
+ static struct lru_gen_mm_state *get_mm_state(struct lruvec *lruvec)
+ {
+ return &lruvec->mm_state;
+ }
+
+ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
+ {
+ int key;
+ struct mm_struct *mm;
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+
+ mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+ key = pgdat->node_id % BITS_PER_TYPE(mm->lru_gen.bitmap);
+
+ if (!walk->force_scan && !test_bit(key, &mm->lru_gen.bitmap))
+ return NULL;
+
+ clear_bit(key, &mm->lru_gen.bitmap);
+
+ return mmget_not_zero(mm) ? mm : NULL;
+ }
+
void lru_gen_add_mm(struct mm_struct *mm)
{
int nid;
for_each_node_state(nid, N_MEMORY) {
struct lruvec *lruvec = get_lruvec(memcg, nid);
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
/* the first addition since the last iteration */
- if (lruvec->mm_state.tail == &mm_list->fifo)
- lruvec->mm_state.tail = &mm->lru_gen.list;
+ if (mm_state->tail == &mm_list->fifo)
+ mm_state->tail = &mm->lru_gen.list;
}
list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
for_each_node(nid) {
struct lruvec *lruvec = get_lruvec(memcg, nid);
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
/* where the current iteration continues after */
- if (lruvec->mm_state.head == &mm->lru_gen.list)
- lruvec->mm_state.head = lruvec->mm_state.head->prev;
+ if (mm_state->head == &mm->lru_gen.list)
+ mm_state->head = mm_state->head->prev;
/* where the last iteration ended before */
- if (lruvec->mm_state.tail == &mm->lru_gen.list)
- lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+ if (mm_state->tail == &mm->lru_gen.list)
+ mm_state->tail = mm_state->tail->next;
}
list_del_init(&mm->lru_gen.list);
}
#endif
+ #else /* !CONFIG_LRU_GEN_WALKS_MMU */
+
+ static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+ {
+ return NULL;
+ }
+
+ static struct lru_gen_mm_state *get_mm_state(struct lruvec *lruvec)
+ {
+ return NULL;
+ }
+
+ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
+ {
+ return NULL;
+ }
+
+ #endif
+
static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
{
int i;
int hist;
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
hist = lru_hist_from_seq(walk->max_seq);
for (i = 0; i < NR_MM_STATS; i++) {
- WRITE_ONCE(lruvec->mm_state.stats[hist][i],
- lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+ WRITE_ONCE(mm_state->stats[hist][i],
+ mm_state->stats[hist][i] + walk->mm_stats[i]);
walk->mm_stats[i] = 0;
}
}
if (NR_HIST_GENS > 1 && last) {
- hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+ hist = lru_hist_from_seq(mm_state->seq + 1);
for (i = 0; i < NR_MM_STATS; i++)
- WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+ WRITE_ONCE(mm_state->stats[hist][i], 0);
}
}
- static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
- {
- int type;
- unsigned long size = 0;
- struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
- int key = pgdat->node_id % BITS_PER_TYPE(mm->lru_gen.bitmap);
-
- if (!walk->force_scan && !test_bit(key, &mm->lru_gen.bitmap))
- return true;
-
- clear_bit(key, &mm->lru_gen.bitmap);
-
- for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
- size += type ? get_mm_counter(mm, MM_FILEPAGES) :
- get_mm_counter(mm, MM_ANONPAGES) +
- get_mm_counter(mm, MM_SHMEMPAGES);
- }
-
- if (size < MIN_LRU_BATCH)
- return true;
-
- return !mmget_not_zero(mm);
- }
-
static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
struct mm_struct **iter)
{
struct mm_struct *mm = NULL;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
- struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
/*
* mm_state->seq is incremented after each iteration of mm_list. There
mm_state->tail = mm_state->head->next;
walk->force_scan = true;
}
-
- mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
- if (should_skip_mm(mm, walk))
- mm = NULL;
- } while (!mm);
+ } while (!(mm = get_next_mm(walk)));
done:
if (*iter || last)
reset_mm_stats(lruvec, walk, last);
spin_unlock(&mm_list->lock);
if (mm && first)
- reset_bloom_filter(lruvec, walk->max_seq + 1);
+ reset_bloom_filter(mm_state, walk->max_seq + 1);
if (*iter)
mmput_async(*iter);
bool success = false;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
- struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
spin_lock(&mm_list->lock);
return pfn;
}
- #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
{
unsigned long pfn = pmd_pfn(pmd);
return pfn;
}
- #endif
static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
struct pglist_data *pgdat, bool can_swap)
return suitable_to_scan(total, young);
}
- #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
{
done:
*first = -1;
}
- #else
- static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
- struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
- {
- }
- #endif
static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
struct mm_walk *args)
DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
unsigned long first = -1;
struct lru_gen_mm_walk *walk = args->private;
+ struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
VM_WARN_ON_ONCE(pud_leaf(*pud));
continue;
}
- #ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (pmd_trans_huge(val)) {
unsigned long pfn = pmd_pfn(val);
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
continue;
}
- #endif
+
walk->mm_stats[MM_NONLEAF_TOTAL]++;
if (should_clear_pmd_young()) {
walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
}
- if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i))
+ if (!walk->force_scan && !test_bloom_filter(mm_state, walk->max_seq, pmd + i))
continue;
walk->mm_stats[MM_NONLEAF_FOUND]++;
walk->mm_stats[MM_NONLEAF_ADDED]++;
/* carry over to the next generation */
- update_bloom_filter(walk->lruvec, walk->max_seq + 1, pmd + i);
+ update_bloom_filter(mm_state, walk->max_seq + 1, pmd + i);
}
walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
return success;
}
- static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
+ static bool inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+ bool can_swap, bool force_scan)
{
+ bool success;
int prev, next;
int type, zone;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
restart:
+ if (max_seq < READ_ONCE(lrugen->max_seq))
+ return false;
+
spin_lock_irq(&lruvec->lru_lock);
VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
+ success = max_seq == lrugen->max_seq;
+ if (!success)
+ goto unlock;
+
for (type = ANON_AND_FILE - 1; type >= 0; type--) {
if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
continue;
WRITE_ONCE(lrugen->timestamps[next], jiffies);
/* make sure preceding modifications appear */
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-
+ unlock:
spin_unlock_irq(&lruvec->lru_lock);
+
+ return success;
}
static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
struct lru_gen_mm_walk *walk;
struct mm_struct *mm = NULL;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
+ if (!mm_state)
+ return inc_max_seq(lruvec, max_seq, can_swap, force_scan);
+
/* see the comment in iterate_mm_list() */
- if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
- success = false;
- goto done;
- }
+ if (max_seq <= READ_ONCE(mm_state->seq))
+ return false;
/*
* If the hardware doesn't automatically set the accessed bit, fallback
walk_mm(lruvec, mm, walk);
} while (mm);
done:
- if (success)
- inc_max_seq(lruvec, can_swap, force_scan);
+ if (success) {
+ success = inc_max_seq(lruvec, max_seq, can_swap, force_scan);
+ WARN_ON_ONCE(!success);
+ }
return success;
}
int young = 0;
pte_t *pte = pvmw->pte;
unsigned long addr = pvmw->address;
+ struct vm_area_struct *vma = pvmw->vma;
struct folio *folio = pfn_folio(pvmw->pfn);
bool can_swap = !folio_is_file_lru(folio);
struct mem_cgroup *memcg = folio_memcg(folio);
struct pglist_data *pgdat = folio_pgdat(folio);
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
DEFINE_MAX_SEQ(lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
if (spin_is_contended(pvmw->ptl))
return;
+ /* exclude special VMAs containing anon pages from COW */
+ if (vma->vm_flags & VM_SPECIAL)
+ return;
+
/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
- start = max(addr & PMD_MASK, pvmw->vma->vm_start);
- end = min(addr | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;
+ start = max(addr & PMD_MASK, vma->vm_start);
+ end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
unsigned long pfn;
pte_t ptent = ptep_get(pte + i);
- pfn = get_pte_pfn(ptent, pvmw->vma, addr);
+ pfn = get_pte_pfn(ptent, vma, addr);
if (pfn == -1)
continue;
if (!folio)
continue;
- if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+ if (!ptep_test_and_clear_young(vma, addr, pte + i))
VM_WARN_ON_ONCE(true);
young++;
mem_cgroup_unlock_pages();
/* feedback from rmap walkers to page table walkers */
- if (suitable_to_scan(i, young))
- update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+ if (mm_state && suitable_to_scan(i, young))
+ update_bloom_filter(mm_state, max_seq, pvmw->pmd);
}
/******************************************************************************
MEMCG_LRU_YOUNG,
};
- #ifdef CONFIG_MEMCG
-
- static int lru_gen_memcg_seg(struct lruvec *lruvec)
- {
- return READ_ONCE(lruvec->lrugen.seg);
- }
-
static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
{
int seg;
spin_unlock_irqrestore(&pgdat->memcg_lru.lock, flags);
}
+ #ifdef CONFIG_MEMCG
+
void lru_gen_online_memcg(struct mem_cgroup *memcg)
{
int gen;
struct lruvec *lruvec = get_lruvec(memcg, nid);
/* see the comment on MEMCG_NR_GENS */
- if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_HEAD)
+ if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_HEAD)
lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
}
- #else /* !CONFIG_MEMCG */
-
- static int lru_gen_memcg_seg(struct lruvec *lruvec)
- {
- return 0;
- }
-
- #endif
+ #endif /* CONFIG_MEMCG */
/******************************************************************************
* the eviction
if (mem_cgroup_below_low(NULL, memcg)) {
/* see the comment on MEMCG_NR_GENS */
- if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL)
+ if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
return MEMCG_LRU_TAIL;
memcg_memory_event(memcg, MEMCG_LOW);
return 0;
/* one retry if offlined or too small */
- return lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL ?
+ return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}
- #ifdef CONFIG_MEMCG
-
static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
{
int op;
blk_finish_plug(&plug);
}
- #else /* !CONFIG_MEMCG */
-
- static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
- {
- BUILD_BUG();
- }
-
- static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
- {
- BUILD_BUG();
- }
-
- #endif
-
static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc)
{
int priority;
int type, tier;
int hist = lru_hist_from_seq(seq);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
for (tier = 0; tier < MAX_NR_TIERS; tier++) {
seq_printf(m, " %10d", tier);
seq_putc(m, '\n');
}
+ if (!mm_state)
+ return;
+
seq_puts(m, " ");
for (i = 0; i < NR_MM_STATS; i++) {
const char *s = " ";
if (seq == max_seq && NR_HIST_GENS == 1) {
s = "LOYNFA";
- n = READ_ONCE(lruvec->mm_state.stats[hist][i]);
+ n = READ_ONCE(mm_state->stats[hist][i]);
} else if (seq != max_seq && NR_HIST_GENS > 1) {
s = "loynfa";
- n = READ_ONCE(lruvec->mm_state.stats[hist][i]);
+ n = READ_ONCE(mm_state->stats[hist][i]);
}
seq_printf(m, " %10lu%c", n, s[i]);
* initialization
******************************************************************************/
+ void lru_gen_init_pgdat(struct pglist_data *pgdat)
+ {
+ int i, j;
+
+ spin_lock_init(&pgdat->memcg_lru.lock);
+
+ for (i = 0; i < MEMCG_NR_GENS; i++) {
+ for (j = 0; j < MEMCG_NR_BINS; j++)
+ INIT_HLIST_NULLS_HEAD(&pgdat->memcg_lru.fifo[i][j], i);
+ }
+ }
+
void lru_gen_init_lruvec(struct lruvec *lruvec)
{
int i;
int gen, type, zone;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
lrugen->max_seq = MIN_NR_GENS + 1;
lrugen->enabled = lru_gen_enabled();
for_each_gen_type_zone(gen, type, zone)
INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]);
- lruvec->mm_state.seq = MIN_NR_GENS;
+ if (mm_state)
+ mm_state->seq = MIN_NR_GENS;
}
#ifdef CONFIG_MEMCG
- void lru_gen_init_pgdat(struct pglist_data *pgdat)
+ void lru_gen_init_memcg(struct mem_cgroup *memcg)
{
- int i, j;
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
- spin_lock_init(&pgdat->memcg_lru.lock);
+ if (!mm_list)
+ return;
- for (i = 0; i < MEMCG_NR_GENS; i++) {
- for (j = 0; j < MEMCG_NR_BINS; j++)
- INIT_HLIST_NULLS_HEAD(&pgdat->memcg_lru.fifo[i][j], i);
- }
- }
-
- void lru_gen_init_memcg(struct mem_cgroup *memcg)
- {
- INIT_LIST_HEAD(&memcg->mm_list.fifo);
- spin_lock_init(&memcg->mm_list.lock);
+ INIT_LIST_HEAD(&mm_list->fifo);
+ spin_lock_init(&mm_list->lock);
}
void lru_gen_exit_memcg(struct mem_cgroup *memcg)
{
int i;
int nid;
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
- VM_WARN_ON_ONCE(!list_empty(&memcg->mm_list.fifo));
+ VM_WARN_ON_ONCE(mm_list && !list_empty(&mm_list->fifo));
for_each_node(nid) {
struct lruvec *lruvec = get_lruvec(memcg, nid);
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
sizeof(lruvec->lrugen.nr_pages)));
lruvec->lrugen.list.next = LIST_POISON1;
+ if (!mm_state)
+ continue;
+
for (i = 0; i < NR_BLOOM_FILTERS; i++) {
- bitmap_free(lruvec->mm_state.filters[i]);
- lruvec->mm_state.filters[i] = NULL;
+ bitmap_free(mm_state->filters[i]);
+ mm_state->filters[i] = NULL;
}
}
}
static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
{
+ BUILD_BUG();
}
static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
+ BUILD_BUG();
}
static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc)
{
+ BUILD_BUG();
}
#endif /* CONFIG_LRU_GEN */
* scan_control uses s8 fields for order, priority, and reclaim_idx.
* Confirm they are large enough for max values.
*/
- BUILD_BUG_ON(MAX_ORDER >= S8_MAX);
+ BUILD_BUG_ON(MAX_PAGE_ORDER >= S8_MAX);
BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
}
skb = nc->skb_cache[--nc->skb_count];
- kasan_unpoison_object_data(skbuff_cache, skb);
+ kasan_mempool_unpoison_object(skb, kmem_cache_size(skbuff_cache));
return skb;
}
struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
u32 i;
- kasan_poison_object_data(skbuff_cache, skb);
+ if (!kasan_mempool_poison_object(skb))
+ return;
+
nc->skb_cache[nc->skb_count++] = skb;
if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
- kasan_unpoison_object_data(skbuff_cache,
- nc->skb_cache[i]);
+ kasan_mempool_unpoison_object(nc->skb_cache[i],
+ kmem_cache_size(skbuff_cache));
kmem_cache_free_bulk(skbuff_cache, NAPI_SKB_CACHE_HALF,
nc->skb_cache + NAPI_SKB_CACHE_HALF);
/* GSO partial only requires that we trim off any excess that
* doesn't fit into an MSS sized block, so take care of that
* now.
+ * Cap len to not accidentally hit GSO_BY_FRAGS.
*/
- partial_segs = len / mss;
+ partial_segs = min(len, GSO_BY_FRAGS - 1) / mss;
if (partial_segs > 1)
mss *= partial_segs;
else
static void skb_extensions_init(void)
{
BUILD_BUG_ON(SKB_EXT_NUM >= 8);
+#if !IS_ENABLED(CONFIG_KCOV_INSTRUMENT_ALL)
BUILD_BUG_ON(skb_ext_total_length() > 255);
+#endif
skbuff_ext_cache = kmem_cache_create("skbuff_ext_cache",
SKB_EXT_ALIGN_VALUE * skb_ext_total_length(),