No memory is allocated for such pages.
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
+ huge_pages_since the number of incompressible pages since zram set up
================ =============================================================
File /sys/block/zram<id>/bd_stat
With the command, zram writeback idle pages from memory to the storage.
+ If admin want to write a specific page in zram device to backing device,
+ they could write a page index into the interface.
+
+ echo "page_index=1251" > /sys/block/zramX/writeback
+
If there are lots of write IO with flash device, potentially, it has
flash wearout problem so that admin needs to design write limitation
to guarantee storage health for entire product life.
/sys/block/zram0/writeback_limit.
$ echo 1 > /sys/block/zram0/writeback_limit_enable
-If admins want to allow further write again once the bugdet is exhausted,
+If admins want to allow further write again once the budget is exhausted,
he could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
-to compaction, which would block the task from becomming active until the fault
+to compaction, which would block the task from becoming active until the fault
is resolved.
unprivileged_userfaultfd
========================
- This flag controls whether unprivileged users can use the userfaultfd
- system calls. Set this to 1 to allow unprivileged users to use the
- userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
- privileged users (with SYS_CAP_PTRACE capability).
+ This flag controls the mode in which unprivileged users can use the
+ userfaultfd system calls. Set this to 0 to restrict unprivileged users
+ to handle page faults in user mode only. In this case, users without
+ SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
+ succeed. Prohibiting use of userfaultfd for handling faults from kernel
+ mode may make certain vulnerabilities more difficult to exploit.
- The default value is 1.
+ Set this to 1 to allow unprivileged users to use the userfaultfd system
+ calls without any restrictions.
+
+ The default value is 0.
user_reserve_kbytes
This option significantly enlarges kernel but it gives x1.1-x2 performance
boost over outline instrumented kernel.
- Generic KASAN prints up to 2 call_rcu() call stacks in reports, the last one
- and the second to last.
+ Generic KASAN also reports the last 2 call stacks to creation of work that
+ potentially has access to an object. Call stacks for the following are shown:
+ call_rcu() and workqueue queuing.
Software tag-based KASAN
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~
With ``CONFIG_KUNIT`` built-in, ``CONFIG_KASAN_KUNIT_TEST`` can be built-in
-on any architecure that supports KASAN. These and any other KUnit
+on any architecture that supports KASAN. These and any other KUnit
tests enabled will run and print the results at boot as a late-init
call.
``CONFIG_KASAN`` built-in. The type of error expected and the
function being run is printed before the expression expected to give
an error. Then the error is printed, if found, and that test
-should be interpretted to pass only if the error was the one expected
+should be interpreted to pass only if the error was the one expected
by the test.
#
# Select if the architecture provides the arch_dma_set_uncached symbol to
- # either provide an uncached segement alias for a DMA allocation, or
+ # either provide an uncached segment alias for a DMA allocation, or
# to remap the page tables in place.
#
config ARCH_HAS_DMA_SET_UNCACHED
config HAVE_ASM_MODVERSIONS
bool
help
- This symbol should be selected by an architecure if it provides
+ This symbol should be selected by an architecture if it provides
<asm/asm-prototypes.h> to support the module versioning for symbols
exported from assembly code.
config HAVE_REGS_AND_STACK_ACCESS_API
bool
help
- This symbol should be selected by an architecure if it supports
+ This symbol should be selected by an architecture if it supports
the API needed to access registers and stack entries from pt_regs,
declared in asm/ptrace.h
For example the kprobes-based event tracer needs this API.
config HAVE_FUNCTION_ARG_ACCESS_API
bool
help
- This symbol should be selected by an architecure if it supports
+ This symbol should be selected by an architecture if it supports
the API needed to access function arguments from pt_regs,
declared in asm/ptrace.h
protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal
handling on irq exit still need to be protected.
+config HAVE_CONTEXT_TRACKING_OFFSTACK
+ bool
+ help
+ Architecture neither relies on exception_enter()/exception_exit()
+ nor on schedule_user(). Also preempt_schedule_notrace() and
+ preempt_schedule_irq() can't be called in a preemptible section
+ while context tracking is CONTEXT_USER. This feature reflects a sane
+ entry implementation where the following requirements are met on
+ critical entry code, ie: before user_exit() or after user_enter():
+
+ - Critical entry code isn't preemptible (or better yet:
+ not interruptible).
+ - No use of RCU read side critical sections, unless rcu_nmi_enter()
+ got called.
+ - No use of instrumentation, unless instrumentation_begin() got
+ called.
+
config HAVE_TIF_NOHZ
bool
help
Archs need to ensure they use a high enough resolution clock to
support irq time accounting and then call enable_sched_clock_irqtime().
+ config HAVE_MOVE_PUD
+ bool
+ help
+ Architectures that select this are able to move page tables at the
+ PUD level. If there are only 3 page table levels, the move effectively
+ happens at the PGD level.
+
config HAVE_MOVE_PMD
bool
help
by the linker, since the locations of such sections can change between linker
versions.
+ config HAVE_ARCH_PFN_VALID
+ bool
+
+ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
+ bool
+
source "kernel/gcov/Kconfig"
source "scripts/gcc-plugins/Kconfig"
config ARCH_DISCONTIGMEM_ENABLE
def_bool n
+ depends on BROKEN
config ARCH_FLATMEM_ENABLE
def_bool y
config HIGHMEM
bool "High Memory Support"
- select ARCH_DISCONTIGMEM_ENABLE
+ select HAVE_ARCH_PFN_VALID
+ select KMAP_LOCAL
help
With ARC 2G:2G address split, only upper 2G is directly addressable by
kernel. Enable this to potentially allow access to rest of 2G and PAE
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAVE_CUSTOM_GPIO_H
select ARCH_HAS_GCOV_PROFILE_ALL
- select ARCH_KEEP_MEMBLOCK if HAVE_ARCH_PFN_VALID || KEXEC
+ select ARCH_KEEP_MEMBLOCK
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_NO_SG_CHAIN if !ARM_HAS_SG_CHAIN
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
select HAVE_ARCH_MMAP_RND_BITS if MMU
+ select HAVE_ARCH_PFN_VALID
select HAVE_ARCH_SECCOMP
select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
config ARCH_OMAP1
bool "TI OMAP1"
depends on MMU
- select ARCH_HAS_HOLES_MEMORYMODEL
select ARCH_OMAP
select CLKDEV_LOOKUP
select CLKSRC_MMIO
UNPREDICTABLE (in fact it can be predicted that it won't work
at all). If in doubt say N.
- config ARCH_HAS_HOLES_MEMORYMODEL
- bool
-
config ARCH_SELECT_MEMORY_MODEL
bool
bool
select SPARSEMEM_STATIC if SPARSEMEM
- config HAVE_ARCH_PFN_VALID
- def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
-
config HIGHMEM
bool "High Memory Support"
depends on MMU
+ select KMAP_LOCAL
help
The address space of ARM processors is only 4 Gigabytes large
and it has to accommodate user address space, kernel address
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_SYM_ANNOTATIONS
+ select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_MEMORY_FAILURE
select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_ATOMIC_RMW
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select HAVE_PCI
select HAVE_ACPI_APEI if (ACPI && EFI)
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
+ select HAVE_ARCH_PFN_VALID
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
select HAVE_ARCH_STACKLEAK
select HAVE_NMI
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
+ select HAVE_PERF_EVENTS_NMI if ARM64_PSEUDO_NMI && HW_PERF_EVENTS
+ select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && HAVE_PERF_EVENTS_NMI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
select PCI_SYSCALL if PCI
select POWER_RESET
select POWER_SUPPLY
- select SET_FS
select SPARSE_IRQ
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
config KASAN_SHADOW_OFFSET
hex
depends on KASAN
- default 0xdfffa00000000000 if (ARM64_VA_BITS_48 || ARM64_VA_BITS_52) && !KASAN_SW_TAGS
- default 0xdfffd00000000000 if ARM64_VA_BITS_47 && !KASAN_SW_TAGS
- default 0xdffffe8000000000 if ARM64_VA_BITS_42 && !KASAN_SW_TAGS
- default 0xdfffffd000000000 if ARM64_VA_BITS_39 && !KASAN_SW_TAGS
- default 0xdffffffa00000000 if ARM64_VA_BITS_36 && !KASAN_SW_TAGS
- default 0xefff900000000000 if (ARM64_VA_BITS_48 || ARM64_VA_BITS_52) && KASAN_SW_TAGS
- default 0xefffc80000000000 if ARM64_VA_BITS_47 && KASAN_SW_TAGS
- default 0xeffffe4000000000 if ARM64_VA_BITS_42 && KASAN_SW_TAGS
- default 0xefffffc800000000 if ARM64_VA_BITS_39 && KASAN_SW_TAGS
- default 0xeffffff900000000 if ARM64_VA_BITS_36 && KASAN_SW_TAGS
+ default 0xdfff800000000000 if (ARM64_VA_BITS_48 || ARM64_VA_BITS_52) && !KASAN_SW_TAGS
+ default 0xdfffc00000000000 if ARM64_VA_BITS_47 && !KASAN_SW_TAGS
+ default 0xdffffe0000000000 if ARM64_VA_BITS_42 && !KASAN_SW_TAGS
+ default 0xdfffffc000000000 if ARM64_VA_BITS_39 && !KASAN_SW_TAGS
+ default 0xdffffff800000000 if ARM64_VA_BITS_36 && !KASAN_SW_TAGS
+ default 0xefff800000000000 if (ARM64_VA_BITS_48 || ARM64_VA_BITS_52) && KASAN_SW_TAGS
+ default 0xefffc00000000000 if ARM64_VA_BITS_47 && KASAN_SW_TAGS
+ default 0xeffffe0000000000 if ARM64_VA_BITS_42 && KASAN_SW_TAGS
+ default 0xefffffc000000000 if ARM64_VA_BITS_39 && KASAN_SW_TAGS
+ default 0xeffffff800000000 if ARM64_VA_BITS_36 && KASAN_SW_TAGS
default 0xffffffffffffffff
source "arch/arm64/Kconfig.platforms"
source "kernel/Kconfig.hz"
- config ARCH_SUPPORTS_DEBUG_PAGEALLOC
- def_bool y
-
config ARCH_SPARSEMEM_ENABLE
def_bool y
select SPARSEMEM_VMEMMAP_ENABLE
config ARCH_FLATMEM_ENABLE
def_bool !NUMA
- config HAVE_ARCH_PFN_VALID
- def_bool y
-
config HW_PERF_EVENTS
def_bool y
depends on ARM_PMU
The feature is detected at runtime, and will remain as a 'nop'
instruction if the cpu does not implement the feature.
+config AS_HAS_LDAPR
+ def_bool $(as-instr,.arch_extension rcpc)
+
config ARM64_LSE_ATOMICS
bool
default ARM64_USE_LSE_ATOMICS
menu "ARMv8.2 architectural features"
-config ARM64_UAO
- bool "Enable support for User Access Override (UAO)"
- default y
- help
- User Access Override (UAO; part of the ARMv8.2 Extensions)
- causes the 'unprivileged' variant of the load/store instructions to
- be overridden to be privileged.
-
- This option changes get_user() and friends to use the 'unprivileged'
- variant of the load/store instructions. This ensures that user-space
- really did have access to the supplied memory. When addr_limit is
- set to kernel memory the UAO bit will be set, allowing privileged
- access to kernel memory.
-
- Choosing this option will cause copy_to_user() et al to use user-space
- memory permissions.
-
- The feature is detected at runtime, the kernel will use the
- regular load/store instructions if the cpu does not implement the
- feature.
-
config ARM64_PMEM
bool "Enable support for persistent memory"
select ARCH_HAS_PMEM_API
entering them here. As a minimum, you should specify the the
root device (e.g. root=/dev/nfs).
+choice
+ prompt "Kernel command line type" if CMDLINE != ""
+ default CMDLINE_FROM_BOOTLOADER
+ help
+ Choose how the kernel will handle the provided default kernel
+ command line string.
+
+config CMDLINE_FROM_BOOTLOADER
+ bool "Use bootloader kernel arguments if available"
+ help
+ Uses the command-line options passed by the boot loader. If
+ the boot loader doesn't provide any, the default kernel command
+ string provided in CMDLINE will be used.
+
+config CMDLINE_EXTEND
+ bool "Extend bootloader kernel arguments"
+ help
+ The command-line arguments provided by the boot loader will be
+ appended to the default kernel command string.
+
config CMDLINE_FORCE
bool "Always use the default kernel command string"
- depends on CMDLINE != ""
help
Always use the default kernel command string, even if the boot
loader passes other arguments to the kernel.
This is useful if you cannot or don't want to change the
command-line options your boot loader passes to the kernel.
+endchoice
+
config EFI_STUB
bool
* and fixed mappings
*/
#define VMALLOC_START (MODULES_END)
-#define VMALLOC_END (- PUD_SIZE - VMEMMAP_SIZE - SZ_64K)
+#define VMALLOC_END (VMEMMAP_START - SZ_256M)
#define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
#define pmd_valid(pmd) pte_valid(pmd_pte(pmd))
+#define pmd_cont(pmd) pte_cont(pmd_pte(pmd))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
#define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
#define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
+ #define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
#define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
PMD_TYPE_SECT)
#define pmd_leaf(pmd) pmd_sect(pmd)
+#define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
+#define pte_leaf_size(pte) (pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+
#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
static inline bool pud_sect(pud_t pud) { return false; }
static inline bool pud_table(pud_t pud) { return true; }
extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
extern pgd_t idmap_pg_end[];
extern pgd_t tramp_pg_dir[PTRS_PER_PGD];
+extern pgd_t reserved_pg_dir[PTRS_PER_PGD];
extern void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd);
#include <linux/kexec.h>
#include <linux/crash_dump.h>
#include <linux/hugetlb.h>
+#include <linux/acpi_iort.h>
#include <asm/boot.h>
#include <asm/fixmap.h>
#include <asm/tlb.h>
#include <asm/alternative.h>
-#define ARM64_ZONE_DMA_BITS 30
-
/*
* We need to be able to catch inadvertent references to memstart_addr
* that occur (potentially in generic code) before arm64_memblock_init()
#endif /* CONFIG_CRASH_DUMP */
/*
- * Return the maximum physical address for a zone with a given address size
- * limit. It currently assumes that for memory starting above 4G, 32-bit
- * devices will use a DMA offset.
+ * Return the maximum physical address for a zone accessible by the given bits
+ * limit. If DRAM starts above 32-bit, expand the zone to the maximum
+ * available memory, otherwise cap it at 32-bit.
*/
static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
{
- phys_addr_t offset = memblock_start_of_DRAM() & GENMASK_ULL(63, zone_bits);
- return min(offset + (1ULL << zone_bits), memblock_end_of_DRAM());
+ phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
+ phys_addr_t phys_start = memblock_start_of_DRAM();
+
+ if (phys_start > U32_MAX)
+ zone_mask = PHYS_ADDR_MAX;
+ else if (phys_start > zone_mask)
+ zone_mask = U32_MAX;
+
+ return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
}
static void __init zone_sizes_init(unsigned long min, unsigned long max)
{
unsigned long max_zone_pfns[MAX_NR_ZONES] = {0};
+ unsigned int __maybe_unused acpi_zone_dma_bits;
+ unsigned int __maybe_unused dt_zone_dma_bits;
#ifdef CONFIG_ZONE_DMA
+ acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
+ dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
+ zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
+ arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
#endif
#ifdef CONFIG_ZONE_DMA32
void __init arm64_memblock_init(void)
{
- const s64 linear_region_size = BIT(vabits_actual - 1);
+ const s64 linear_region_size = PAGE_END - _PAGE_OFFSET(vabits_actual);
/* Handle linux,usable-memory-range property */
fdt_enforce_memory_region();
if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
extern u16 memstart_offset_seed;
- u64 range = linear_region_size -
- (memblock_end_of_DRAM() - memblock_start_of_DRAM());
+ u64 mmfr0 = read_cpuid(ID_AA64MMFR0_EL1);
+ int parange = cpuid_feature_extract_unsigned_field(
+ mmfr0, ID_AA64MMFR0_PARANGE_SHIFT);
+ s64 range = linear_region_size -
+ BIT(id_aa64mmfr0_parange_to_phys_shift(parange));
/*
* If the size of the linear region exceeds, by a sufficient
- * margin, the size of the region that the available physical
- * memory spans, randomize the linear region as well.
+ * margin, the size of the region that the physical memory can
+ * span, randomize the linear region as well.
*/
- if (memstart_offset_seed > 0 && range >= ARM64_MEMSTART_ALIGN) {
+ if (memstart_offset_seed > 0 && range >= (s64)ARM64_MEMSTART_ALIGN) {
range /= ARM64_MEMSTART_ALIGN;
memstart_addr -= ARM64_MEMSTART_ALIGN *
((range * memstart_offset_seed) >> 16);
* Register the kernel text, kernel data, initrd, and initial
* pagetables with memblock.
*/
- memblock_reserve(__pa_symbol(_text), _end - _text);
+ memblock_reserve(__pa_symbol(_stext), _end - _stext);
if (IS_ENABLED(CONFIG_BLK_DEV_INITRD) && phys_initrd_size) {
/* the generic initrd code expects virtual addresses */
initrd_start = __phys_to_virt(phys_initrd_start);
early_init_fdt_scan_reserved_mem();
- if (IS_ENABLED(CONFIG_ZONE_DMA)) {
- zone_dma_bits = ARM64_ZONE_DMA_BITS;
- arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
- }
-
if (IS_ENABLED(CONFIG_ZONE_DMA32))
arm64_dma32_phys_limit = max_zone_phys(32);
else
arm64_dma32_phys_limit = PHYS_MASK + 1;
- reserve_crashkernel();
-
reserve_elfcorehdr();
high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
sparse_init();
zone_sizes_init(min, max);
+ /*
+ * request_standard_resources() depends on crashkernel's memory being
+ * reserved, so do it here.
+ */
+ reserve_crashkernel();
+
memblock_dump_all();
}
- #ifndef CONFIG_SPARSEMEM_VMEMMAP
- static inline void free_memmap(unsigned long start_pfn, unsigned long end_pfn)
- {
- struct page *start_pg, *end_pg;
- unsigned long pg, pgend;
-
- /*
- * Convert start_pfn/end_pfn to a struct page pointer.
- */
- start_pg = pfn_to_page(start_pfn - 1) + 1;
- end_pg = pfn_to_page(end_pfn - 1) + 1;
-
- /*
- * Convert to physical addresses, and round start upwards and end
- * downwards.
- */
- pg = (unsigned long)PAGE_ALIGN(__pa(start_pg));
- pgend = (unsigned long)__pa(end_pg) & PAGE_MASK;
-
- /*
- * If there are free pages between these, free the section of the
- * memmap array.
- */
- if (pg < pgend)
- memblock_free(pg, pgend - pg);
- }
-
- /*
- * The mem_map array can get very big. Free the unused area of the memory map.
- */
- static void __init free_unused_memmap(void)
- {
- unsigned long start, end, prev_end = 0;
- int i;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) {
- #ifdef CONFIG_SPARSEMEM
- /*
- * Take care not to free memmap entries that don't exist due
- * to SPARSEMEM sections which aren't present.
- */
- start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
- #endif
- /*
- * If we had a previous bank, and there is a space between the
- * current bank and the previous, free it.
- */
- if (prev_end && prev_end < start)
- free_memmap(prev_end, start);
-
- /*
- * Align up here since the VM subsystem insists that the
- * memmap entries are valid from the bank end aligned to
- * MAX_ORDER_NR_PAGES.
- */
- prev_end = ALIGN(end, MAX_ORDER_NR_PAGES);
- }
-
- #ifdef CONFIG_SPARSEMEM
- if (!IS_ALIGNED(prev_end, PAGES_PER_SECTION))
- free_memmap(prev_end, ALIGN(prev_end, PAGES_PER_SECTION));
- #endif
- }
- #endif /* !CONFIG_SPARSEMEM_VMEMMAP */
-
/*
* mem_init() marks the free areas in the mem_map and tells us how much memory
* is free. This is done after various parts of the system have claimed their
set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);
- #ifndef CONFIG_SPARSEMEM_VMEMMAP
- free_unused_memmap();
- #endif
/* this will put all unused low memory onto the freelists */
memblock_free_all();
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_SUPPORTS_ATOMIC_RMW
+ select ARCH_SUPPORTS_DEBUG_PAGEALLOC if PPC32 || PPC_BOOK3S_64
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS
depends on PCI
depends on PPC64 # not supported on 32 bits yet
- config ARCH_SUPPORTS_DEBUG_PAGEALLOC
- depends on PPC32 || PPC_BOOK3S_64
- def_bool y
-
config ARCH_SUPPORTS_UPROBES
def_bool y
config HIGHMEM
bool "High memory support"
depends on PPC32
+ select KMAP_LOCAL
source "kernel/Kconfig.hz"
config PGSTE
def_bool y if KVM
- config ARCH_SUPPORTS_DEBUG_PAGEALLOC
- def_bool y
-
config AUDIT_ARCH
def_bool y
config KASAN_SHADOW_OFFSET
hex
depends on KASAN
- default 0x18000000000000 if KASAN_S390_4_LEVEL_PAGING
- default 0x30000000000
+ default 0x18000000000000
config S390
def_bool y
select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
select ARCH_STACKWALK
select ARCH_SUPPORTS_ATOMIC_RMW
+ select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
select PCI_DOMAINS if PCI
select PCI_MSI if PCI
select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
- select SET_FS
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
config PCI_NR_FUNCTIONS
int "Maximum number of PCI functions (1-4096)"
range 1 4096
- default "128"
+ default "512"
help
This allows you to specify the maximum number of PCI functions which
this kernel will support.
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
CONFIG_IDLE_PAGE_TRACKING=y
CONFIG_PERCPU_STATS=y
- CONFIG_GUP_BENCHMARK=y
+ CONFIG_GUP_TEST=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_PACKET_DIAG=m
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_BPF_KPROBE_OVERRIDE=y
CONFIG_HIST_TRIGGERS=y
+CONFIG_DEBUG_USER_ASCE=y
CONFIG_NOTIFIER_ERROR_INJECTION=m
CONFIG_NETDEV_NOTIFIER_ERROR_INJECT=m
CONFIG_FAULT_INJECTION=y
#include <asm/sections.h>
#include <asm/vdso.h>
#include <asm/facility.h>
+#include <asm/timex.h>
extern char vdso64_start, vdso64_end;
static void *vdso64_kbase = &vdso64_start;
static int vdso_mremap(const struct vm_special_mapping *sm,
struct vm_area_struct *vma)
{
- unsigned long vdso_pages;
-
- vdso_pages = vdso64_pages;
-
- if ((vdso_pages << PAGE_SHIFT) != vma->vm_end - vma->vm_start)
- return -EINVAL;
-
- if (WARN_ON_ONCE(current->mm != vma->vm_mm))
- return -EFAULT;
-
current->mm->context.vdso_base = vma->vm_start;
+
return 0;
}
u8 page[PAGE_SIZE];
} vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = (struct vdso_data *)&vdso_data_store.data;
-/*
- * Allocate/free per cpu vdso data.
- */
-#define SEGMENT_ORDER 2
-int vdso_alloc_per_cpu(struct lowcore *lowcore)
+void vdso_getcpu_init(void)
{
- unsigned long segment_table, page_table, page_frame;
- struct vdso_per_cpu_data *vd;
-
- segment_table = __get_free_pages(GFP_KERNEL, SEGMENT_ORDER);
- page_table = get_zeroed_page(GFP_KERNEL);
- page_frame = get_zeroed_page(GFP_KERNEL);
- if (!segment_table || !page_table || !page_frame)
- goto out;
- arch_set_page_dat(virt_to_page(segment_table), SEGMENT_ORDER);
- arch_set_page_dat(virt_to_page(page_table), 0);
-
- /* Initialize per-cpu vdso data page */
- vd = (struct vdso_per_cpu_data *) page_frame;
- vd->cpu_nr = lowcore->cpu_nr;
- vd->node_id = cpu_to_node(vd->cpu_nr);
-
- /* Set up page table for the vdso address space */
- memset64((u64 *)segment_table, _SEGMENT_ENTRY_EMPTY, _CRST_ENTRIES);
- memset64((u64 *)page_table, _PAGE_INVALID, PTRS_PER_PTE);
-
- *(unsigned long *) segment_table = _SEGMENT_ENTRY + page_table;
- *(unsigned long *) page_table = _PAGE_PROTECT + page_frame;
-
- lowcore->vdso_asce = segment_table +
- _ASCE_TABLE_LENGTH + _ASCE_USER_BITS + _ASCE_TYPE_SEGMENT;
- lowcore->vdso_per_cpu_data = page_frame;
-
- return 0;
-
-out:
- free_page(page_frame);
- free_page(page_table);
- free_pages(segment_table, SEGMENT_ORDER);
- return -ENOMEM;
-}
-
-void vdso_free_per_cpu(struct lowcore *lowcore)
-{
- unsigned long segment_table, page_table, page_frame;
-
- segment_table = lowcore->vdso_asce & PAGE_MASK;
- page_table = *(unsigned long *) segment_table;
- page_frame = *(unsigned long *) page_table;
-
- free_page(page_frame);
- free_page(page_table);
- free_pages(segment_table, SEGMENT_ORDER);
+ set_tod_programmable_field(smp_processor_id());
}
/*
{
int i;
+ vdso_getcpu_init();
/* Calculate the size of the 64 bit vDSO */
vdso64_pages = ((&vdso64_end - &vdso64_start
+ PAGE_SIZE - 1) >> PAGE_SHIFT) + 1;
}
vdso64_pagelist[vdso64_pages - 1] = virt_to_page(vdso_data);
vdso64_pagelist[vdso64_pages] = NULL;
- if (vdso_alloc_per_cpu(&S390_lowcore))
- BUG();
get_page(virt_to_page(vdso_data));
select HAVE_C_RECORDMCOUNT
select HAVE_ARCH_AUDITSYSCALL
select ARCH_SUPPORTS_ATOMIC_RMW
+ select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select HAVE_NMI
select HAVE_REGS_AND_STACK_ACCESS_API
select ARCH_USE_QUEUED_RWLOCKS
config HIGHMEM
bool
default y if SPARC32
+ select KMAP_LOCAL
config ZONE_DMA
bool
bool
default y if SPARC32
- config ARCH_SUPPORTS_DEBUG_PAGEALLOC
- def_bool y if SPARC64
-
config PGTABLE_LEVELS
default 4 if 64BIT
default 3
select ARCH_WANT_IPC_PARSE_VERSION
select CLKSRC_I8253
select CLONE_BACKWARDS
+ select GENERIC_VDSO_32
select HAVE_DEBUG_STACKOVERFLOW
+ select KMAP_LOCAL
select MODULES_USE_ELF_REL
select OLD_SIGACTION
- select GENERIC_VDSO_32
config X86_64
def_bool y
select ARCH_STACKWALK
select ARCH_SUPPORTS_ACPI
select ARCH_SUPPORTS_ATOMIC_RMW
+ select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
+ select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING if X86_64
+ select HAVE_CONTEXT_TRACKING_OFFSTACK if HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
config AUDIT_ARCH
def_bool y if X86_64
- config ARCH_SUPPORTS_DEBUG_PAGEALLOC
- def_bool y
-
config KASAN_SHADOW_OFFSET
hex
depends on KASAN
side channel attacks- equals the tsx=auto command line parameter.
endchoice
+config X86_SGX
+ bool "Software Guard eXtensions (SGX)"
+ depends on X86_64 && CPU_SUP_INTEL
+ depends on CRYPTO=y
+ depends on CRYPTO_SHA256=y
+ select SRCU
+ select MMU_NOTIFIER
+ help
+ Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
+ that can be used by applications to set aside private regions of code
+ and data, referred to as enclaves. An enclave's private memory can
+ only be accessed by code running within the enclave. Accesses from
+ outside the enclave, including other enclaves, are disallowed by
+ hardware.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
static int vdso_mremap(const struct vm_special_mapping *sm,
struct vm_area_struct *new_vma)
{
- unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
const struct vdso_image *image = current->mm->context.vdso_image;
- if (image->size != new_size)
- return -EINVAL;
-
vdso_fix_landing(image, new_vma);
current->mm->context.vdso = (void __user *)new_vma->vm_start;
return 0;
}
- static int vvar_mremap(const struct vm_special_mapping *sm,
- struct vm_area_struct *new_vma)
- {
- const struct vdso_image *image = new_vma->vm_mm->context.vdso_image;
- unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
-
- if (new_size != -image->sym_vvar_start)
- return -EINVAL;
-
- return 0;
- }
-
#ifdef CONFIG_TIME_NS
static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
{
static const struct vm_special_mapping vvar_mapping = {
.name = "[vvar]",
.fault = vvar_fault,
- .mremap = vvar_mremap,
};
/*
#ifdef CONFIG_COMPAT
int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
- int uses_interp)
+ int uses_interp, bool x32)
{
#ifdef CONFIG_X86_X32_ABI
- if (test_thread_flag(TIF_X32)) {
+ if (x32) {
if (!vdso64_enabled)
return 0;
return map_vdso_randomized(&vdso_image_x32);
}
#endif
+bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs)
+{
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ const struct vdso_image *image = current->mm->context.vdso_image;
+ unsigned long vdso = (unsigned long) current->mm->context.vdso;
+
+ if (in_ia32_syscall() && image == &vdso_image_32) {
+ if (regs->ip == vdso + image->sym_vdso32_sigreturn_landing_pad ||
+ regs->ip == vdso + image->sym_vdso32_rt_sigreturn_landing_pad)
+ return true;
+ }
+#endif
+ return false;
+}
+
#ifdef CONFIG_X86_64
static __init int vdso_setup(char *s)
{
* needed. It will also grab the relevant CRTC lock to make sure that the state
* is consistent.
*
+ * WARNING: Drivers may only add new CRTC states to a @state if
+ * drm_atomic_state.allow_modeset is set, or if it's a driver-internal commit
+ * not created by userspace through an IOCTL call.
+ *
* Returns:
*
* Either the allocated state or the error code encoded into the pointer. When
struct __drm_connnectors_state *c;
int alloc = max(index + 1, config->num_connector);
- c = krealloc(state->connectors, alloc * sizeof(*state->connectors), GFP_KERNEL);
+ c = krealloc_array(state->connectors, alloc,
+ sizeof(*state->connectors), GFP_KERNEL);
if (!c)
return ERR_PTR(-ENOMEM);
struct drm_crtc_state *new_crtc_state;
struct drm_connector *conn;
struct drm_connector_state *conn_state;
+ unsigned requested_crtc = 0;
+ unsigned affected_crtc = 0;
int i, ret = 0;
DRM_DEBUG_ATOMIC("checking %p\n", state);
+ for_each_new_crtc_in_state(state, crtc, new_crtc_state, i)
+ requested_crtc |= drm_crtc_mask(crtc);
+
for_each_oldnew_plane_in_state(state, plane, old_plane_state, new_plane_state, i) {
ret = drm_atomic_plane_check(old_plane_state, new_plane_state);
if (ret) {
}
}
+ for_each_new_crtc_in_state(state, crtc, new_crtc_state, i)
+ affected_crtc |= drm_crtc_mask(crtc);
+
+ /*
+ * For commits that allow modesets drivers can add other CRTCs to the
+ * atomic commit, e.g. when they need to reallocate global resources.
+ * This can cause spurious EBUSY, which robs compositors of a very
+ * effective sanity check for their drawing loop. Therefor only allow
+ * drivers to add unrelated CRTC states for modeset commits.
+ *
+ * FIXME: Should add affected_crtc mask to the ATOMIC IOCTL as an output
+ * so compositors know what's going on.
+ */
+ if (affected_crtc != requested_crtc) {
+ DRM_DEBUG_ATOMIC("driver added CRTC to commit: requested 0x%x, affected 0x%0x\n",
+ requested_crtc, affected_crtc);
+ WARN(!state->allow_modeset, "adding CRTC not allowed without modesets: requested 0x%x, affected 0x%0x\n",
+ requested_crtc, affected_crtc);
+ }
+
return 0;
}
EXPORT_SYMBOL(drm_atomic_check_only);
* to dmesg in case of error irq's. (Hint, you probably want to
* ratelimit this!)
*
- * The caller must drm_modeset_lock_all(), or if this is called
- * from error irq handler, it should not be enabled by default.
- * (Ie. if you are debugging errors you might not care that this
- * is racey. But calling this without all modeset locks held is
- * not inherently safe.)
+ * The caller must wrap this drm_modeset_lock_all_ctx() and
+ * drm_modeset_drop_locks(). If this is called from error irq handler, it should
+ * not be enabled by default - if you are debugging errors you might
+ * not care that this is racey, but calling this without all modeset locks held
+ * is inherently unsafe.
*/
void drm_state_dump(struct drm_device *dev, struct drm_printer *p)
{
#include <linux/mount.h>
#include <linux/pseudo_fs.h>
-#include <asm/kmap_types.h>
#include <linux/uaccess.h>
#include <linux/nospec.h>
}
}
- static int aio_ring_mremap(struct vm_area_struct *vma)
+ static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
{
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct kioctx_table *table;
int i, res = -EINVAL;
+ if (flags & MREMAP_DONTUNMAP)
+ return -EINVAL;
+
spin_lock(&mm->ioctx_lock);
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
#include <asm/cacheflush.h>
-#ifndef ARCH_HAS_FLUSH_ANON_PAGE
-static inline void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)
-{
-}
-#endif
-
-#ifndef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
-static inline void flush_kernel_dcache_page(struct page *page)
-{
-}
-static inline void flush_kernel_vmap_range(void *vaddr, int size)
-{
-}
-static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
-{
-}
-#endif
-
-#include <asm/kmap_types.h>
-
-#ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
-extern void kunmap_atomic_high(void *kvaddr);
-#include <asm/highmem.h>
-
-#ifndef ARCH_HAS_KMAP_FLUSH_TLB
-static inline void kmap_flush_tlb(unsigned long addr) { }
-#endif
-
-#ifndef kmap_prot
-#define kmap_prot PAGE_KERNEL
-#endif
+#include "highmem-internal.h"
-void *kmap_high(struct page *page);
-static inline void *kmap(struct page *page)
-{
- void *addr;
-
- might_sleep();
- if (!PageHighMem(page))
- addr = page_address(page);
- else
- addr = kmap_high(page);
- kmap_flush_tlb((unsigned long)addr);
- return addr;
-}
-
-void kunmap_high(struct page *page);
-
-static inline void kunmap(struct page *page)
-{
- might_sleep();
- if (!PageHighMem(page))
- return;
- kunmap_high(page);
-}
-
-/*
- * kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because
- * no global lock is needed and because the kmap code must perform a global TLB
- * invalidation when the kmap pool wraps.
- *
- * However when holding an atomic kmap it is not legal to sleep, so atomic
- * kmaps are appropriate for short, tight code paths only.
- *
- * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
- * gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need
- * it.
+/**
+ * kmap - Map a page for long term usage
+ * @page: Pointer to the page to be mapped
+ *
+ * Returns: The virtual address of the mapping
+ *
+ * Can only be invoked from preemptible task context because on 32bit
+ * systems with CONFIG_HIGHMEM enabled this function might sleep.
+ *
+ * For systems with CONFIG_HIGHMEM=n and for pages in the low memory area
+ * this returns the virtual address of the direct kernel mapping.
+ *
+ * The returned virtual address is globally visible and valid up to the
+ * point where it is unmapped via kunmap(). The pointer can be handed to
+ * other contexts.
+ *
+ * For highmem pages on 32bit systems this can be slow as the mapping space
+ * is limited and protected by a global lock. In case that there is no
+ * mapping slot available the function blocks until a slot is released via
+ * kunmap().
*/
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
-{
- preempt_disable();
- pagefault_disable();
- if (!PageHighMem(page))
- return page_address(page);
- return kmap_atomic_high_prot(page, prot);
-}
-#define kmap_atomic(page) kmap_atomic_prot(page, kmap_prot)
-
-/* declarations for linux/mm/highmem.c */
-unsigned int nr_free_highpages(void);
-extern atomic_long_t _totalhigh_pages;
-static inline unsigned long totalhigh_pages(void)
-{
- return (unsigned long)atomic_long_read(&_totalhigh_pages);
-}
-
-static inline void totalhigh_pages_inc(void)
-{
- atomic_long_inc(&_totalhigh_pages);
-}
+static inline void *kmap(struct page *page);
-static inline void totalhigh_pages_dec(void)
-{
- atomic_long_dec(&_totalhigh_pages);
-}
-
-static inline void totalhigh_pages_add(long count)
-{
- atomic_long_add(count, &_totalhigh_pages);
-}
-
-static inline void totalhigh_pages_set(long val)
-{
- atomic_long_set(&_totalhigh_pages, val);
-}
-
-void kmap_flush_unused(void);
-
-struct page *kmap_to_page(void *addr);
-
-#else /* CONFIG_HIGHMEM */
+/**
+ * kunmap - Unmap the virtual address mapped by kmap()
+ * @addr: Virtual address to be unmapped
+ *
+ * Counterpart to kmap(). A NOOP for CONFIG_HIGHMEM=n and for mappings of
+ * pages in the low memory area.
+ */
+static inline void kunmap(struct page *page);
-static inline unsigned int nr_free_highpages(void) { return 0; }
+/**
+ * kmap_to_page - Get the page for a kmap'ed address
+ * @addr: The address to look up
+ *
+ * Returns: The page which is mapped to @addr.
+ */
+static inline struct page *kmap_to_page(void *addr);
-static inline struct page *kmap_to_page(void *addr)
-{
- return virt_to_page(addr);
-}
+/**
+ * kmap_flush_unused - Flush all unused kmap mappings in order to
+ * remove stray mappings
+ */
+static inline void kmap_flush_unused(void);
-static inline unsigned long totalhigh_pages(void) { return 0UL; }
+/**
+ * kmap_local_page - Map a page for temporary usage
+ * @page: Pointer to the page to be mapped
+ *
+ * Returns: The virtual address of the mapping
+ *
+ * Can be invoked from any context.
+ *
+ * Requires careful handling when nesting multiple mappings because the map
+ * management is stack based. The unmap has to be in the reverse order of
+ * the map operation:
+ *
+ * addr1 = kmap_local_page(page1);
+ * addr2 = kmap_local_page(page2);
+ * ...
+ * kunmap_local(addr2);
+ * kunmap_local(addr1);
+ *
+ * Unmapping addr1 before addr2 is invalid and causes malfunction.
+ *
+ * Contrary to kmap() mappings the mapping is only valid in the context of
+ * the caller and cannot be handed to other contexts.
+ *
+ * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
+ * virtual address of the direct mapping. Only real highmem pages are
+ * temporarily mapped.
+ *
+ * While it is significantly faster than kmap() for the higmem case it
+ * comes with restrictions about the pointer validity. Only use when really
+ * necessary.
+ *
+ * On HIGHMEM enabled systems mapping a highmem page has the side effect of
+ * disabling migration in order to keep the virtual address stable across
+ * preemption. No caller of kmap_local_page() can rely on this side effect.
+ */
+static inline void *kmap_local_page(struct page *page);
-static inline void *kmap(struct page *page)
-{
- might_sleep();
- return page_address(page);
-}
+/**
+ * kmap_atomic - Atomically map a page for temporary usage - Deprecated!
+ * @page: Pointer to the page to be mapped
+ *
+ * Returns: The virtual address of the mapping
+ *
+ * Effectively a wrapper around kmap_local_page() which disables pagefaults
+ * and preemption.
+ *
+ * Do not use in new code. Use kmap_local_page() instead.
+ */
+static inline void *kmap_atomic(struct page *page);
-static inline void kunmap_high(struct page *page)
-{
-}
+/**
+ * kunmap_atomic - Unmap the virtual address mapped by kmap_atomic()
+ * @addr: Virtual address to be unmapped
+ *
+ * Counterpart to kmap_atomic().
+ *
+ * Effectively a wrapper around kunmap_local() which additionally undoes
+ * the side effects of kmap_atomic(), i.e. reenabling pagefaults and
+ * preemption.
+ */
-static inline void kunmap(struct page *page)
-{
-#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
- kunmap_flush_on_unmap(page_address(page));
-#endif
-}
+/* Highmem related interfaces for management code */
+static inline unsigned int nr_free_highpages(void);
+static inline unsigned long totalhigh_pages(void);
-static inline void *kmap_atomic(struct page *page)
+#ifndef ARCH_HAS_FLUSH_ANON_PAGE
+static inline void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)
{
- preempt_disable();
- pagefault_disable();
- return page_address(page);
}
-#define kmap_atomic_prot(page, prot) kmap_atomic(page)
-
-static inline void kunmap_atomic_high(void *addr)
-{
- /*
- * Mostly nothing to do in the CONFIG_HIGHMEM=n case as kunmap_atomic()
- * handles re-enabling faults + preemption
- */
-#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
- kunmap_flush_on_unmap(addr);
#endif
-}
-
-#define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn))
-
-#define kmap_flush_unused() do {} while(0)
-
-#endif /* CONFIG_HIGHMEM */
-
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
-
-DECLARE_PER_CPU(int, __kmap_atomic_idx);
-static inline int kmap_atomic_idx_push(void)
+#ifndef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
+static inline void flush_kernel_dcache_page(struct page *page)
{
- int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
- WARN_ON_ONCE(in_irq() && !irqs_disabled());
- BUG_ON(idx >= KM_TYPE_NR);
-#endif
- return idx;
}
-
-static inline int kmap_atomic_idx(void)
+static inline void flush_kernel_vmap_range(void *vaddr, int size)
{
- return __this_cpu_read(__kmap_atomic_idx) - 1;
}
-
-static inline void kmap_atomic_idx_pop(void)
+static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
{
-#ifdef CONFIG_DEBUG_HIGHMEM
- int idx = __this_cpu_dec_return(__kmap_atomic_idx);
-
- BUG_ON(idx < 0);
-#else
- __this_cpu_dec(__kmap_atomic_idx);
-#endif
}
-
#endif
-/*
- * Prevent people trying to call kunmap_atomic() as if it were kunmap()
- * kunmap_atomic() should get the return value of kmap_atomic, not the page.
- */
-#define kunmap_atomic(addr) \
-do { \
- BUILD_BUG_ON(__same_type((addr), struct page *)); \
- kunmap_atomic_high(addr); \
- pagefault_enable(); \
- preempt_enable(); \
-} while (0)
-
-
/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */
#ifndef clear_user_highpage
static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
kunmap_atomic(kaddr);
}
+ /*
+ * If we pass in a base or tail page, we can zero up to PAGE_SIZE.
+ * If we pass in a head page, we can zero up to the size of the compound page.
+ */
+ #if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+ void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+ unsigned start2, unsigned end2);
+ #else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
static inline void zero_user_segments(struct page *page,
- unsigned start1, unsigned end1,
- unsigned start2, unsigned end2)
+ unsigned start1, unsigned end1,
+ unsigned start2, unsigned end2)
{
void *kaddr = kmap_atomic(page);
+ unsigned int i;
- BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
+ BUG_ON(end1 > page_size(page) || end2 > page_size(page));
if (end1 > start1)
memset(kaddr + start1, 0, end1 - start1);
memset(kaddr + start2, 0, end2 - start2);
kunmap_atomic(kaddr);
- flush_dcache_page(page);
+ for (i = 0; i < compound_nr(page); i++)
+ flush_dcache_page(page + i);
}
+ #endif /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
static inline void zero_user_segment(struct page *page,
unsigned start, unsigned end)
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
- int (*split)(struct vm_area_struct * area, unsigned long addr);
- int (*mremap)(struct vm_area_struct * area);
+ /* Called any time before splitting to check if it's allowed */
+ int (*may_split)(struct vm_area_struct *area, unsigned long addr);
+ int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+ /*
+ * Called by mprotect() to make driver-specific permission
+ * checks before mprotect() is finalised. The VMA must not
+ * be modified. Returns 0 if eprotect() can proceed.
+ */
+ int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, unsigned long newflags);
vm_fault_t (*fault)(struct vm_fault *vmf);
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
enum page_entry_size pe_size);
void *buf, int len, unsigned int gup_flags);
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
- extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long addr, void *buf, int len, unsigned int gup_flags);
+ extern int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags);
long get_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
if (!ptlock_init(page))
return false;
__SetPageTable(page);
- inc_zone_page_state(page, NR_PAGETABLE);
+ inc_lruvec_page_state(page, NR_PAGETABLE);
return true;
}
{
ptlock_free(page);
__ClearPageTable(page);
- dec_zone_page_state(page, NR_PAGETABLE);
+ dec_lruvec_page_state(page, NR_PAGETABLE);
}
#define pte_offset_map_lock(mm, pmd, address, ptlp) \
if (!pmd_ptlock_init(page))
return false;
__SetPageTable(page);
- inc_zone_page_state(page, NR_PAGETABLE);
+ inc_lruvec_page_state(page, NR_PAGETABLE);
return true;
}
{
pmd_ptlock_free(page);
__ClearPageTable(page);
- dec_zone_page_state(page, NR_PAGETABLE);
+ dec_lruvec_page_state(page, NR_PAGETABLE);
}
/*
#else
/* please see mm/page_alloc.c */
extern int __meminit early_pfn_to_nid(unsigned long pfn);
- /* there is a per-arch backend function. */
- extern int __meminit __early_pfn_to_nid(unsigned long pfn,
- struct mminit_pfnnid_cache *state);
#endif
extern void set_dma_reserve(unsigned long new_dma_reserve);
unsigned long address, unsigned long size,
pte_fn_t fn, void *data);
+ extern void init_mem_debugging_and_hardening(void);
#ifdef CONFIG_PAGE_POISONING
- extern bool page_poisoning_enabled(void);
- extern void kernel_poison_pages(struct page *page, int numpages, int enable);
+ extern void __kernel_poison_pages(struct page *page, int numpages);
+ extern void __kernel_unpoison_pages(struct page *page, int numpages);
+ extern bool _page_poisoning_enabled_early;
+ DECLARE_STATIC_KEY_FALSE(_page_poisoning_enabled);
+ static inline bool page_poisoning_enabled(void)
+ {
+ return _page_poisoning_enabled_early;
+ }
+ /*
+ * For use in fast paths after init_mem_debugging() has run, or when a
+ * false negative result is not harmful when called too early.
+ */
+ static inline bool page_poisoning_enabled_static(void)
+ {
+ return static_branch_unlikely(&_page_poisoning_enabled);
+ }
+ static inline void kernel_poison_pages(struct page *page, int numpages)
+ {
+ if (page_poisoning_enabled_static())
+ __kernel_poison_pages(page, numpages);
+ }
+ static inline void kernel_unpoison_pages(struct page *page, int numpages)
+ {
+ if (page_poisoning_enabled_static())
+ __kernel_unpoison_pages(page, numpages);
+ }
#else
static inline bool page_poisoning_enabled(void) { return false; }
- static inline void kernel_poison_pages(struct page *page, int numpages,
- int enable) { }
+ static inline bool page_poisoning_enabled_static(void) { return false; }
+ static inline void __kernel_poison_pages(struct page *page, int nunmpages) { }
+ static inline void kernel_poison_pages(struct page *page, int numpages) { }
+ static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
#endif
- #ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON
- DECLARE_STATIC_KEY_TRUE(init_on_alloc);
- #else
DECLARE_STATIC_KEY_FALSE(init_on_alloc);
- #endif
static inline bool want_init_on_alloc(gfp_t flags)
{
- if (static_branch_unlikely(&init_on_alloc) &&
- !page_poisoning_enabled())
+ if (static_branch_unlikely(&init_on_alloc))
return true;
return flags & __GFP_ZERO;
}
- #ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON
- DECLARE_STATIC_KEY_TRUE(init_on_free);
- #else
DECLARE_STATIC_KEY_FALSE(init_on_free);
- #endif
static inline bool want_init_on_free(void)
{
- return static_branch_unlikely(&init_on_free) &&
- !page_poisoning_enabled();
+ return static_branch_unlikely(&init_on_free);
}
- #ifdef CONFIG_DEBUG_PAGEALLOC
- extern void init_debug_pagealloc(void);
- #else
- static inline void init_debug_pagealloc(void) {}
- #endif
extern bool _debug_pagealloc_enabled_early;
DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
return static_branch_unlikely(&_debug_pagealloc_enabled);
}
- #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
- extern void __kernel_map_pages(struct page *page, int numpages, int enable);
-
+ #ifdef CONFIG_DEBUG_PAGEALLOC
/*
- * When called in DEBUG_PAGEALLOC context, the call should most likely be
- * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
+ * To support DEBUG_PAGEALLOC architecture must ensure that
+ * __kernel_map_pages() never fails
*/
- static inline void
- kernel_map_pages(struct page *page, int numpages, int enable)
- {
- __kernel_map_pages(page, numpages, enable);
- }
- #ifdef CONFIG_HIBERNATION
- extern bool kernel_page_present(struct page *page);
- #endif /* CONFIG_HIBERNATION */
- #else /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
- static inline void
- kernel_map_pages(struct page *page, int numpages, int enable) {}
- #ifdef CONFIG_HIBERNATION
- static inline bool kernel_page_present(struct page *page) { return true; }
- #endif /* CONFIG_HIBERNATION */
- #endif /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+ extern void __kernel_map_pages(struct page *page, int numpages, int enable);
+
+ static inline void debug_pagealloc_map_pages(struct page *page, int numpages)
+ {
+ if (debug_pagealloc_enabled_static())
+ __kernel_map_pages(page, numpages, 1);
+ }
+
+ static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages)
+ {
+ if (debug_pagealloc_enabled_static())
+ __kernel_map_pages(page, numpages, 0);
+ }
+ #else /* CONFIG_DEBUG_PAGEALLOC */
+ static inline void debug_pagealloc_map_pages(struct page *page, int numpages) {}
+ static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages) {}
+ #endif /* CONFIG_DEBUG_PAGEALLOC */
#ifdef __HAVE_ARCH_GATE_AREA
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
NR_ZONE_UNEVICTABLE,
NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
- NR_PAGETABLE, /* used for pagetables */
/* Second 128 byte cacheline */
NR_BOUNCE,
#if IS_ENABLED(CONFIG_ZSMALLOC)
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
+ NR_PAGETABLE, /* used for pagetables */
NR_VM_NODE_STAT_ITEMS
};
* DMA mask is assumed when ZONE_DMA32 is defined. Some 64-bit
* platforms may need both zones as they support peripherals with
* different DMA addressing limitations.
- *
- * Some examples:
- *
- * - i386 and x86_64 have a fixed 16M ZONE_DMA and ZONE_DMA32 for the
- * rest of the lower 4G.
- *
- * - arm only uses ZONE_DMA, the size, up to 4G, may vary depending on
- * the specific device.
- *
- * - arm64 has a fixed 1G ZONE_DMA and ZONE_DMA32 for the rest of the
- * lower 4G.
- *
- * - powerpc only uses ZONE_DMA, the size, up to 2G, may vary
- * depending on the specific device.
- *
- * - s390 uses ZONE_DMA fixed to the lower 2G.
- *
- * - ia64 and riscv only use ZONE_DMA32.
- *
- * - parisc uses neither.
*/
#ifdef CONFIG_ZONE_DMA
ZONE_DMA,
#endif
struct pglist_data *zone_pgdat;
struct per_cpu_pageset __percpu *pageset;
+ /*
+ * the high and batch values are copied to individual pagesets for
+ * faster access
+ */
+ int pageset_high;
+ int pageset_batch;
#ifndef CONFIG_SPARSEMEM
/*
#define subsection_map_init(_pfn, _nr_pages) do {} while (0)
#endif /* CONFIG_SPARSEMEM */
- /*
- * During memory init memblocks map pfns to nids. The search is expensive and
- * this caches recent lookups. The implementation of __early_pfn_to_nid
- * may treat start/end as pfns or sections.
- */
- struct mminit_pfnnid_cache {
- unsigned long last_start;
- unsigned long last_end;
- int last_nid;
- };
-
/*
* If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
* need to check pfn validity within that MAX_ORDER_NR_PAGES block.
#define pfn_valid_within(pfn) (1)
#endif
- #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
- /*
- * pfn_valid() is meant to be able to tell if a given PFN has valid memmap
- * associated with it or not. This means that a struct page exists for this
- * pfn. The caller cannot assume the page is fully initialized in general.
- * Hotplugable pages might not have been onlined yet. pfn_to_online_page()
- * will ensure the struct page is fully online and initialized. Special pages
- * (e.g. ZONE_DEVICE) are never onlined and should be treated accordingly.
- *
- * In FLATMEM, it is expected that holes always have valid memmap as long as
- * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed
- * that a valid section has a memmap for the entire section.
- *
- * However, an ARM, and maybe other embedded architectures in the future
- * free memmap backing holes to save memory on the assumption the memmap is
- * never used. The page_zone linkages are then broken even though pfn_valid()
- * returns true. A walker of the full memmap must then do this additional
- * check to ensure the memmap they are looking at is sane by making sure
- * the zone and PFN linkages are still valid. This is expensive, but walkers
- * of the full memmap are extremely rare.
- */
- bool memmap_valid_within(unsigned long pfn,
- struct page *page, struct zone *zone);
- #else
- static inline bool memmap_valid_within(unsigned long pfn,
- struct page *page, struct zone *zone)
- {
- return true;
- }
- #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
#endif /* !__GENERATING_BOUNDS.H */
#endif /* !__ASSEMBLY__ */
#endif /* _LINUX_MMZONE_H */
static inline void fs_reclaim_release(gfp_t gfp_mask) { }
#endif
+ /**
+ * might_alloc - Mark possible allocation sites
+ * @gfp_mask: gfp_t flags that would be used to allocate
+ *
+ * Similar to might_sleep() and other annotations, this can be used in functions
+ * that might allocate, but often don't. Compiles to nothing without
+ * CONFIG_LOCKDEP. Includes a conditional might_sleep() if @gfp allows blocking.
+ */
+ static inline void might_alloc(gfp_t gfp_mask)
+ {
+ fs_reclaim_acquire(gfp_mask);
+ fs_reclaim_release(gfp_mask);
+
+ might_sleep_if(gfpflags_allow_blocking(gfp_mask));
+ }
+
/**
* memalloc_noio_save - Marks implicit GFP_NOIO allocation scope.
*
extern void membarrier_exec_mmap(struct mm_struct *mm);
+extern void membarrier_update_current_mm(struct mm_struct *next_mm);
+
#else
#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
}
+static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+}
#endif
#endif /* _LINUX_SCHED_MM_H */
/* cgroup namespace for init task */
struct cgroup_namespace init_cgroup_ns = {
- .count = REFCOUNT_INIT(2),
+ .ns.count = REFCOUNT_INIT(2),
.user_ns = &init_user_ns,
.ns.ops = &cgroupns_operations,
.ns.inum = PROC_CGROUP_INIT_INO,
* - cpuset: a task can be moved into an empty cpuset, and again it takes
* masks of ancestors.
*
- * - memcg: use_hierarchy is on by default and the cgroup file for the flag
- * is not created.
- *
* - blkcg: blk-throttle becomes properly hierarchical.
*
* - debug: disallowed on the default hierarchy.
if (err)
goto err_list_del;
- if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
- cgroup_parent(parent)) {
- pr_warn("%s (%d) created nested cgroup for controller \"%s\" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.\n",
- current->comm, current->pid, ss->name);
- if (!strcmp(ss->name, "memory"))
- pr_warn("\"memory\" requires setting use_hierarchy to 1 on the root\n");
- ss->warned_broken_hierarchy = true;
- }
-
return css;
err_list_del:
mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
account * (THREAD_SIZE / 1024));
else
- mod_lruvec_slab_state(stack, NR_KERNEL_STACK_KB,
+ mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
account * (THREAD_SIZE / 1024));
}
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
set_task_stack_end_magic(tsk);
+ clear_syscall_work_syscall_user_dispatch(tsk);
#ifdef CONFIG_STACKPROTECTOR
tsk->stack_canary = get_random_canary();
account_kernel_stack(tsk, 1);
kcov_task_init(tsk);
+ kmap_local_fork(tsk);
#ifdef CONFIG_FAULT_INJECTION
tsk->fail_nth = 0;
mm->vmacache_seqnum = 0;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
+ seqcount_init(&mm->write_protect_seq);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
mm->core_state = NULL;
* to manually enable the seccomp thread flag here.
*/
if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
- set_tsk_thread_flag(p, TIF_SECCOMP);
+ set_task_syscall_work(p, SECCOMP);
#endif
}
* child regardless of CLONE_PTRACE.
*/
user_disable_single_step(p);
- clear_tsk_thread_flag(p, TIF_SYSCALL_TRACE);
-#ifdef TIF_SYSCALL_EMU
- clear_tsk_thread_flag(p, TIF_SYSCALL_EMU);
+ clear_task_syscall_work(p, SYSCALL_TRACE);
+#if defined(CONFIG_GENERIC_ENTRY) || defined(TIF_SYSCALL_EMU)
+ clear_task_syscall_work(p, SYSCALL_EMU);
#endif
clear_tsk_latency_tracing(p);
INIT_LIST_HEAD(&p->thread_group);
p->task_works = NULL;
+#ifdef CONFIG_KRETPROBES
+ p->kretprobe_instances.first = NULL;
+#endif
+
/*
* Ensure that the cgroup subsystem policies allow the new process to be
* forked. It should be noted that the new process's css_set can be changed
raw_spin_unlock_irq(&worker->lock);
if (work) {
+ kthread_work_func_t func = work->func;
__set_current_state(TASK_RUNNING);
+ trace_sched_kthread_work_execute_start(work);
work->func(work);
+ /*
+ * Avoid dereferencing work after this point. The trace
+ * event only cares about the address.
+ */
+ trace_sched_kthread_work_execute_end(work, func);
} else if (!freezing(current))
schedule();
* A good practice is to add the cpu number also into the worker name.
* For example, use kthread_create_worker_on_cpu(cpu, "helper/%d", cpu).
*
- * Returns a pointer to the allocated worker on success, ERR_PTR(-ENOMEM)
+ * CPU hotplug:
+ * The kthread worker API is simple and generic. It just provides a way
+ * to create, use, and destroy workers.
+ *
+ * It is up to the API user how to handle CPU hotplug. They have to decide
+ * how to handle pending work items, prevent queuing new ones, and
+ * restore the functionality when the CPU goes off and on. There are a
+ * few catches:
+ *
+ * - CPU affinity gets lost when it is scheduled on an offline CPU.
+ *
+ * - The worker might not exist when the CPU was off when the user
+ * created the workers.
+ *
+ * Good practice is to implement two CPU hotplug callbacks and to
+ * destroy/create the worker when the CPU goes down/up.
+ *
+ * Return:
+ * The pointer to the allocated worker on success, ERR_PTR(-ENOMEM)
* when the needed structures could not get allocated, and ERR_PTR(-EINTR)
* when the worker was SIGKILLed.
*/
{
kthread_insert_work_sanity_check(worker, work);
+ trace_sched_kthread_work_queue_work(worker, work);
+
list_add_tail(&work->node, pos);
work->worker = worker;
if (!worker->current_work && likely(worker->task))
tsk->active_mm = mm;
}
tsk->mm = mm;
+ membarrier_update_current_mm(mm);
switch_mm_irqs_off(active_mm, mm, tsk);
local_irq_enable();
task_unlock(tsk);
finish_arch_post_lock_switch();
#endif
+ /*
+ * When a kthread starts operating on an address space, the loop
+ * in membarrier_{private,global}_expedited() may not observe
+ * that tsk->mm, and not issue an IPI. Membarrier requires a
+ * memory barrier after storing to tsk->mm, before accessing
+ * user-space memory. A full memory barrier for membarrier
+ * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+ * mmdrop(), or explicitly with smp_mb().
+ */
if (active_mm != mm)
mmdrop(active_mm);
+ else
+ smp_mb();
to_kthread(tsk)->oldfs = force_uaccess_begin();
}
force_uaccess_end(to_kthread(tsk)->oldfs);
task_lock(tsk);
+ /*
+ * When a kthread stops operating on an address space, the loop
+ * in membarrier_{private,global}_expedited() may not observe
+ * that tsk->mm, and not issue an IPI. Membarrier requires a
+ * memory barrier after accessing user-space memory, before
+ * clearing tsk->mm.
+ */
+ smp_mb__after_spinlock();
sync_mm_rss(mm);
local_irq_disable();
tsk->mm = NULL;
+ membarrier_update_current_mm(NULL);
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
local_irq_enable();
return 0;
}
- ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+ ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
mmput(mm);
return ret;
const struct cred *old_cred;
BUG_ON(!child->ptrace);
- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
-#ifdef TIF_SYSCALL_EMU
- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
+ clear_task_syscall_work(child, SYSCALL_TRACE);
+#if defined(CONFIG_GENERIC_ENTRY) || defined(TIF_SYSCALL_EMU)
+ clear_task_syscall_work(child, SYSCALL_EMU);
#endif
child->parent = child->real_parent;
return -EIO;
if (request == PTRACE_SYSCALL)
- set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ set_task_syscall_work(child, SYSCALL_TRACE);
else
- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ clear_task_syscall_work(child, SYSCALL_TRACE);
-#ifdef TIF_SYSCALL_EMU
+#if defined(CONFIG_GENERIC_ENTRY) || defined(TIF_SYSCALL_EMU)
if (request == PTRACE_SYSEMU || request == PTRACE_SYSEMU_SINGLESTEP)
- set_tsk_thread_flag(child, TIF_SYSCALL_EMU);
+ set_task_syscall_work(child, SYSCALL_EMU);
else
- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
+ clear_task_syscall_work(child, SYSCALL_EMU);
#endif
if (is_singleblock(request)) {
{
struct worker_pool *pool = pwq->pool;
+ /* record the work call stack in order to print it in KASAN reports */
+ kasan_record_aux_stack(work);
+
/* we own @work, set data and link */
set_work_pwq(work, pwq, extra_flags);
list_add_tail(&work->entry, head);
pool->flags |= POOL_DISASSOCIATED;
raw_spin_unlock_irq(&pool->lock);
+
+ for_each_pool_worker(worker, pool)
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_active_mask) < 0);
+
mutex_unlock(&wq_pool_attach_mutex);
/*
#include <linux/mutex.h>
#include <linux/ww_mutex.h>
#include <linux/sched.h>
+ #include <linux/sched/mm.h>
#include <linux/delay.h>
#include <linux/lockdep.h>
#include <linux/spinlock.h>
* Normal standalone locks, for the circular and irq-context
* dependency tests:
*/
-static DEFINE_RAW_SPINLOCK(lock_A);
-static DEFINE_RAW_SPINLOCK(lock_B);
-static DEFINE_RAW_SPINLOCK(lock_C);
-static DEFINE_RAW_SPINLOCK(lock_D);
+static DEFINE_SPINLOCK(lock_A);
+static DEFINE_SPINLOCK(lock_B);
+static DEFINE_SPINLOCK(lock_C);
+static DEFINE_SPINLOCK(lock_D);
static DEFINE_RWLOCK(rwlock_A);
static DEFINE_RWLOCK(rwlock_B);
* but X* and Y* are different classes. We do this so that
* we do not trigger a real lockup:
*/
-static DEFINE_RAW_SPINLOCK(lock_X1);
-static DEFINE_RAW_SPINLOCK(lock_X2);
-static DEFINE_RAW_SPINLOCK(lock_Y1);
-static DEFINE_RAW_SPINLOCK(lock_Y2);
-static DEFINE_RAW_SPINLOCK(lock_Z1);
-static DEFINE_RAW_SPINLOCK(lock_Z2);
+static DEFINE_SPINLOCK(lock_X1);
+static DEFINE_SPINLOCK(lock_X2);
+static DEFINE_SPINLOCK(lock_Y1);
+static DEFINE_SPINLOCK(lock_Y2);
+static DEFINE_SPINLOCK(lock_Z1);
+static DEFINE_SPINLOCK(lock_Z2);
static DEFINE_RWLOCK(rwlock_X1);
static DEFINE_RWLOCK(rwlock_X2);
*/
#define INIT_CLASS_FUNC(class) \
static noinline void \
-init_class_##class(raw_spinlock_t *lock, rwlock_t *rwlock, \
+init_class_##class(spinlock_t *lock, rwlock_t *rwlock, \
struct mutex *mutex, struct rw_semaphore *rwsem)\
{ \
- raw_spin_lock_init(lock); \
+ spin_lock_init(lock); \
rwlock_init(rwlock); \
mutex_init(mutex); \
init_rwsem(rwsem); \
* Shortcuts for lock/unlock API variants, to keep
* the testcases compact:
*/
-#define L(x) raw_spin_lock(&lock_##x)
-#define U(x) raw_spin_unlock(&lock_##x)
+#define L(x) spin_lock(&lock_##x)
+#define U(x) spin_unlock(&lock_##x)
#define LU(x) L(x); U(x)
-#define SI(x) raw_spin_lock_init(&lock_##x)
+#define SI(x) spin_lock_init(&lock_##x)
#define WL(x) write_lock(&rwlock_##x)
#define WU(x) write_unlock(&rwlock_##x)
#define I2(x) \
do { \
- raw_spin_lock_init(&lock_##x); \
+ spin_lock_init(&lock_##x); \
rwlock_init(&rwlock_##x); \
mutex_init(&mutex_##x); \
init_rwsem(&rwsem_##x); \
static void ww_test_spin_nest_unlocked(void)
{
- raw_spin_lock_nest_lock(&lock_A, &o.base);
+ spin_lock_nest_lock(&lock_A, &o.base);
U(A);
}
+/* This is not a deadlock, because we have X1 to serialize Y1 and Y2 */
+static void ww_test_spin_nest_lock(void)
+{
+ spin_lock(&lock_X1);
+ spin_lock_nest_lock(&lock_Y1, &lock_X1);
+ spin_lock(&lock_A);
+ spin_lock_nest_lock(&lock_Y2, &lock_X1);
+ spin_unlock(&lock_A);
+ spin_unlock(&lock_Y2);
+ spin_unlock(&lock_Y1);
+ spin_unlock(&lock_X1);
+}
+
static void ww_test_unneeded_slow(void)
{
WWAI(&t);
dotest(ww_test_spin_nest_unlocked, FAILURE, LOCKTYPE_WW);
pr_cont("\n");
+ print_testname("spinlock nest test");
+ dotest(ww_test_spin_nest_lock, SUCCESS, LOCKTYPE_WW);
+ pr_cont("\n");
+
printk(" -----------------------------------------------------\n");
printk(" |block | try |context|\n");
printk(" -----------------------------------------------------\n");
pr_cont("\n");
}
+ static void fs_reclaim_correct_nesting(void)
+ {
+ fs_reclaim_acquire(GFP_KERNEL);
+ might_alloc(GFP_NOFS);
+ fs_reclaim_release(GFP_KERNEL);
+ }
+
+ static void fs_reclaim_wrong_nesting(void)
+ {
+ fs_reclaim_acquire(GFP_KERNEL);
+ might_alloc(GFP_KERNEL);
+ fs_reclaim_release(GFP_KERNEL);
+ }
+
+ static void fs_reclaim_protected_nesting(void)
+ {
+ unsigned int flags;
+
+ fs_reclaim_acquire(GFP_KERNEL);
+ flags = memalloc_nofs_save();
+ might_alloc(GFP_KERNEL);
+ memalloc_nofs_restore(flags);
+ fs_reclaim_release(GFP_KERNEL);
+ }
+
+ static void fs_reclaim_tests(void)
+ {
+ printk(" --------------------\n");
+ printk(" | fs_reclaim tests |\n");
+ printk(" --------------------\n");
+
+ print_testname("correct nesting");
+ dotest(fs_reclaim_correct_nesting, SUCCESS, 0);
+ pr_cont("\n");
+
+ print_testname("wrong nesting");
+ dotest(fs_reclaim_wrong_nesting, FAILURE, 0);
+ pr_cont("\n");
+
+ print_testname("protected nesting");
+ dotest(fs_reclaim_protected_nesting, SUCCESS, 0);
+ pr_cont("\n");
+ }
+
void locking_selftest(void)
{
/*
if (IS_ENABLED(CONFIG_QUEUED_RWLOCKS))
queued_read_lock_tests();
+ fs_reclaim_tests();
+
if (unexpected_testcase_failures) {
printk("-----------------------------------------------------------------\n");
debug_locks = 0;
information includes global and per chunk statistics, which can
be used to help understand percpu memory usage.
- config GUP_BENCHMARK
- bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
+ config GUP_TEST
+ bool "Enable infrastructure for get_user_pages()-related unit tests"
+ depends on DEBUG_FS
help
- Provides /sys/kernel/debug/gup_benchmark that helps with testing
- performance of get_user_pages() and related calls.
+ Provides /sys/kernel/debug/gup_test, which in turn provides a way
+ to make ioctl calls that can launch kernel-based unit tests for
+ the get_user_pages*() and pin_user_pages*() family of API calls.
- See tools/testing/selftests/vm/gup_benchmark.c
+ These tests include benchmark testing of the _fast variants of
+ get_user_pages*() and pin_user_pages*(), as well as smoke tests of
+ the non-_fast variants.
+
+ There is also a sub-test that allows running dump_page() on any
+ of up to eight pages (selected by command line args) within the
+ range of user-space addresses. These pages are either pinned via
+ pin_user_pages*(), or pinned via get_user_pages*(), as specified
+ by other command line arguments.
+
+ See tools/testing/selftests/vm/gup_test.c
+
+ comment "GUP_TEST needs to have DEBUG_FS enabled"
+ depends on !GUP_TEST && !DEBUG_FS
config GUP_GET_PTE_LOW_HIGH
bool
config MAPPING_DIRTY_HELPERS
bool
+config KMAP_LOCAL
+ bool
+
endmenu
return NULL;
}
+ static void put_compound_head(struct page *page, int refs, unsigned int flags)
+ {
+ if (flags & FOLL_PIN) {
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+ refs);
+
+ if (hpage_pincount_available(page))
+ hpage_pincount_sub(page, refs);
+ else
+ refs *= GUP_PIN_COUNTING_BIAS;
+ }
+
+ VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+ /*
+ * Calling put_page() for each ref is unnecessarily slow. Only the last
+ * ref needs a put_page().
+ */
+ if (refs > 1)
+ page_ref_sub(page, refs - 1);
+ put_page(page);
+ }
+
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
*
return true;
}
- #ifdef CONFIG_DEV_PAGEMAP_OPS
- static bool __unpin_devmap_managed_user_page(struct page *page)
- {
- int count, refs = 1;
-
- if (!page_is_devmap_managed(page))
- return false;
-
- if (hpage_pincount_available(page))
- hpage_pincount_sub(page, 1);
- else
- refs = GUP_PIN_COUNTING_BIAS;
-
- count = page_ref_sub_return(page, refs);
-
- mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-
- return true;
- }
- #else
- static bool __unpin_devmap_managed_user_page(struct page *page)
- {
- return false;
- }
- #endif /* CONFIG_DEV_PAGEMAP_OPS */
-
/**
* unpin_user_page() - release a dma-pinned page
* @page: pointer to page to be released
*/
void unpin_user_page(struct page *page)
{
- int refs = 1;
-
- page = compound_head(page);
-
- /*
- * For devmap managed pages we need to catch refcount transition from
- * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
- * page is free and we need to inform the device driver through
- * callback. See include/linux/memremap.h and HMM for details.
- */
- if (__unpin_devmap_managed_user_page(page))
- return;
-
- if (hpage_pincount_available(page))
- hpage_pincount_sub(page, 1);
- else
- refs = GUP_PIN_COUNTING_BIAS;
-
- if (page_ref_sub_and_test(page, refs))
- __put_page(page);
-
- mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+ put_compound_head(compound_head(page), 1, FOLL_PIN);
}
EXPORT_SYMBOL(unpin_user_page);
if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
return -EFAULT;
+ if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
+ return -EOPNOTSUPP;
+
if (write) {
if (!(vm_flags & VM_WRITE)) {
if (!(gup_flags & FOLL_FORCE))
goto next_page;
}
- if (!vma || check_vma_flags(vma, gup_flags)) {
+ if (!vma) {
ret = -EFAULT;
goto out;
}
+ ret = check_vma_flags(vma, gup_flags);
+ if (ret)
+ goto out;
+
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
}
#endif /* CONFIG_ELF_CORE */
- #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
- static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
- {
- long i;
- struct vm_area_struct *vma_prev = NULL;
-
- for (i = 0; i < nr_pages; i++) {
- struct vm_area_struct *vma = vmas[i];
-
- if (vma == vma_prev)
- continue;
-
- vma_prev = vma;
-
- if (vma_is_fsdax(vma))
- return true;
- }
- return false;
- }
-
#ifdef CONFIG_CMA
static long check_and_migrate_cma_pages(struct mm_struct *mm,
unsigned long start,
struct vm_area_struct **vmas,
unsigned int gup_flags)
{
- struct vm_area_struct **vmas_tmp = vmas;
unsigned long flags = 0;
- long rc, i;
+ long rc;
- if (gup_flags & FOLL_LONGTERM) {
- if (!pages)
- return -EINVAL;
-
- if (!vmas_tmp) {
- vmas_tmp = kcalloc(nr_pages,
- sizeof(struct vm_area_struct *),
- GFP_KERNEL);
- if (!vmas_tmp)
- return -ENOMEM;
- }
+ if (gup_flags & FOLL_LONGTERM)
flags = memalloc_nocma_save();
- }
- rc = __get_user_pages_locked(mm, start, nr_pages, pages,
- vmas_tmp, NULL, gup_flags);
+ rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
+ gup_flags);
if (gup_flags & FOLL_LONGTERM) {
- if (rc < 0)
- goto out;
-
- if (check_dax_vmas(vmas_tmp, rc)) {
- if (gup_flags & FOLL_PIN)
- unpin_user_pages(pages, rc);
- else
- for (i = 0; i < rc; i++)
- put_page(pages[i]);
- rc = -EOPNOTSUPP;
- goto out;
- }
-
- rc = check_and_migrate_cma_pages(mm, start, rc, pages,
- vmas_tmp, gup_flags);
- out:
+ if (rc > 0)
+ rc = check_and_migrate_cma_pages(mm, start, rc, pages,
+ vmas, gup_flags);
memalloc_nocma_restore(flags);
}
-
- if (vmas_tmp != vmas)
- kfree(vmas_tmp);
return rc;
}
- #else /* !CONFIG_FS_DAX && !CONFIG_CMA */
- static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int flags)
- {
- return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
- NULL, flags);
- }
- #endif /* CONFIG_FS_DAX || CONFIG_CMA */
static bool is_valid_gup_flags(unsigned int gup_flags)
{
EXPORT_SYMBOL(get_user_pages);
/**
- * get_user_pages_locked() is suitable to replace the form:
+ * get_user_pages_locked() - variant of get_user_pages()
+ *
+ * @start: starting user address
+ * @nr_pages: number of pages from start to pin
+ * @gup_flags: flags modifying lookup behaviour
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long. Or NULL, if caller
+ * only intends to ensure the pages are faulted in.
+ * @locked: pointer to lock flag indicating whether lock is held and
+ * subsequently whether VM_FAULT_RETRY functionality can be
+ * utilised. Lock must initially be held.
+ *
+ * It is suitable to replace the form:
*
* mmap_read_lock(mm);
* do_something()
* if (locked)
* mmap_read_unlock(mm);
*
- * @start: starting user address
- * @nr_pages: number of pages from start to pin
- * @gup_flags: flags modifying lookup behaviour
- * @pages: array that receives pointers to the pages pinned.
- * Should be at least nr_pages long. Or NULL, if caller
- * only intends to ensure the pages are faulted in.
- * @locked: pointer to lock flag indicating whether lock is held and
- * subsequently whether VM_FAULT_RETRY functionality can be
- * utilised. Lock must initially be held.
- *
* We can leverage the VM_FAULT_RETRY functionality in the page fault
* paths better by using either get_user_pages_locked() or
* get_user_pages_unlocked().
* This code is based heavily on the PowerPC implementation by Nick Piggin.
*/
#ifdef CONFIG_HAVE_FAST_GUP
-#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
-
-/*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks. For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
- *
- * Setting ptes from not present to present goes:
- *
- * ptep->pte_high = h;
- * smp_wmb();
- * ptep->pte_low = l;
- *
- * And present to not present goes:
- *
- * ptep->pte_low = 0;
- * smp_wmb();
- * ptep->pte_high = 0;
- *
- * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
- * We load pte_high *after* loading pte_low, which ensures we don't see an older
- * value of pte_high. *Then* we recheck pte_low, which ensures that we haven't
- * picked up a changed pte high. We might have gotten rubbish values from
- * pte_low and pte_high, but we are guaranteed that pte_low will not have the
- * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
- * operates on present ptes we're safe.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
- pte_t pte;
-
- do {
- pte.pte_low = ptep->pte_low;
- smp_rmb();
- pte.pte_high = ptep->pte_high;
- smp_rmb();
- } while (unlikely(pte.pte_low != ptep->pte_low));
-
- return pte;
-}
-#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-/*
- * We require that the PTE can be read atomically.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
- return ptep_get(ptep);
-}
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
- static void put_compound_head(struct page *page, int refs, unsigned int flags)
- {
- if (flags & FOLL_PIN) {
- mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
- refs);
-
- if (hpage_pincount_available(page))
- hpage_pincount_sub(page, refs);
- else
- refs *= GUP_PIN_COUNTING_BIAS;
- }
-
- VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
- /*
- * Calling put_page() for each ref is unnecessarily slow. Only the last
- * ref needs a put_page().
- */
- if (refs > 1)
- page_ref_sub(page, refs - 1);
- put_page(page);
- }
-
static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
unsigned int flags,
struct page **pages)
ptem = ptep = pte_offset_map(&pmd, addr);
do {
- pte_t pte = gup_get_pte(ptep);
+ pte_t pte = ptep_get_lockless(ptep);
struct page *head, *page;
/*
return ret;
}
- static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+ static unsigned long lockless_pages_from_mm(unsigned long start,
+ unsigned long end,
+ unsigned int gup_flags,
+ struct page **pages)
+ {
+ unsigned long flags;
+ int nr_pinned = 0;
+ unsigned seq;
+
+ if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
+ !gup_fast_permitted(start, end))
+ return 0;
+
+ if (gup_flags & FOLL_PIN) {
+ seq = raw_read_seqcount(¤t->mm->write_protect_seq);
+ if (seq & 1)
+ return 0;
+ }
+
+ /*
+ * Disable interrupts. The nested form is used, in order to allow full,
+ * general purpose use of this routine.
+ *
+ * With interrupts disabled, we block page table pages from being freed
+ * from under us. See struct mmu_table_batch comments in
+ * include/asm-generic/tlb.h for more details.
+ *
+ * We do not adopt an rcu_read_lock() here as we also want to block IPIs
+ * that come from THPs splitting.
+ */
+ local_irq_save(flags);
+ gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
+ local_irq_restore(flags);
+
+ /*
+ * When pinning pages for DMA there could be a concurrent write protect
+ * from fork() via copy_page_range(), in this case always fail fast GUP.
+ */
+ if (gup_flags & FOLL_PIN) {
+ if (read_seqcount_retry(¤t->mm->write_protect_seq, seq)) {
+ unpin_user_pages(pages, nr_pinned);
+ return 0;
+ }
+ }
+ return nr_pinned;
+ }
+
+ static int internal_get_user_pages_fast(unsigned long start,
+ unsigned long nr_pages,
unsigned int gup_flags,
struct page **pages)
{
- unsigned long addr, len, end;
- unsigned long flags;
- int nr_pinned = 0, ret = 0;
+ unsigned long len, end;
+ unsigned long nr_pinned;
+ int ret;
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
might_lock_read(¤t->mm->mmap_lock);
start = untagged_addr(start) & PAGE_MASK;
- addr = start;
- len = (unsigned long) nr_pages << PAGE_SHIFT;
- end = start + len;
-
- if (end <= start)
+ len = nr_pages << PAGE_SHIFT;
+ if (check_add_overflow(start, len, &end))
return 0;
if (unlikely(!access_ok((void __user *)start, len)))
return -EFAULT;
- /*
- * Disable interrupts. The nested form is used, in order to allow
- * full, general purpose use of this routine.
- *
- * With interrupts disabled, we block page table pages from being
- * freed from under us. See struct mmu_table_batch comments in
- * include/asm-generic/tlb.h for more details.
- *
- * We do not adopt an rcu_read_lock(.) here as we also want to
- * block IPIs that come from THPs splitting.
- */
- if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) {
- unsigned long fast_flags = gup_flags;
-
- local_irq_save(flags);
- gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned);
- local_irq_restore(flags);
- ret = nr_pinned;
- }
-
- if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) {
- /* Try to get the remaining pages with get_user_pages */
- start += nr_pinned << PAGE_SHIFT;
- pages += nr_pinned;
+ nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages);
+ if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY)
+ return nr_pinned;
- ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
- gup_flags, pages);
-
- /* Have to be a bit careful with return values */
- if (nr_pinned > 0) {
- if (ret < 0)
- ret = nr_pinned;
- else
- ret += nr_pinned;
- }
+ /* Slow path: try to get the remaining pages with get_user_pages */
+ start += nr_pinned << PAGE_SHIFT;
+ pages += nr_pinned;
+ ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags,
+ pages);
+ if (ret < 0) {
+ /*
+ * The caller has to unpin the pages we already pinned so
+ * returning -errno is not an option
+ */
+ if (nr_pinned)
+ return nr_pinned;
+ return ret;
}
-
- return ret;
+ return ret + nr_pinned;
}
+
/**
* get_user_pages_fast_only() - pin user pages in memory
* @start: starting user address
#include <asm/tlbflush.h>
#include <linux/vmalloc.h>
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
-DEFINE_PER_CPU(int, __kmap_atomic_idx);
-#endif
-
/*
* Virtual_count is not a pure "count".
* 0 means that it is not mapped, and has not been mapped
atomic_long_t _totalhigh_pages __read_mostly;
EXPORT_SYMBOL(_totalhigh_pages);
-EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx);
-
-unsigned int nr_free_highpages (void)
+unsigned int __nr_free_highpages (void)
{
struct zone *zone;
unsigned int pages = 0;
do { spin_unlock(&kmap_lock); (void)(flags); } while (0)
#endif
-struct page *kmap_to_page(void *vaddr)
+struct page *__kmap_to_page(void *vaddr)
{
unsigned long addr = (unsigned long)vaddr;
return virt_to_page(addr);
}
-EXPORT_SYMBOL(kmap_to_page);
+EXPORT_SYMBOL(__kmap_to_page);
static void flush_all_zero_pkmaps(void)
{
flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
}
-/**
- * kmap_flush_unused - flush all unused kmap mappings in order to remove stray mappings
- */
-void kmap_flush_unused(void)
+void __kmap_flush_unused(void)
{
lock_kmap();
flush_all_zero_pkmaps();
if (need_wakeup)
wake_up(pkmap_map_wait);
}
-
EXPORT_SYMBOL(kunmap_high);
-#endif /* CONFIG_HIGHMEM */
+
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+ unsigned start2, unsigned end2)
+ {
+ unsigned int i;
+
+ BUG_ON(end1 > page_size(page) || end2 > page_size(page));
+
+ for (i = 0; i < compound_nr(page); i++) {
+ void *kaddr = NULL;
+
+ if (start1 < PAGE_SIZE || start2 < PAGE_SIZE)
+ kaddr = kmap_atomic(page + i);
+
+ if (start1 >= PAGE_SIZE) {
+ start1 -= PAGE_SIZE;
+ end1 -= PAGE_SIZE;
+ } else {
+ unsigned this_end = min_t(unsigned, end1, PAGE_SIZE);
+
+ if (end1 > start1)
+ memset(kaddr + start1, 0, this_end - start1);
+ end1 -= this_end;
+ start1 = 0;
+ }
+
+ if (start2 >= PAGE_SIZE) {
+ start2 -= PAGE_SIZE;
+ end2 -= PAGE_SIZE;
+ } else {
+ unsigned this_end = min_t(unsigned, end2, PAGE_SIZE);
+
+ if (end2 > start2)
+ memset(kaddr + start2, 0, this_end - start2);
+ end2 -= this_end;
+ start2 = 0;
+ }
+
+ if (kaddr) {
+ kunmap_atomic(kaddr);
+ flush_dcache_page(page + i);
+ }
+
+ if (!end1 && !end2)
+ break;
+ }
+
+ BUG_ON((start1 | start2 | end1 | end2) != 0);
+ }
+ EXPORT_SYMBOL(zero_user_segments);
+ #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_HIGHMEM */
+
+#ifdef CONFIG_KMAP_LOCAL
+
+#include <asm/kmap_size.h>
+
+/*
+ * With DEBUG_KMAP_LOCAL the stack depth is doubled and every second
+ * slot is unused which acts as a guard page
+ */
+#ifdef CONFIG_DEBUG_KMAP_LOCAL
+# define KM_INCR 2
+#else
+# define KM_INCR 1
+#endif
+
+static inline int kmap_local_idx_push(void)
+{
+ WARN_ON_ONCE(in_irq() && !irqs_disabled());
+ current->kmap_ctrl.idx += KM_INCR;
+ BUG_ON(current->kmap_ctrl.idx >= KM_MAX_IDX);
+ return current->kmap_ctrl.idx - 1;
+}
+
+static inline int kmap_local_idx(void)
+{
+ return current->kmap_ctrl.idx - 1;
+}
+
+static inline void kmap_local_idx_pop(void)
+{
+ current->kmap_ctrl.idx -= KM_INCR;
+ BUG_ON(current->kmap_ctrl.idx < 0);
+}
+
+#ifndef arch_kmap_local_post_map
+# define arch_kmap_local_post_map(vaddr, pteval) do { } while (0)
+#endif
+
+#ifndef arch_kmap_local_pre_unmap
+# define arch_kmap_local_pre_unmap(vaddr) do { } while (0)
+#endif
+
+#ifndef arch_kmap_local_post_unmap
+# define arch_kmap_local_post_unmap(vaddr) do { } while (0)
+#endif
+
+#ifndef arch_kmap_local_map_idx
+#define arch_kmap_local_map_idx(idx, pfn) kmap_local_calc_idx(idx)
+#endif
+
+#ifndef arch_kmap_local_unmap_idx
+#define arch_kmap_local_unmap_idx(idx, vaddr) kmap_local_calc_idx(idx)
+#endif
+
+#ifndef arch_kmap_local_high_get
+static inline void *arch_kmap_local_high_get(struct page *page)
+{
+ return NULL;
+}
+#endif
+
+/* Unmap a local mapping which was obtained by kmap_high_get() */
+static inline bool kmap_high_unmap_local(unsigned long vaddr)
+{
+#ifdef ARCH_NEEDS_KMAP_HIGH_GET
+ if (vaddr >= PKMAP_ADDR(0) && vaddr < PKMAP_ADDR(LAST_PKMAP)) {
+ kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)]));
+ return true;
+ }
+#endif
+ return false;
+}
+
+static inline int kmap_local_calc_idx(int idx)
+{
+ return idx + KM_MAX_IDX * smp_processor_id();
+}
+
+static pte_t *__kmap_pte;
+
+static pte_t *kmap_get_pte(void)
+{
+ if (!__kmap_pte)
+ __kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN));
+ return __kmap_pte;
+}
+
+void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot)
+{
+ pte_t pteval, *kmap_pte = kmap_get_pte();
+ unsigned long vaddr;
+ int idx;
+
+ /*
+ * Disable migration so resulting virtual address is stable
+ * accross preemption.
+ */
+ migrate_disable();
+ preempt_disable();
+ idx = arch_kmap_local_map_idx(kmap_local_idx_push(), pfn);
+ vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
+ BUG_ON(!pte_none(*(kmap_pte - idx)));
+ pteval = pfn_pte(pfn, prot);
+ set_pte_at(&init_mm, vaddr, kmap_pte - idx, pteval);
+ arch_kmap_local_post_map(vaddr, pteval);
+ current->kmap_ctrl.pteval[kmap_local_idx()] = pteval;
+ preempt_enable();
+
+ return (void *)vaddr;
+}
+EXPORT_SYMBOL_GPL(__kmap_local_pfn_prot);
+
+void *__kmap_local_page_prot(struct page *page, pgprot_t prot)
+{
+ void *kmap;
+
+ /*
+ * To broaden the usage of the actual kmap_local() machinery always map
+ * pages when debugging is enabled and the architecture has no problems
+ * with alias mappings.
+ */
+ if (!IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !PageHighMem(page))
+ return page_address(page);
+
+ /* Try kmap_high_get() if architecture has it enabled */
+ kmap = arch_kmap_local_high_get(page);
+ if (kmap)
+ return kmap;
+
+ return __kmap_local_pfn_prot(page_to_pfn(page), prot);
+}
+EXPORT_SYMBOL(__kmap_local_page_prot);
+
+void kunmap_local_indexed(void *vaddr)
+{
+ unsigned long addr = (unsigned long) vaddr & PAGE_MASK;
+ pte_t *kmap_pte = kmap_get_pte();
+ int idx;
+
+ if (addr < __fix_to_virt(FIX_KMAP_END) ||
+ addr > __fix_to_virt(FIX_KMAP_BEGIN)) {
+ if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP)) {
+ /* This _should_ never happen! See above. */
+ WARN_ON_ONCE(1);
+ return;
+ }
+ /*
+ * Handle mappings which were obtained by kmap_high_get()
+ * first as the virtual address of such mappings is below
+ * PAGE_OFFSET. Warn for all other addresses which are in
+ * the user space part of the virtual address space.
+ */
+ if (!kmap_high_unmap_local(addr))
+ WARN_ON_ONCE(addr < PAGE_OFFSET);
+ return;
+ }
+
+ preempt_disable();
+ idx = arch_kmap_local_unmap_idx(kmap_local_idx(), addr);
+ WARN_ON_ONCE(addr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
+
+ arch_kmap_local_pre_unmap(addr);
+ pte_clear(&init_mm, addr, kmap_pte - idx);
+ arch_kmap_local_post_unmap(addr);
+ current->kmap_ctrl.pteval[kmap_local_idx()] = __pte(0);
+ kmap_local_idx_pop();
+ preempt_enable();
+ migrate_enable();
+}
+EXPORT_SYMBOL(kunmap_local_indexed);
+
+/*
+ * Invoked before switch_to(). This is safe even when during or after
+ * clearing the maps an interrupt which needs a kmap_local happens because
+ * the task::kmap_ctrl.idx is not modified by the unmapping code so a
+ * nested kmap_local will use the next unused index and restore the index
+ * on unmap. The already cleared kmaps of the outgoing task are irrelevant
+ * because the interrupt context does not know about them. The same applies
+ * when scheduling back in for an interrupt which happens before the
+ * restore is complete.
+ */
+void __kmap_local_sched_out(void)
+{
+ struct task_struct *tsk = current;
+ pte_t *kmap_pte = kmap_get_pte();
+ int i;
+
+ /* Clear kmaps */
+ for (i = 0; i < tsk->kmap_ctrl.idx; i++) {
+ pte_t pteval = tsk->kmap_ctrl.pteval[i];
+ unsigned long addr;
+ int idx;
+
+ /* With debug all even slots are unmapped and act as guard */
+ if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !(i & 0x01)) {
+ WARN_ON_ONCE(!pte_none(pteval));
+ continue;
+ }
+ if (WARN_ON_ONCE(pte_none(pteval)))
+ continue;
+
+ /*
+ * This is a horrible hack for XTENSA to calculate the
+ * coloured PTE index. Uses the PFN encoded into the pteval
+ * and the map index calculation because the actual mapped
+ * virtual address is not stored in task::kmap_ctrl.
+ * For any sane architecture this is optimized out.
+ */
+ idx = arch_kmap_local_map_idx(i, pte_pfn(pteval));
+
+ addr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
+ arch_kmap_local_pre_unmap(addr);
+ pte_clear(&init_mm, addr, kmap_pte - idx);
+ arch_kmap_local_post_unmap(addr);
+ }
+}
+
+void __kmap_local_sched_in(void)
+{
+ struct task_struct *tsk = current;
+ pte_t *kmap_pte = kmap_get_pte();
+ int i;
+
+ /* Restore kmaps */
+ for (i = 0; i < tsk->kmap_ctrl.idx; i++) {
+ pte_t pteval = tsk->kmap_ctrl.pteval[i];
+ unsigned long addr;
+ int idx;
+
+ /* With debug all even slots are unmapped and act as guard */
+ if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !(i & 0x01)) {
+ WARN_ON_ONCE(!pte_none(pteval));
+ continue;
+ }
+ if (WARN_ON_ONCE(pte_none(pteval)))
+ continue;
+
+ /* See comment in __kmap_local_sched_out() */
+ idx = arch_kmap_local_map_idx(i, pte_pfn(pteval));
+ addr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
+ set_pte_at(&init_mm, addr, kmap_pte - idx, pteval);
+ arch_kmap_local_post_map(addr, pteval);
+ }
+}
+
+void kmap_local_fork(struct task_struct *tsk)
+{
+ if (WARN_ON_ONCE(tsk->kmap_ctrl.idx))
+ memset(&tsk->kmap_ctrl, 0, sizeof(tsk->kmap_ctrl));
+}
+
+#endif
#if defined(HASHED_PAGE_VIRTUAL)