Catalin Marinas [Fri, 20 May 2022 17:50:35 +0000 (18:50 +0100)]
Merge branches 'for-next/sme', 'for-next/stacktrace', 'for-next/fault-in-subpage', 'for-next/misc', 'for-next/ftrace' and 'for-next/crashkernel', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf/arm-cmn: Decode CAL devices properly in debugfs
perf/arm-cmn: Fix filter_sel lookup
perf/marvell_cn10k: Fix tad_pmu_event_init() to check pmu type first
drivers/perf: hisi: Add Support for CPA PMU
drivers/perf: hisi: Associate PMUs in SICL with CPUs online
drivers/perf: arm_spe: Expose saturating counter to 16-bit
perf/arm-cmn: Add CMN-700 support
perf/arm-cmn: Refactor occupancy filter selector
perf/arm-cmn: Add CMN-650 support
dt-bindings: perf: arm-cmn: Add CMN-650 and CMN-700
perf: check return value of armpmu_request_irq()
perf: RISC-V: Remove non-kernel-doc ** comments
* for-next/sme: (30 commits)
: Scalable Matrix Extension (SME) support.
arm64/sve: Move sve_free() into SVE code section
arm64/sve: Make kernel FPU protection RT friendly
arm64/sve: Delay freeing memory in fpsimd_flush_thread()
arm64/sme: More sensibly define the size for the ZA register set
arm64/sme: Fix NULL check after kzalloc
arm64/sme: Add ID_AA64SMFR0_EL1 to __read_sysreg_by_encoding()
arm64/sme: Provide Kconfig for SME
KVM: arm64: Handle SME host state when running guests
KVM: arm64: Trap SME usage in guest
KVM: arm64: Hide SME system registers from guests
arm64/sme: Save and restore streaming mode over EFI runtime calls
arm64/sme: Disable streaming mode and ZA when flushing CPU state
arm64/sme: Add ptrace support for ZA
arm64/sme: Implement ptrace support for streaming mode SVE registers
arm64/sme: Implement ZA signal handling
arm64/sme: Implement streaming SVE signal handling
arm64/sme: Disable ZA and streaming mode when handling signals
arm64/sme: Implement traps and syscall handling for SME
arm64/sme: Implement ZA context switching
arm64/sme: Implement streaming SVE context switching
...
* for-next/stacktrace:
: Stacktrace cleanups.
arm64: stacktrace: align with common naming
arm64: stacktrace: rename stackframe to unwind_state
arm64: stacktrace: rename unwinder functions
arm64: stacktrace: make struct stackframe private to stacktrace.c
arm64: stacktrace: delete PCS comment
arm64: stacktrace: remove NULL task check from unwind_frame()
* for-next/fault-in-subpage:
: btrfs search_ioctl() live-lock fix using fault_in_subpage_writeable().
btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults
arm64: Add support for user sub-page fault probing
mm: Add fault_in_subpage_writeable() to probe at sub-page granularity
* for-next/misc:
: Miscellaneous patches.
arm64: Kconfig.platforms: Add comments
arm64: Kconfig: Fix indentation and add comments
arm64: mm: avoid writable executable mappings in kexec/hibernate code
arm64: lds: move special code sections out of kernel exec segment
arm64/hugetlb: Implement arm64 specific huge_ptep_get()
arm64/hugetlb: Use ptep_get() to get the pte value of a huge page
arm64: mm: Make arch_faults_on_old_pte() check for migratability
arm64: mte: Clean up user tag accessors
arm64/hugetlb: Drop TLB flush from get_clear_flush()
arm64: Declare non global symbols as static
arm64: mm: Cleanup useless parameters in zone_sizes_init()
arm64: fix types in copy_highpage()
arm64: Set ARCH_NR_GPIO to 2048 for ARCH_APPLE
arm64: cputype: Avoid overflow using MIDR_IMPLEMENTOR_MASK
arm64: document the boot requirements for MTE
arm64/mm: Compute PTRS_PER_[PMD|PUD] independently of PTRS_PER_PTE
* for-next/ftrace:
: ftrace cleanups.
arm64/ftrace: Make function graph use ftrace directly
ftrace: cleanup ftrace_graph_caller enable and disable
* for-next/crashkernel:
: Support for crashkernel reservations above ZONE_DMA.
arm64: kdump: Do not allocate crash low memory if not needed
docs: kdump: Update the crashkernel description for arm64
of: Support more than one crash kernel regions for kexec -s
of: fdt: Add memory for devices by DT property "linux,usable-memory-range"
arm64: kdump: Reimplement crashkernel=X
arm64: Use insert_resource() to simplify code
kdump: return -ENOENT if required cmdline option does not exist
arm64/sve: Move sve_free() into SVE code section
arch/arm64/kernel/fpsimd.c:294:13: warning: ‘sve_free’ defined but not used [-Wunused-function]
Fix this by moving sve_free() and __sve_free() into the existing section
protected by "#ifdef CONFIG_ARM64_SVE", now that the last user outside
that section has been removed.
Juerg Haefliger [Tue, 17 May 2022 14:16:47 +0000 (16:16 +0200)]
arm64: Kconfig: Fix indentation and add comments
The convention for indentation seems to be a single tab. Help text is
further indented by an additional two spaces. Fix the lines that
violate these rules.
While at it, add trailing comments to endif and endmenu statements for
better readability.
arm64: mm: avoid writable executable mappings in kexec/hibernate code
The temporary mappings of the low-level kexec and hibernate helpers are
created with both writable and executable attributes, which is not
necessary here, and generally best avoided. So use read-only, executable
attributes instead.
arm64: lds: move special code sections out of kernel exec segment
There are a few code sections that are emitted into the kernel's
executable .text segment simply because they contain code, but are
actually never executed via this mapping, so they can happily live in a
region that gets mapped without executable permissions, reducing the
risk of being gadgetized.
Note that the kexec and hibernate region contents are always copied into
a fresh page, and so there is no need to align them as long as the
overall size of each is below 4 KiB.
Baolin Wang [Mon, 16 May 2022 00:55:58 +0000 (08:55 +0800)]
arm64/hugetlb: Implement arm64 specific huge_ptep_get()
Currently huge_ptep_get() returns only one specific pte value for a
CONT-PTE or CONT-PMD size hugetlb page on an ARM64 system, even though
such a page consists of several contiguous pte or pmd entries sharing
the same page table attributes. It therefore does not take into account
the subpages' dirty or young bits.
This makes huge_ptep_get() inconsistent with huge_ptep_get_and_clear(),
which already takes into account the dirty and young bits of all
subpages of a CONT-PTE/PMD size hugetlb page [1]. It also means the
current huge_ptep_get() can miss dirty or young flag statistics for
hugetlb pages, for example in the gather_hugetlb_stats() function and in
CONT-PTE/PMD hugetlb monitoring with DAMON.
Thus define an ARM64-specific huge_ptep_get() implementation and enable
__HAVE_ARCH_HUGE_PTEP_GET, so that the dirty and young bits of all
subpages of a CONT-PTE/PMD size hugetlb page are taken into account by
functions that want to check those flags.
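As a rough sketch of the shape of such an implementation (not the exact
upstream diff; helpers such as num_contig_ptes() are the existing arm64
ones), the dirty and young bits of every entry in the contiguous range
are ORed into the returned value:

  pte_t huge_ptep_get(pte_t *ptep)
  {
          int ncontig, i;
          size_t pgsize;
          pte_t orig_pte = ptep_get(ptep);

          if (!pte_present(orig_pte) || !pte_cont(orig_pte))
                  return orig_pte;

          /* Walk every entry of the contiguous range */
          ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
          for (i = 0; i < ncontig; i++, ptep++) {
                  pte_t pte = ptep_get(ptep);

                  if (pte_dirty(pte))
                          orig_pte = pte_mkdirty(orig_pte);
                  if (pte_young(pte))
                          orig_pte = pte_mkyoung(orig_pte);
          }

          return orig_pte;
  }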
Baolin Wang [Mon, 16 May 2022 00:55:57 +0000 (08:55 +0800)]
arm64/hugetlb: Use ptep_get() to get the pte value of a huge page
The original huge_ptep_get() on ARM64 is just a wrapper of ptep_get(),
which will not take into account any contig-PTEs dirty and access bits.
Meanwhile we will implement a new ARM64-specific huge_ptep_get()
interface in a following patch, which will take into account the
contig-PTEs' dirty and access bits. To keep the same efficient logic for
getting the pte value, switch to using ptep_get() as a preparation.
Zhen Lei [Wed, 11 May 2022 03:20:32 +0000 (11:20 +0800)]
arm64: kdump: Do not allocate crash low memory if not needed
When "crashkernel=X,high" is specified, the specified "crashkernel=Y,low"
memory is not required in the following corner cases:
1. If both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are disabled, it means
that the devices can access any memory.
2. If the system memory is small, the crash high memory may be allocated
from the DMA zones. If that happens, there's no need to allocate
another crash low memory region because there's already one.
Add condition '(crash_base >= CRASH_ADDR_LOW_MAX)' to determine whether
the 'high' memory is allocated above DMA zones. Note: when both
CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are disabled, the entire physical
memory is DMA accessible and CRASH_ADDR_LOW_MAX equals 'PHYS_MASK + 1'.
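A sketch of the resulting check in reserve_crashkernel() (names as used
in the arm64 kdump code; not the exact upstream diff):

  /* Reserve low memory only if the 'high' region really is high. */
  if (crash_base >= CRASH_ADDR_LOW_MAX &&
      crash_low_size && reserve_crashkernel_low(crash_low_size)) {
          memblock_phys_free(crash_base, crash_size);
          return;
  }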
arm64/sve: Make kernel FPU protection RT friendly
Non-RT kernels need to protect the FPU against preemption and bottom
half processing. This is achieved by disabling bottom halves via
local_bh_disable(), which implicitly disables preemption.
On RT kernels this protection mechanism is not sufficient because
local_bh_disable() does not disable preemption. It serializes bottom half
related processing via a CPU local lock.
As bottom halves always run in thread context on RT kernels, disabling
preemption is the proper choice as it implicitly prevents bottom half
processing.
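A minimal sketch of the resulting protection helper (assuming a
__get_cpu_fpsimd_context() helper as in fpsimd.c):

  static void get_cpu_fpsimd_context(void)
  {
          if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                  local_bh_disable();     /* also disables preemption */
          else
                  preempt_disable();      /* BHs run in thread context on RT */
          __get_cpu_fpsimd_context();
  }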
arm64: mm: Make arch_faults_on_old_pte() check for migratability
arch_faults_on_old_pte() relies on the calling context being
non-preemptible. CONFIG_PREEMPT_RT turns the PTE lock into a sleepable
spinlock, which doesn't disable preemption once acquired, triggering the
warning in arch_faults_on_old_pte().
It does however disable migration, ensuring the task remains on the same
CPU during the entirety of the critical section, making the read of
cpu_has_hw_af() safe and stable.
Make arch_faults_on_old_pte() check cant_migrate() instead of preemptible().
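A sketch of the resulting check (cant_migrate() is satisfied both by
disabled preemption and by disabled migration):

  static inline bool arch_faults_on_old_pte(void)
  {
          /* The register read below requires a stable CPU to make sense */
          cant_migrate();

          return !cpu_has_hw_af();
  }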
Robin Murphy [Wed, 11 May 2022 13:12:53 +0000 (14:12 +0100)]
perf/arm-cmn: Decode CAL devices properly in debugfs
The debugfs code is lazy, and since it only keeps the bottom byte of
each connect_info register to save space, it also treats the whole thing
as the device_type since the other bits were reserved anyway. Upon
closer inspection, though, this is no longer true on newer IP versions,
so let's be good and decode the exact field properly. This should help
it not get confused when a Component Aggregation Layer is present (which
is already implied if Node IDs are found for both device addresses
represented by the next two lines of the table).
arm64/hugetlb: Drop TLB flush from get_clear_flush()
This drops the now-redundant TLB flush in get_clear_flush(), which is no
longer required after recent commit 697a1d44af8b ("tlb: hugetlb: Add more
sizes to tlb_remove_huge_tlb_entry"). It also renames the function,
dropping '_flush' and replacing it with '__contig' as appropriate.
Robin Murphy [Tue, 10 May 2022 21:23:08 +0000 (22:23 +0100)]
perf/arm-cmn: Fix filter_sel lookup
Carefully considering the bounds of an array is all well and good,
until you forget that that array also contains a NULL sentinel at
the end and dereference it. So close...
Tanmay Jagdale [Tue, 10 May 2022 10:26:57 +0000 (15:56 +0530)]
perf/marvell_cn10k: Fix tad_pmu_event_init() to check pmu type first
Make sure to check the pmu type first and only then check
event->attr.disabled. Doing so avoids reading the disabled attribute of
an event that is not handled by the TAD PMU.
Zhen Lei [Fri, 6 May 2022 11:44:02 +0000 (19:44 +0800)]
docs: kdump: Update the crashkernel description for arm64
Now arm64 has added support for "crashkernel=X,high" and
"crashkernel=Y,low". Unlike x86, crash low memory is not allocated if
"crashkernel=Y,low" is not specified.
Zhen Lei [Fri, 6 May 2022 11:44:01 +0000 (19:44 +0800)]
of: Support more than one crash kernel regions for kexec -s
When "crashkernel=X,high" is used, there may be two crash regions:
high=crashk_res and low=crashk_low_res. But currently the syscall
kexec_file_load() only adds crashk_res to "linux,usable-memory-range",
which may leave the second kernel with no available DMA memory.
Fix it like kexec-tools does for option -c, add both 'high' and 'low'
regions into the dtb.
Chen Zhou [Fri, 6 May 2022 11:44:00 +0000 (19:44 +0800)]
of: fdt: Add memory for devices by DT property "linux,usable-memory-range"
When reserving crashkernel in high memory, some low memory is reserved
for crash dump kernel devices and never mapped by the first kernel.
This memory range is advertised to crash dump kernel via DT property
under /chosen,
linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
We reuse the DT property "linux,usable-memory-range" and pass the low
memory region as the second range "BASE2 SIZE2", which keeps
compatibility with existing user-space and older kdump kernels.
The crash dump kernel reads this property at boot time and calls
memblock_add() to add the low memory region after
memblock_cap_memory_range() has been called.
Chen Zhou [Fri, 6 May 2022 11:43:59 +0000 (19:43 +0800)]
arm64: kdump: Reimplement crashkernel=X
There are the following issues in arm64 kdump:
1. We use crashkernel=X to reserve crashkernel in the DMA zone, which
will fail when there is not enough low memory.
2. If crashkernel is instead reserved above the DMA zone, the crash
dump kernel will fail to boot because there is no low memory available
for allocation.
To solve these issues, introduce crashkernel=X,[high,low].
The "crashkernel=X,high" is used to select a region above DMA zone, and
the "crashkernel=Y,low" is used to allocate specified size low memory.
Zhen Lei [Fri, 6 May 2022 11:43:58 +0000 (19:43 +0800)]
arm64: Use insert_resource() to simplify code
insert_resource() traverses the subtree layer by layer from the root node
until a proper location is found. Compared with request_resource(), the
parent node does not need to be determined in advance.
In addition, move the insertion of node 'crashk_res' into function
reserve_crashkernel() to make the associated code close together.
Zhen Lei [Fri, 6 May 2022 11:43:57 +0000 (19:43 +0800)]
kdump: return -ENOENT if required cmdline option does not exist
According to the current crashkernel=Y,low support in other ARCHes, it's
an optional command-line option. When it doesn't exist, kernel will try
to allocate minimum required memory below 4G automatically.
However, __parse_crashkernel() returns '-EINVAL' for all error cases. It
can't distinguish the nonexistent option from invalid option.
Change __parse_crashkernel() to return '-ENOENT' for the nonexistent option
case. With this change, crashkernel,low memory will take the default
value if crashkernel=,low is not specified; while crashkernel reservation
will fail and bail out if an invalid option is specified.
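A caller can then tell the two cases apart, roughly like this (a sketch;
DEFAULT_CRASH_KERNEL_LOW_SIZE stands in for whatever default the
architecture chooses):

  ret = parse_crashkernel_low(cmdline, 0, &crash_low_size, &crash_base);
  if (ret == -ENOENT)
          crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE; /* not given */
  else if (ret)
          return; /* invalid option: fail the reservation */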
Mark Brown [Thu, 5 May 2022 22:15:17 +0000 (23:15 +0100)]
arm64/sme: More sensibly define the size for the ZA register set
Since the vector length configuration mechanism is identical between SVE
and SME we share large elements of the code including the definition for
the maximum vector length. Unfortunately when we were defining the ABI
for SVE we included not only the actual maximum vector length of 2048
bits but also the value possible if all the bits reserved in the
architecture for expansion of the LEN field were used, 16384 bits.
This starts creating problems if we try to allocate anything for the ZA
matrix based on the maximum possible vector length, as we do for the
regset used with ptrace during the process of generating a core dump.
While the maximum potential size for ZA with the current architecture is
a reasonably manageable 64K, with the higher reserved limit ZA would be
64M, which leads to entirely reasonable complaints from the memory
management code when we try to allocate a buffer of that size. Avoid
these issues by defining the actual maximum vector length for the
architecture and using it for the SME regsets.
Also use the full ZA_PT_SIZE() with the header rather than just the
actual register payload when specifying the size, fixing support for the
largest vector lengths now that we have this new, lower define. With the
SVE maximum this did not cause problems due to the extra headroom we
had.
While we're at it add a comment clarifying why even though ZA is a
single register we tell the regset code that it is a multi-register
regset.
Qi Liu [Fri, 15 Apr 2022 10:23:52 +0000 (18:23 +0800)]
drivers/perf: hisi: Add Support for CPA PMU
On HiSilicon Hip09 platform, there is a CPA (Coherency Protocol Agent) on
each SICL (Super IO Cluster) which implements packet format translation,
route parsing and traffic statistics.
CPA PMU has 8 PMU counters and interrupt is supported to handle counter
overflow. Let's support its driver under the framework of HiSilicon PMU
driver.
Qi Liu [Fri, 15 Apr 2022 10:23:51 +0000 (18:23 +0800)]
drivers/perf: hisi: Associate PMUs in SICL with CPUs online
If a PMU is in a SICL (Super IO cluster), it is not appropriate to
associate this PMU with a CPU die. So we associate it with all CPUs
online, rather than CPUs in the nearest SCCL.
As the firmware of Hip09 platform hasn't been published yet, change
of PMU driver will not influence backwards compatibility between
driver and firmware.
Robin Murphy [Mon, 18 Apr 2022 22:57:41 +0000 (23:57 +0100)]
perf/arm-cmn: Add CMN-700 support
Add the identifiers, events, and subtleties for CMN-700. Highlights
include yet more options for doubling up CHI channels, which finally
grows event IDs beyond 8 bits for XPs, and a new set of CML gateway
nodes adding support for CXL as well as CCIX, where the Link Agent is
now internal to the CMN mesh so we gain regular PMU events for that too.
Robin Murphy [Mon, 18 Apr 2022 22:57:40 +0000 (23:57 +0100)]
perf/arm-cmn: Refactor occupancy filter selector
So far, DNs and HN-Fs have each had one event related to occupancy
trackers which are filtered by a separate field. CMN-700 raises the
stakes by introducing two more sets of HN-F events with corresponding
additional filter fields. Prepare for this by refactoring our filter
selection and tracking logic to account for multiple filter types
coexisting on the same node. This need not affect the uAPI, which can
just continue to encode any per-event filter setting in the "occupid"
config field, even if it's technically not the most accurate name for
some of them.
Robin Murphy [Mon, 18 Apr 2022 22:57:39 +0000 (23:57 +0100)]
perf/arm-cmn: Add CMN-650 support
Add the identifiers and events for CMN-650, which slots into its
evolutionary position between CMN-600 and the 700-series products.
Imagine CMN-600 made bigger, and with most of the rough edges smoothed
off, but that then balanced out by some bonkers PMU functionality for
the new HN-P enhancement in CMN-650r2.
Most of the CXG events are actually common to newer revisions of CMN-600
too, so they're arguably a little late; oh well.
Robin Murphy [Mon, 18 Apr 2022 22:57:38 +0000 (23:57 +0100)]
dt-bindings: perf: arm-cmn: Add CMN-650 and CMN-700
If you were to guess from the product names that CMN-650 and CMN-700 are
the next two evolutionary steps of Arm's enterprise-level interconnect
following on from CMN-600, you'd be pleasantly correct. Add them to the
DT binding.
In copy_highpage() the `kto` and `kfrom` local variables are pointers to
struct page, but these are used to hold arbitrary pointers to kernel
memory. Each call to page_address() returns a void pointer to memory
associated with the relevant page, and copy_page() expects void pointers
to this memory.
This inconsistency was introduced in commit 2563776b41c3 ("arm64: mte:
Tags-aware copy_{user_,}highpage() implementations") and while this
doesn't appear to be harmful in practice it is clearly wrong.
Correct this by making `kto` and `kfrom` void pointers.
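A sketch of the corrected function (MTE tag handling elided):

  void copy_highpage(struct page *to, struct page *from)
  {
          void *kto = page_address(to);   /* was: struct page *kto */
          void *kfrom = page_address(from);

          copy_page(kto, kfrom);
          /* ... MTE tag copying, when enabled, follows here ... */
  }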
Hector Martin [Mon, 2 May 2022 09:14:27 +0000 (18:14 +0900)]
arm64: Set ARCH_NR_GPIO to 2048 for ARCH_APPLE
We're already running into the 512 GPIO limit on t600[01] depending on
how many SMC GPIOs we allocate, and a 2-die version could double that.
Let's make it 2K to be safe for now.
Michal Orzel [Tue, 26 Apr 2022 07:06:03 +0000 (09:06 +0200)]
arm64: cputype: Avoid overflow using MIDR_IMPLEMENTOR_MASK
The value of the macro MIDR_IMPLEMENTOR_MASK exceeds the range of an
integer and can lead to overflow. Currently there is no issue as it is
used in expressions that implicitly cast it to u32. To avoid possible
problems, fix the macro.
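One way to make the macro overflow-safe (the exact upstream spelling may
differ) is to force an unsigned constant:

  #define MIDR_IMPLEMENTOR_SHIFT  24
  /* 0xff << 24 does not fit in a signed int; use an unsigned constant */
  #define MIDR_IMPLEMENTOR_MASK   (0xffU << MIDR_IMPLEMENTOR_SHIFT)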
arm64/ftrace: Make function graph use ftrace directly
As was done in commit 0c0593b45c9b ("x86/ftrace: Make function graph
use ftrace directly"), we don't need a special hook for the graph tracer;
instead we use the graph_ops:func function to install return_hooker.
Since commit 3b23e4991fb6 ("arm64: implement ftrace with regs") added an
implementation of FTRACE_WITH_REGS on arm64, we can easily adopt the
same cleanup there.
And this cleanup only changes the FTRACE_WITH_REGS implementation,
so the mcount-based implementation is unaffected.
While in theory it would be possible to make a similar cleanup for
!FTRACE_WITH_REGS, this will require rework of the core code, and
so for now we only change the FTRACE_WITH_REGS implementation.
ftrace: cleanup ftrace_graph_caller enable and disable
The ftrace_[enable,disable]_ftrace_graph_caller() functions are used to
install special hooks for the graph tracer, which are not needed on
ARCHs that use the graph_ops:func function to install return_hooker.
So introduce weak versions in the ftrace core code and clean up the x86
ones.
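The weak fallbacks might look roughly like this (a sketch):

  /* ARCHs that use graph_ops:func need no special enable/disable hooks */
  int __weak ftrace_enable_ftrace_graph_caller(void)
  {
          return 0;
  }

  int __weak ftrace_disable_ftrace_graph_caller(void)
  {
          return 0;
  }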
Mark Brown [Wed, 27 Apr 2022 13:08:28 +0000 (14:08 +0100)]
arm64/sme: Add ID_AA64SMFR0_EL1 to __read_sysreg_by_encoding()
We need to explicitly enumerate all the ID registers which we rely on
for CPU capabilities in __read_sysreg_by_encoding(), ID_AA64SMFR0_EL1 was
missed from this list so we trip a BUG() in paths which rely on that
function such as CPU hotplug. Add the register.
arm64: document the boot requirements for MTE
When booting the kernel we access system registers such as GCR_EL1
if MTE is supported. These accesses are defined to trap to EL3 if
SCR_EL3.ATA is disabled. Furthermore, tag accesses will not behave
as expected if SCR_EL3.ATA is not set, or if HCR_EL2.ATA is not set
and we were booted at EL1. Therefore, require that these bits are
enabled when appropriate.
btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults
Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search
ioctl") addressed a lockdep warning by pre-faulting the user pages and
attempting the copy_to_user_nofault() in an infinite loop. On
architectures like arm64 with MTE, an access may fault within a page at
a location different from what fault_in_writeable() probed. Since the
sk_offset is rewound to the previous struct btrfs_ioctl_search_header
boundary, there is no guaranteed forward progress and search_ioctl() may
live-lock.
Use fault_in_subpage_writeable() instead of fault_in_writeable() to
ensure the permission is checked at the right granularity (smaller than
PAGE_SIZE).
arm64: Add support for user sub-page fault probing
With MTE, even if the pte allows an access, a mismatched tag somewhere
within a page can still cause a fault. Select ARCH_HAS_SUBPAGE_FAULTS if
MTE is enabled and implement the probe_subpage_writeable() function.
Note that get_user() is sufficient for the writeable MTE check since the
same tag mismatch fault would be triggered by a read. The caller of
probe_subpage_writeable() will need to check the pte permissions
(put_user, GUP).
mm: Add fault_in_subpage_writeable() to probe at sub-page granularity
On hardware with features like arm64 MTE or SPARC ADI, an access fault
can be triggered at sub-page granularity. Depending on how the
fault_in_writeable() function is used, the caller can get into a
live-lock by continuously retrying the fault-in on an address different
from the one where the uaccess failed.
In the majority of cases progress is ensured by the following
conditions:
1. copy_to_user_nofault() guarantees at least one byte access if the
user address is not faulting.
2. The fault_in_writeable() loop is resumed from the first address that
could not be accessed by copy_to_user_nofault().
If the loop iteration is restarted from an earlier (initial) point, the
loop is repeated with the same conditions and it would live-lock.
Introduce an arch-specific probe_subpage_writeable() and call it from
the newly added fault_in_subpage_writeable() function. The arch code
with sub-page faults will have to implement the specific probing
functionality.
Note that no other fault_in_subpage_*() functions are added since they
have no callers currently susceptible to a live-lock.
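The combined helper might take roughly this shape (a sketch, not the
exact upstream code):

  size_t fault_in_subpage_writeable(char __user *uaddr, size_t size)
  {
          size_t faulted_in;

          /* Page-granular fault-in first, for the pte permission checks */
          faulted_in = size - fault_in_writeable(uaddr, size);

          /* Then probe the faulted-in range at sub-page granularity */
          if (faulted_in)
                  faulted_in -= probe_subpage_writeable(uaddr, faulted_in);

          /* Like fault_in_writeable(): 0 on success, else bytes not probed */
          return size - faulted_in;
  }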
Mark Brown [Tue, 19 Apr 2022 11:22:35 +0000 (12:22 +0100)]
arm64/sme: Provide Kconfig for SME
Now that baseline support for the Scalable Matrix Extension (SME) is
present, introduce the Kconfig option allowing it to be built. While the
feature registers don't impose a strong requirement for a system with
SME to support SVE at runtime, the support for streaming mode SVE is
mostly shared with normal SVE, so depend on SVE.
Mark Brown [Tue, 19 Apr 2022 11:22:34 +0000 (12:22 +0100)]
KVM: arm64: Handle SME host state when running guests
While we don't currently support SME in guests we do currently support it
for the host system so we need to take care of SME's impact, including
the floating point register state, when running guests. Similarly to SVE
we need to manage the traps in CPACR_EL1; what is new is the handling of
streaming mode and ZA.
Normally we defer any handling of the floating point register state until
the guest first uses it. However, if the system is in streaming mode,
FPSIMD and SVE operations may generate SME traps, which we would need to
distinguish from actual attempts by the guest to use SME. Rather than do
this, for the time being, if we are in streaming mode when entering the
guest we force the floating point state to be saved immediately and exit
streaming mode, meaning that the guest won't generate SME traps for
supported operations.
We could handle ZA in the access trap similarly to the FPSIMD/SVE state
without the disruption caused by streaming mode but for simplicity
handle it the same way as streaming mode for now.
This will be revisited when we support SME for guests (hopefully before SME
hardware becomes available), for now it will only incur additional cost on
systems with SME and even there only if streaming mode or ZA are enabled.
Mark Brown [Tue, 19 Apr 2022 11:22:33 +0000 (12:22 +0100)]
KVM: arm64: Trap SME usage in guest
SME defines two new traps which need to be enabled for guests to ensure
that they can't use SME, one for the main SME operations which mirrors the
traps for SVE and another for access to TPIDR2 in SCTLR_EL2.
For VHE manage SMEN along with ZEN in activate_traps() and the FP state
management callbacks, along with SCTLR_EL2.EnTPIDR2. There is no
existing dynamic management of SCTLR_EL2.
For nVHE manage TSM in activate_traps() along with the fine grained
traps for TPIDR2 and SMPRI. There is no existing dynamic management of
fine grained traps.
Mark Brown [Tue, 19 Apr 2022 11:22:32 +0000 (12:22 +0100)]
KVM: arm64: Hide SME system registers from guests
For the time being we do not support use of SME by KVM guests, support for
this will be enabled in future. In order to prevent any side effects or
side channels via the new system registers, including the EL0 read/write
register TPIDR2, explicitly undefine all the system registers added by
SME and mask out the SME bitfield in SYS_ID_AA64PFR1.
Mark Brown [Tue, 19 Apr 2022 11:22:31 +0000 (12:22 +0100)]
arm64/sme: Save and restore streaming mode over EFI runtime calls
When saving and restoring the floating point state over an EFI runtime
call ensure that we handle streaming mode, only handling FFR if we are not
in streaming mode and ensuring that we are in normal mode over the call
into runtime services.
We currently assume that ZA will not be modified by runtime services, the
specification is not yet finalised so this may need updating if that
changes.
Mark Brown [Tue, 19 Apr 2022 11:22:30 +0000 (12:22 +0100)]
arm64/sme: Disable streaming mode and ZA when flushing CPU state
Both streaming mode and ZA may increase power consumption when they are
enabled and streaming mode makes many FPSIMD and SVE instructions undefined
which will cause problems for any kernel mode floating point so disable
both when we flush the CPU state. This covers both kernel_neon_begin() and
idle, and after flushing the state a reload is always required anyway.
Mark Brown [Tue, 19 Apr 2022 11:22:29 +0000 (12:22 +0100)]
arm64/sme: Add ptrace support for ZA
The ZA array can be read and written with the NT_ARM_ZA regset. Similarly to
our interface for the SVE vector registers the regset consists of a
header with information on the current vector length followed by an
optional register data payload, represented (as for signals) as a series
of horizontal vectors from 0 to VL/8 in the endianness independent
format used for vectors.
On get if ZA is enabled then register data will be provided, otherwise
it will be omitted. On set if register data is provided then ZA is
enabled and initialized using the provided data, otherwise it is
disabled.
Mark Brown [Tue, 19 Apr 2022 11:22:28 +0000 (12:22 +0100)]
arm64/sme: Implement ptrace support for streaming mode SVE registers
The streaming mode SVE registers are represented using the same data
structures as for SVE but since the vector lengths supported and in use
may not be the same as SVE we represent them with a new type NT_ARM_SSVE.
Unfortunately we only have a single 16 bit reserved field available in
the header so there is no space to fit the current and maximum vector
length for both standard and streaming SVE mode without redefining the
structure in a way that creates a complicated and fragile ABI. Since FFR
is not present in streaming mode it is read and written as zero.
Setting NT_ARM_SSVE registers will put the task into streaming mode,
similarly setting NT_ARM_SVE registers will exit it. Reads that do not
correspond to the current mode of the task will return the header with
no register data. For compatibility reasons, on write, setting no flag
for the register type will be interpreted as setting SVE registers, though
users can provide no register data as an alternative mechanism for doing
so.
Mark Brown [Tue, 19 Apr 2022 11:22:27 +0000 (12:22 +0100)]
arm64/sme: Implement ZA signal handling
Implement support for ZA in signal handling in a very similar way to how
we implement support for SVE registers, using a signal context structure
with optional register state after it. Where present this register state
stores the ZA matrix as a series of horizontal vectors numbered from 0 to
VL/8 in the endianness independent format used for vectors.
As with SVE we do not allow changes in the vector length during signal
return but we do allow ZA to be enabled or disabled.
Mark Brown [Tue, 19 Apr 2022 11:22:26 +0000 (12:22 +0100)]
arm64/sme: Implement streaming SVE signal handling
When in streaming mode we have the same set of SVE registers as we do in
regular SVE mode with the exception of FFR and the use of the SME vector
length. Provide signal handling for these registers by taking one of the
reserved words in the SVE signal context as a flags field and defining a
flag which is set for streaming mode. When the flag is set the vector
length is set to the streaming mode vector length and we save and
restore streaming mode data. We support entering or leaving streaming
mode based on the value of the flag but do not support changing the
vector length; this is not currently supported in SVE signal handling
either.
We could instead allocate a separate record in the signal frame for the
streaming mode SVE context but this inflates the size of the maximal signal
frame required and adds complication when validating signal frames from
userspace, especially given the current structure of the code.
Any implementation of support for streaming mode vectors in signals will
have some potential for causing issues for applications that attempt to
handle SVE vectors in signals and use streaming mode but do not
understand streaming mode in their signal handling code. It is hard to
identify a case that is clearly better than any other - they all have
cases where they could cause unexpected register corruption or faults.
Mark Brown [Tue, 19 Apr 2022 11:22:25 +0000 (12:22 +0100)]
arm64/sme: Disable ZA and streaming mode when handling signals
The ABI requires that streaming mode and ZA are disabled when invoking
signal handlers, do this in setup_return() when we prepare the task state
for the signal handler.
Mark Brown [Tue, 19 Apr 2022 11:22:24 +0000 (12:22 +0100)]
arm64/sme: Implement traps and syscall handling for SME
By default all SME operations in userspace will trap. When this happens
we allocate storage space for the SME register state, set up the SVE
registers and disable traps. We do not need to initialize ZA since the
architecture guarantees that it will be zeroed when enabled and when we
trap ZA is disabled.
On syscall we exit streaming mode if we were previously in it and ensure
that all but the lower 128 bits of the registers are zeroed while
preserving the state of ZA. This follows the AArch64 PCS for SME: ZA
state is preserved over a function call and streaming mode is exited.
Since the traps for SME do not distinguish between streaming mode SVE
and ZA usage, if ZA is in use then rather than re-enabling traps we
instead zero the parts of the SVE registers not shared with FPSIMD and
leave SME enabled; this simplifies handling SME traps. If ZA is not in
use then we re-enable SME traps and fall through to normal handling of
SVE.
Mark Brown [Tue, 19 Apr 2022 11:22:23 +0000 (12:22 +0100)]
arm64/sme: Implement ZA context switching
Allocate space for storing ZA on first access to SME and use that to save
and restore ZA state when context switching. We do this by using the vector
form of the LDR and STR ZA instructions, these do not require streaming
mode and have implementation recommendations that they avoid contention
issues in shared SMCU implementations.
Since ZA is architecturally guaranteed to be zeroed when enabled we do not
need to explicitly zero ZA, either we will be restoring from a saved copy
or trapping on first use of SME so we know that ZA must be disabled.
Mark Brown [Tue, 19 Apr 2022 11:22:22 +0000 (12:22 +0100)]
arm64/sme: Implement streaming SVE context switching
When in streaming mode we need to save and restore the streaming mode
SVE register state rather than the regular SVE register state. This uses
the streaming mode vector length and omits FFR but is otherwise identical,
if TIF_SVE is enabled when we are in streaming mode then streaming mode
takes precedence.
This does not handle use of streaming SVE state with KVM, ptrace or
signals. This will be updated in further patches.
Mark Brown [Tue, 19 Apr 2022 11:22:21 +0000 (12:22 +0100)]
arm64/sme: Implement SVCR context switching
In SME the use of both streaming SVE mode and ZA are tracked through
PSTATE.SM and PSTATE.ZA, visible through the system register SVCR. In
order to context switch the floating point state for SME we need to
context switch the contents of this register as part of context
switching the floating point state.
Since changing the vector length exits streaming SVE mode and disables
ZA we also make sure we update SVCR appropriately when setting vector
length, and similarly ensure that new threads have streaming SVE mode
and ZA disabled.
Mark Brown [Tue, 19 Apr 2022 11:22:20 +0000 (12:22 +0100)]
arm64/sme: Implement support for TPIDR2
The Scalable Matrix Extension introduces support for a new thread specific
data register TPIDR2 intended for use by libc. The kernel must save the
value of TPIDR2 on context switch and should ensure that all new threads
start off with a default value of 0. Add a field to the thread_struct to
store TPIDR2 and context switch it with the other thread specific data.
In case there are future extensions which also use TPIDR2 we introduce
system_supports_tpidr2() and use that rather than system_supports_sme()
for TPIDR2 handling.
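For now the new helper is just a trivial wrapper (a sketch):

  static __always_inline bool system_supports_tpidr2(void)
  {
          /* Today TPIDR2 comes only with SME; a future feature may change this */
          return system_supports_sme();
  }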
Mark Brown [Tue, 19 Apr 2022 11:22:17 +0000 (12:22 +0100)]
arm64/sme: Identify supported SME vector lengths at boot
The vector lengths used for SME are controlled through a similar set of
registers to those for SVE and enumerated using a similar algorithm with
some slight differences due to the fact that unlike SVE there are no
restrictions on which combinations of vector lengths can be supported
nor any mandatory vector lengths which must be implemented. Add a new
vector type and implement support for enumerating it.
One slightly awkward feature is that we need to read the current vector
length using a different instruction (or enter streaming mode which
would have the same issue and be higher cost). Rather than add an ops
structure we add special cases directly in the otherwise generic
vec_probe_vqs() function; this is a bit inelegant but it's the only
place where this is an issue.
Mark Brown [Tue, 19 Apr 2022 11:22:15 +0000 (12:22 +0100)]
arm64/sme: Early CPU setup for SME
SME requires similar setup to that for SVE: disable traps to EL2 and
make sure that the maximum vector length is available to EL1, for SME we
have two traps - one for SME itself and one for TPIDR2.
In addition since we currently make no active use of priority control
for SCMUs, we map all SME priorities that lower ELs may configure to 0, the
architecture specified minimum priority, to ensure that nothing we
manage is able to configure itself to consume excessive resources. This
will need to be revisited should there be a need to manage SME
priorities at runtime.
Mark Brown [Tue, 19 Apr 2022 11:22:14 +0000 (12:22 +0100)]
arm64/sme: Manually encode SME instructions
As with SVE rather than impose ambitious toolchain requirements for SME
we manually encode the few instructions which we require in order to
perform the work the kernel needs to do. The instructions used to save
and restore context are provided as assembler macros while those for
entering and leaving streaming mode are done in asm volatile blocks
since they are expected to be used from C.
We could do the SMSTART and SMSTOP operations with read/modify/write
cycles on SVCR but using the aliases provided for individual field
accesses should be slightly faster. These instructions are aliases for
MSR but since our minimum toolchain requirements are old enough to mean
that we can't use the sX_X_cX_cX_X form, and they always use xzr rather
than taking a value as write_sysreg_s() wants, we just use .inst.
Mark Brown [Tue, 19 Apr 2022 11:22:13 +0000 (12:22 +0100)]
arm64/sme: System register and exception syndrome definitions
The arm64 Scalable Matrix Extension (SME) adds some new system registers,
fields in existing system registers and exception syndromes. This patch
adds definitions for these for use in future patches implementing support
for this extension.
Since SME will be the first user of FEAT_HCX in the kernel also include
the definitions for enumerating it and the HCRX system register it adds.
Mark Brown [Tue, 19 Apr 2022 11:22:12 +0000 (12:22 +0100)]
arm64/sme: Provide ABI documentation for SME
Provide ABI documentation for SME similar to that for SVE. Due to the very
large overlap around streaming SVE mode in both implementation and
interfaces documentation for streaming mode SVE is added to the SVE
document rather than the SME one.
arm64: stacktrace: align with common naming
For historical reasons, the naming of parameters and their types in the
arm64 stacktrace code differs from that used in generic code and other
architectures, even though the types are equivalent.
For consistency and clarity, use the generic names.
There should be no functional change as a result of this patch.
Mark Rutland [Wed, 13 Apr 2022 14:59:07 +0000 (15:59 +0100)]
arm64: stacktrace: make struct stackframe private to stacktrace.c
Now that arm64 uses arch_stack_walk() consistently, struct stackframe is
only used within stacktrace.c. To make it easier to read and maintain
this code, it would be nicer if the definition were there too.
Move the definition into stacktrace.c.
There should be no functional change as a result of this patch.
Mark Rutland [Wed, 13 Apr 2022 14:59:06 +0000 (15:59 +0100)]
arm64: stacktrace: delete PCS comment
The comment at the top of stacktrace.c isn't all that helpful, as it's
not associated with the code which inspects the frame record, and the
code example isn't representative of common code generation today.
Delete it.
There should be no functional change as a result of this patch.
arm64/mm: Compute PTRS_PER_[PMD|PUD] independently of PTRS_PER_PTE
The number of possible page table entries (or pointers) at non-zero page
table levels depends only on a single page size, i.e. PAGE_SIZE, and the
size required for each individual page table entry, i.e. 8 bytes.
PTRS_PER_[PMD|PUD] as such are not related to PTRS_PER_PTE in any manner,
as is currently implied. So let's make this very explicit and compute
these macros independently.
Merge tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixlet from Juergen Gross:
"A single cleanup patch for the Xen balloon driver"
* tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: don't use PV mode extra memory for zone device allocations
Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two x86 fixes related to TSX:
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX
to cover all CPUs which allow to disable it.
- Disable TSX development mode at boot so that a microcode update
which provides TSX development mode does not suddenly make the
system vulnerable to TSX Asynchronous Abort"
* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Disable TSX development mode at boot
x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
Merge tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of fixes for the timers core:
- Fix the warning condition in __run_timers() which does not take
into account that a CPU base (especially the deferrable base) never
has a timer armed on it and therefore the next_expiry value can
become stale.
- Replace a WARN_ON() in the NOHZ code with a WARN_ON_ONCE() to
prevent endless spam in dmesg.
- Remove the double star from a comment which is not meant to be in
kernel-doc format"
* tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Fix non-kernel-doc comment
tick/nohz: Use WARN_ON_ONCE() to prevent console saturation
timers: Fix warning condition in __run_timers()
Merge tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP fixes from Thomas Gleixner:
"Two fixes for the SMP core:
- Make the warning condition in flush_smp_call_function_queue()
correct, which checked a just emptied list head for being empty
instead of validating that there was no pending entry on the
offlined CPU at all.
- The @cpu member of struct cpuhp_cpu_state is initialized when the
CPU hotplug thread for the upcoming CPU is created. That's too late
because the creation of the thread can fail and then the following
rollback operates on CPU0. Get rid of the CPU member and hand the
CPU number to the involved functions directly"
* tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state
smp: Fix offline cpu check in flush_smp_call_function_queue()
Merge tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for the interrupt affinity spreading logic to take into
account that there can be an imbalance between present and possible
CPUs, which causes already assigned bits to be overwritten"
* tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Consider that CPUs on nodes can be unbalanced
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Regular set of fixes for drivers and the dev-interface"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: ismt: Fix undefined behavior due to shift overflowing the constant
i2c: dev: Force case user pointers in compat_i2cdev_ioctl()
i2c: dev: check return value when calling dev_set_name()
i2c: qcom-geni: Use dev_err_probe() for GPI DMA error
i2c: imx: Implement errata ERR007805 or e7805 bus frequency limit
i2c: pasemi: Wait for write xfers to finish
Merge tag 'gpio-fixes-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"A single fix for gpio-sim and two patches for GPIO ACPI pulled from
Andy:
- fix the set/get_multiple() callbacks in gpio-sim
- use correct format characters in gpiolib-acpi
- use an unsigned type for pins in gpiolib-acpi"
* tag 'gpio-fixes-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: sim: fix setting and getting multiple lines
gpiolib: acpi: Convert type for pin to be unsigned
gpiolib: acpi: use correct format characters
Merge tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator fixes from Jason Donenfeld:
- Per your suggestion, random reads now won't fail if there's a page
fault after some non-zero amount of data has been read, which makes
the behavior consistent with all other reads in the kernel.
- Rather than an inconsistent mix of random_get_entropy() returning an
unsigned long or a cycles_t, now it just returns an unsigned long.
- A memcpy() was replaced with an memmove(), because the addresses are
sometimes overlapping. In practice the destination is always before
the source, so not really an issue, but better to be correct than
not.
* tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: use memmove instead of memcpy for remaining 32 bytes
random: make random_get_entropy() return an unsigned long
random: allow partial reads if later user copies fail
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"13 fixes, all in drivers.
The most extensive changes are in the iscsi series (affecting drivers
qedi, cxgbi and bnx2i), the next most is scsi_debug, but that's just a
simple revert and then minor updates to pm80xx"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: iscsi: MAINTAINERS: Add Mike Christie as co-maintainer
scsi: qedi: Fix failed disconnect handling
scsi: iscsi: Fix NOP handling during conn recovery
scsi: iscsi: Merge suspend fields
scsi: iscsi: Fix unbound endpoint error handling
scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
scsi: iscsi: Fix endpoint reuse regression
scsi: iscsi: Release endpoint ID when its freed
scsi: iscsi: Fix offload conn cleanup when iscsid restarts
scsi: iscsi: Move iscsi_ep_disconnect()
scsi: pm80xx: Enable upper inbound, outbound queues
scsi: pm80xx: Mask and unmask upper interrupt vectors 32-63
Revert "scsi: scsi_debug: Address races following module load"
random: use memmove instead of memcpy for remaining 32 bytes
In order to immediately overwrite the old key on the stack, before
servicing a userspace request for bytes, we use the remaining 32 bytes
of block 0 as the key. This means moving indices 8,9,a,b,c,d,e,f ->
4,5,6,7,8,9,a,b. Since 4 < 8, for the kernel implementations of
memcpy(), this doesn't actually appear to be a problem in practice. But
relying on that characteristic seems a bit brittle. So let's change that
to a proper memmove(), which is the by-the-books way of handling
overlapping memory copies.
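Concretely, the overlapping copy has roughly this shape (a sketch using
the surrounding ChaCha state naming):

  u32 chacha_state[16];

  /* Words 8..15 become the new key at words 4..11; the ranges overlap */
  memmove(&chacha_state[4], &chacha_state[8], 8 * sizeof(u32));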
Subsystems affected by this patch series: MAINTAINERS, binfmt, and
mm (tmpfs, secretmem, kasan, kfence, pagealloc, zram, compaction,
hugetlb, vmalloc, and kmemleak)"
* emailed patches from Andrew Morton <[email protected]>:
mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
hugetlb: do not demote poisoned hugetlb pages
mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
mm: fix unexpected zeroed page mapping with zram swap
mm, page_alloc: fix build_zonerefs_node()
mm, kfence: support kmem_dump_obj() for KFENCE objects
kasan: fix hw tags enablement when KUNIT tests are disabled
irq_work: use kasan_record_aux_stack_noalloc() record callstack
mm/secretmem: fix panic when growing a memfd_secret
tmpfs: fix regressions from wider use of ZERO_PAGE
MAINTAINERS: Broadcom internal lists aren't maintainers
Merge tag 'for-5.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix memory corruption in DM integrity target when tag_size is less
than digest size.
- Fix DM multipath's historical-service-time path selector to not use
sched_clock() and ktime_get_ns(); only use ktime_get_ns().
- Fix dm_io->orig_bio NULL pointer dereference in dm_zone_map_bio() due
to 5.18 changes that overlooked DM zone's use of ->orig_bio
- Fix for regression that broke the use of dm_accept_partial_bio() for
"abnormal" IO (e.g. WRITE ZEROES) that does not need duplicate bios
- Fix DM's issuing of empty flush bio so that it's size is 0.
* tag 'for-5.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix bio length of empty flush
dm: allow dm_accept_partial_bio() for dm_io without duplicate bios
dm zone: fix NULL pointer dereference in dm_zone_map_bio
dm mpath: only use ktime_get_ns() in historical selector
dm integrity: fix memory corruption when tag_size is less than digest size
Patrick Wang [Fri, 15 Apr 2022 02:14:04 +0000 (19:14 -0700)]
mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
The kmemleak_*_phys() apis do not check the address for lowmem's min
boundary, while the caller may pass an address below lowmem, which will
trigger an oops:
The callers may not quite know the actual address they pass (e.g. from
devicetree). So the kmemleak_*_phys() apis should guarantee that the
address they finally use is in the lowmem range, by checking the address
against lowmem's min boundary.
mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
purge the vmap areas instead of doing it lazily.
Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever). For example, consider the following scenario:
1. Thread reads from /proc/vmcore. This eventually calls
__copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
vmap_lazy_nr to lazy_max_pages() + 1.
2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
pages (one page plus the guard page) to the purge list and
vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
drain_vmap_work is scheduled.
3. Thread returns from the kernel and is scheduled out.
4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
frees the 2 pages on the purge list. vmap_lazy_nr is now
lazy_max_pages() + 1.
5. This is still over the threshold, so it tries to purge areas again,
but doesn't find anything.
6. Repeat 5.
If the system is running with only one CPU (which is typical for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs. If there is more
than one CPU or preemption is enabled, then the worker thread will spin
forever in the background. (Note that if there were already pages to be
purged at the time that set_iounmap_nonlazy() was called, this bug is
avoided.)
This can be reproduced with anything that reads from /proc/vmcore
multiple times. E.g., vmcore-dmesg /proc/vmcore.
It turns out that improvements to vmap() over the years have obsoleted
the need for this "optimization". I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:
Andrew Morton [Fri, 15 Apr 2022 02:13:55 +0000 (19:13 -0700)]
revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
Commit 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for
loaders") was an attempt to fix regressions due to 9630f0d60fec5f
("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").
Mike Kravetz [Fri, 15 Apr 2022 02:13:52 +0000 (19:13 -0700)]
hugetlb: do not demote poisoned hugetlb pages
It is possible for poisoned hugetlb pages to reside on the free lists.
The huge page allocation routines which dequeue entries from the free
lists make a point of avoiding poisoned pages. There is no such check
and avoidance in the demote code path.
If a hugetlb page is on a free list, poison will only be set in the head
page rather than in the page with the actual error. If such a page is
demoted, then the poison flag may follow the wrong page: a page without
the error could have poison set, and the page with the actual error could
be missing the flag.
Check for poison before attempting to demote a hugetlb page. Also,
return -EBUSY to the caller if only poisoned pages are on the free list.
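The demote path can then skip poisoned entries and report -EBUSY when
nothing else remains, roughly (a sketch of the shape of the fix):

  list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
          if (PageHWPoison(page))
                  continue;       /* never demote a poisoned page */

          return demote_free_huge_page(h, page);
  }

  /* Only poisoned pages are left on the free list: don't retry */
  return -EBUSY;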