Git Repo - linux.git/log

arm64: convert syscall trace logic to C

Currently syscall tracing is a tricky assembly state machine, which can
be rather difficult to follow, and even harder to modify. Before we
start fiddling with it for pt_regs syscalls, let's convert it to C.

This is not intended to have any functional change.

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: convert raw syscall invocation to C

As a first step towards invoking syscalls with a pt_regs argument,
convert the raw syscall invocation logic to C. We end up with a bit more
register shuffling, but the unified invocation logic means we can unify
the tracing paths, too.

Previously, assembly had to open-code calls to ni_sys() when the system
call number was out-of-bounds for the relevant syscall table. This case
is now handled by invoke_syscall(), and the assembly no longer need to
handle this case explicitly. This allows the tracing paths to be
simplified and unified, as we no longer need the __ni_sys_trace path and
the __sys_trace_return label.

This only converts the invocation of the syscall. The rest of the
syscall triage and tracing is left in assembly for now, and will be
converted in subsequent patches.

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: introduce syscall_fn_t

In preparation for invoking arbitrary syscalls from C code, let's define
a type for an arbitrary syscall, matching the parameter passing rules of
the AAPCS.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: remove sigreturn wrappers

The arm64 sigreturn* syscall handlers are non-standard. Rather than
taking a number of user parameters in registers as per the AAPCS,
they expect the pt_regs as their sole argument.

To make this work, we override the syscall definitions to invoke
wrappers written in assembly, which mov the SP into x0, and branch to
their respective C functions.

On other architectures (such as x86), the sigreturn* functions take no
argument and instead use current_pt_regs() to acquire the user
registers. This requires less boilerplate code, and allows for other
features such as interposing C code in this path.

This patch takes the same approach for arm64.

Signed-off-by: Mark Rutland <[email protected]>
Tentatively-reviewed-by: Dave Martin <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: move sve_user_{enable,disable} to <asm/fpsimd.h>

In subsequent patches, we'll want to make use of sve_user_enable() and
sve_user_disable() outside of kernel/fpsimd.c. Let's move these to
<asm/fpsimd.h> where we can make use of them.

To avoid ifdeffery in sequences like:

if (system_supports_sve() && some_condition)
sve_user_disable();

... empty stubs are provided when support for SVE is not enabled. Note
that system_supports_sve() contains as IS_ENABLED(CONFIG_ARM64_SVE), so
the sve_user_disable() call should be optimized away entirely when
CONFIG_ARM64_SVE is not selected.

To ensure that this is the case, the stub definitions contain a
BUILD_BUG(), as we do for other stubs for which calls should always be
optimized away when the relevant config option is not selected.

At the same time, the include list of <asm/fpsimd.h> is sorted while
adding <asm/sysreg.h>.

Signed-off-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Reviewed-by: Dave Martin <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: kill change_cpacr()

Now that we have sysreg_clear_set(), we can use this instead of
change_cpacr().

Note that the order of the set and clear arguments differs between
change_cpacr() and sysreg_clear_set(), so these are flipped as part of
the conversion. Also, sve_user_enable() redundantly clears
CPACR_EL1_ZEN_EL0EN before setting it; this is removed for clarity.

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Dave Martin <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: kill config_sctlr_el1()

Now that we have sysreg_clear_set(), we can consistently use this
instead of config_sctlr_el1().

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Dave Martin <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: move SCTLR_EL{1,2} assertions to <asm/sysreg.h>

Currently we assert that the SCTLR_EL{1,2}_{SET,CLEAR} bits are
self-consistent with an assertion in config_sctlr_el1(). This is a bit
unusual, since config_sctlr_el1() doesn't make use of these definitions,
and is far away from the definitions themselves.

We can use the CPP #error directive to have equivalent assertions in
<asm/sysreg.h>, next to the definitions of the set/clear bits, which is
a bit clearer and simpler.

At the same time, lets fill in the upper 32 bits for both registers in
their respective RES0 definitions. This could be a little nicer with
GENMASK_ULL(63, 32), but this currently lives in <linux/bitops.h>, which
cannot safely be included from assembly, as <asm/sysreg.h> can.

Note the when the preprocessor evaluates an expression for an #if
directive, all signed or unsigned values are treated as intmax_t or
uintmax_t respectively. To avoid ambiguity, we define explicitly define
the mask of all 64 bits.

Signed-off-by: Mark Rutland <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Dave Martin <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: consistently use unsigned long for thread flags

In do_notify_resume, we manipulate thread_flags as a 32-bit unsigned
int, whereas thread_info::flags is a 64-bit unsigned long, and elsewhere
(e.g. in the entry assembly) we manipulate the flags as a 64-bit
quantity.

For consistency, and to avoid problems if we end up with more than 32
flags, let's make do_notify_resume take the flags as a 64-bit unsigned
long.

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Dave Martin <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

Revert "arm64: fix infinite stacktrace"

This reverts commit 7e7df71fd57ff2894d96abb0080922bf39460a79.

When unwinding out of the IRQ stack and onto the interrupted EL1 stack,
we cannot rely on the frame pointer being strictly increasing, as this
could terminate the backtrace early depending on how the stacks have
been allocated.

Signed-off-by: Will Deacon <[email protected]>

asm-generic: unistd.h: Wire up sys_rseq

The new rseq call arrived in 4.18-rc1, so provide it in the asm-generic
unistd.h for architectures such as arm64.

Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: rseq: Implement backend rseq calls and select HAVE_RSEQ

Implement calls to rseq_signal_deliver, rseq_handle_notify_resume
and rseq_syscall so that we can select HAVE_RSEQ on arm64.

Acked-by: Mathieu Desnoyers <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: make flatmem depend on !NUMA

Building without NUMA but with FLATMEM results in a link error
because mem_map[] is not available:

aarch64-linux-ld -EB -maarch64elfb --no-undefined -X -pie -shared -Bsymbolic --no-apply-dynamic-relocs --build-id -o .tmp_vmlinux1 -T ./arch/arm64/kernel/vmlinux.lds --whole-archive built-in.a --no-whole-archive --start-group arch/arm64/lib/lib.a lib/lib.a --end-group
init/do_mounts.o: In function `mount_block_root':
do_mounts.c:(.init.text+0x1e8): undefined reference to `mem_map'
arch/arm64/kernel/vdso.o: In function `vdso_init':
vdso.c:(.init.text+0xb4): undefined reference to `mem_map'

This uses the same trick as the other architectures, making flatmem
depend on !NUMA to avoid the broken configuration.

Fixes: e7d4bac428ed ("arm64: add ARM64-specific support for flatmem")
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: numa: rework ACPI NUMA initialization

Current ACPI ARM64 NUMA initialization code in

acpi_numa_gicc_affinity_init()

carries out NUMA nodes creation and cpu<->node mappings at the same time
in the arch backend so that a single SRAT walk is needed to parse both
pieces of information. This implies that the cpu<->node mappings must
be stashed in an array (sized NR_CPUS) so that SMP code can later use
the stashed values to avoid another SRAT table walk to set-up the early
cpu<->node mappings.

If the kernel is configured with a NR_CPUS value less than the actual
processor entries in the SRAT (and MADT), the logic in
acpi_numa_gicc_affinity_init() is broken in that the cpu<->node mapping
is only carried out (and stashed for future use) only for a number of
SRAT entries up to NR_CPUS, which do not necessarily correspond to the
possible cpus detected at SMP initialization in
acpi_map_gic_cpu_interface() (ie MADT and SRAT processor entries order
is not enforced), which leaves the kernel with broken cpu<->node
mappings.

Furthermore, given the current ACPI NUMA code parsing logic in
acpi_numa_gicc_affinity_init(), PXM domains for CPUs that are not parsed
because they exceed NR_CPUS entries are not mapped to NUMA nodes (ie the
PXM corresponding node is not created in the kernel) leaving the system
with a broken NUMA topology.

Rework the ACPI ARM64 NUMA initialization process so that the NUMA
nodes creation and cpu<->node mappings are decoupled. cpu<->node
mappings are moved to SMP initialization code (where they are needed),
at the cost of an extra SRAT walk so that ACPI NUMA mappings can be
batched before being applied, fixing current parsing pitfalls.

Acked-by: Hanjun Guo <[email protected]>
Tested-by: John Garry <[email protected]>
Fixes: d8b47fca8c23 ("arm64, ACPI, NUMA: NUMA support based on SRAT and
SLIT")
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Xie XiuQi <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Cc: Punit Agrawal <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Ganapatrao Kulkarni <[email protected]>
Cc: Jeremy Linton <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Xie XiuQi <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: add ARM64-specific support for flatmem

Flatmem is useful in reducing kernel memory usage.
One usecase is in kdump kernel. We are able to save
~14M by moving to flatmem scheme.

Cc: [email protected]
Cc: Nikunj Kela <[email protected]>
Signed-off-by: Nikunj Kela <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

MAINTAINERS: arm64: Remove boot/dts/ directory from arm64 entry

The arm-soc tree does a good job handling .dts files, so exclude them
from the ARM64 entry in MAINTAINERS.

Cc: Catalin Marinas <[email protected]>
Acked-by: Olof Johansson <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: mm: Export __flush_icache_range() to modules

lkdtm calls flush_icache_range(), which results in an out-of-line call
to __flush_icache_range(), which is not exported to modules.

Export the symbol to modules to fix this build breakage.

Fixes: 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache for kernel mappings")
Signed-off-by: Will Deacon <[email protected]>

arm64: topology: re-introduce numa mask check for scheduler MC selection

Commit 37c3ec2d810f ("arm64: topology: divorce MC scheduling domain from
core_siblings") selected the smallest of LLC, socket siblings, and NUMA
node siblings to ensure that the sched domain we build for the MC layer
isn't larger than the DIE above it or it's shrunk to the socket or NUMA
node if LLC exist acrosis NUMA node/chiplets.

Commit acd32e52e4e0 ("arm64: topology: Avoid checking numa mask for
scheduler MC selection") reverted the NUMA siblings checks since the
CPU topology masks weren't updated on hotplug at that time.

This patch re-introduces numa mask check as the CPU and NUMA topology
is now updated in hotplug paths. Effectively, this patch does the
partial revert of commit acd32e52e4e0.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: topology: rename llc_siblings to align with other struct members

Similar to core_sibling and thread_sibling, it's better to align and
rename llc_siblings to llc_sibling.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: smp: remove cpu and numa topology information when hotplugging out CPU

We already repopulate the information on CPU hotplug-in, so we can safely
remove the CPU topology and NUMA cpumap information during CPU hotplug
out operation. This will help to provide the correct cpumask for
scheduler domains.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: topology: restrict updating siblings_masks to online cpus only

It's incorrect to iterate over all the possible CPUs to update the
sibling masks when any CPU is hotplugged in. In case the topology
siblings masks of the CPU is removed when is it hotplugged out, we
end up updating those masks when one of it's sibling is powered up
again. This will provide inconsistent view.

Further, since the CPU calling update_sibling_masks is yet to be set
online, there's no need to compare itself with each online CPU when
updating the siblings masks.

This patch restricts updation of sibling masks only for CPUs that are
already online. It also the drops the unnecessary cpuid check.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: topology: add support to remove cpu topology sibling masks

This patch adds support to remove all the CPU topology information using
clear_cpu_topology and also resetting the sibling information on other
sibling CPUs. This will be used in cpu_disable so that all the topology
sibling information is removed on CPU hotplug out.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: numa: separate out updates to percpu nodeid and NUMA node cpumap

Currently numa_clear_node removes both cpu information from the NUMA
node cpumap as well as the NUMA node id from the cpu. Similarly
numa_store_cpu_info updates both percpu nodeid and NUMA cpumap.

However we need to retain the numa node id for the cpu and only remove
the cpu information from the numa node cpumap during CPU hotplug out.
The same can be extended for hotplugging in the CPU.

This patch separates out numa_{add,remove}_cpu from numa_clear_node and
numa_store_cpu_info.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: topology: refactor reset_cpu_topology to add support for removing topology

Currently reset_cpu_topology clears all the CPU topology information
and resets to default values. However we may need to just clear the
information when we hotplug out the CPU. In preparation to add the
support the same, let's refactor reset_cpu_topology to just reset
the information and move clearing out the topology information to
clear_cpu_topology.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ganapatrao Kulkarni <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: errata: Don't define type field twice for arm64_errata[] entries

The ERRATA_MIDR_REV_RANGE macro assigns ARM64_CPUCAP_LOCAL_CPU_ERRATUM
to the '.type' field of the 'struct arm64_cpu_capabilities', so there's
no need to assign it explicitly as well.

Signed-off-by: Will Deacon <[email protected]>

arm64: Implement page table free interfaces

arm64 requires break-before-make. Originally, before
setting up new pmd/pud entry for huge mapping, in few
cases, the modifying pmd/pud entry was still valid
and pointing to next level page table as we only
clear off leaf PTE in unmap leg.

a) This was resulting into stale entry in TLBs (as few
    TLBs also cache intermediate mapping for performance
    reasons)
b) Also, modifying pmd/pud was the only reference to
    next level page table and it was getting lost without
    freeing it. So, page leaks were happening.

Implement pud_free_pmd_page() and pmd_free_pte_page() to
enforce BBM and also free the leaking page tables.

Implementation requires,
1) Clearing off the current pud/pmd entry
2) Invalidation of TLB
3) Freeing of the un-used next level page tables

Reviewed-by: Will Deacon <[email protected]>
Signed-off-by: Chintan Pandya <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: tlbflush: Introduce __flush_tlb_kernel_pgtable

Add an interface to invalidate intermediate page tables
from TLB for kernel.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Chintan Pandya <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

Merge branch 'x86/mm' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into aarch64/for-next/core

Pull in core ioremap changes from -tip, since we depend on these for
re-enabling huge I/O mappings on arm64.

Signed-off-by: Will Deacon <[email protected]>

arm64: insn: Don't fallback on nosync path for general insn patching

Patching kernel instructions at runtime requires other CPUs to undergo
a context synchronisation event via an explicit ISB or an IPI in order
to ensure that the new instructions are visible. This is required even
for "hotpatch" instructions such as NOP and BL, so avoid optimising in
this case and always go via stop_machine() when performing general
patching.

ftrace isn't quite as strict, so it can continue to call the nosync
code directly.

Signed-off-by: Will Deacon <[email protected]>

arm64: IPI each CPU after invalidating the I-cache for kernel mappings

When invalidating the instruction cache for a kernel mapping via
flush_icache_range(), it is also necessary to flush the pipeline for
other CPUs so that instructions fetched into the pipeline before the
I-cache invalidation are discarded. For example, if module 'foo' is
unloaded and then module 'bar' is loaded into the same area of memory,
a CPU could end up executing instructions from 'foo' when branching into
'bar' if these instructions were fetched into the pipeline before 'foo'
was unloaded.

Whilst this is highly unlikely to occur in practice, particularly as
any exception acts as a context-synchronizing operation, following the
letter of the architecture requires us to execute an ISB on each CPU
in order for the new instruction stream to be visible.

Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: remove unused COMPAT_PSR definitions

Now that users have been migrated to PSR_AA32, kill the unused
COMPAT_PSR definitions.

The only difference we need a definition for is COMPAT_PSR_DIT_BIT,
which differs from PSR_AA32_DIT_BIT.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

kvm/arm: use PSR_AA32 definitions

Some code cares about the SPSR_ELx format for exceptions taken from
AArch32 to inspect or manipulate the SPSR_ELx value, which is already in
the SPSR_ELx format, and not in the AArch32 PSR format.

To separate these from cases where we care about the AArch32 PSR format,
migrate these cases to use the PSR_AA32_* definitions rather than
COMPAT_PSR_*.

There should be no functional change as a result of this patch.

Note that arm64 KVM does not support a compat KVM API, and always uses
the SPSR_ELx format, even for AArch32 guests.

Signed-off-by: Mark Rutland <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: use PSR_AA32 definitions

Some code cares about the SPSR_ELx format for exceptions taken from
AArch32 to inspect or manipulate the SPSR_ELx value, which is already in
the SPSR_ELx format, and not in the AArch32 PSR format.

To separate these from cases where we care about the AArch32 PSR format,
migrate these cases to use the PSR_AA32_* definitions rather than
COMPAT_PSR_*.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: ptrace: map SPSR_ELx<->PSR for compat tasks

The SPSR_ELx format for exceptions taken from AArch32 is slightly
different to the AArch32 PSR format.

Map between the two in the compat ptrace code.

Signed-off-by: Mark Rutland <[email protected]>
Fixes: 7206dc93a58fb764 ("arm64: Expose Arm v8.4 features")
Cc: Catalin Marinas <[email protected]>
Cc: Suzuki Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: compat: map SPSR_ELx<->PSR for signals

The SPSR_ELx format for exceptions taken from AArch32 differs from the
AArch32 PSR format. Thus, we must translate between the two when setting
up a compat sigframe, or restoring context from a compat sigframe.

Signed-off-by: Mark Rutland <[email protected]>
Fixes: 7206dc93a58fb764 ("arm64: Expose Arm v8.4 features")
Cc: Catalin Marinas <[email protected]>
Cc: Suzuki Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: don't zero DIT on signal return

Currently valid_user_regs() treats SPSR_ELx.DIT as a RES0 bit, causing
it to be zeroed upon exception return, rather than preserved. Thus, code
relying on DIT will not function as expected, and may expose an
unexpected timing sidechannel.

Let's remove DIT from the set of RES0 bits, such that it is preserved.
At the same time, the related comment is updated to better describe the
situation, and to take into account the most recent documentation of
SPSR_ELx, in ARM DDI 0487C.a.

Signed-off-by: Mark Rutland <[email protected]>
Fixes: 7206dc93a58fb764 ("arm64: Expose Arm v8.4 features")
Cc: Catalin Marinas <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: add PSR_AA32_* definitions

The AArch32 CPSR/SPSR format is *almost* identical to the AArch64
SPSR_ELx format for exceptions taken from AArch32, but the two have
diverged with the addition of DIT, and we need to treat the two as
logically distinct.

This patch adds new definitions for the SPSR_ELx format for exceptions
taken from AArch32, with a consistent PSR_AA32_ prefix. The existing
COMPAT_PSR_ definitions will be used for the PSR format as seen from
AArch32.

Definitions of DIT are provided for both, and inline functions are
provided to map between the two formats. Note that for SPSR_ELx, the
(RES0) J bit has been re-allocated as the DIT bit.

Once users of the COMPAT_PSR definitions have been migrated over to the
PSR_AA32 definitions, the (majority of) the former will be removed, so
no efforts is made to avoid duplication until then.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: Handle mismatched cache type

Track mismatches in the cache type register (CTR_EL0), other
than the D/I min line sizes and trap user accesses if there are any.

Fixes: be68a8aaf925 ("arm64: cpufeature: Fix CTR_EL0 field definitions")
Cc: <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: Fix mismatched cache line size detection

If there is a mismatch in the I/D min line size, we must
always use the system wide safe value both in applications
and in the kernel, while performing cache operations. However,
we have been checking more bits than just the min line sizes,
which triggers false negatives. We may need to trap the user
accesses in such cases, but not necessarily patch the kernel.

This patch fixes the check to do the right thing as advertised.
A new capability will be added to check mismatches in other
fields and ensure we trap the CTR accesses.

Fixes: be68a8aaf925 ("arm64: cpufeature: Fix CTR_EL0 field definitions")
Cc: <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Reported-by: Will Deacon <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: kconfig: Ensure spinlock fastpaths are inlined if !PREEMPT

When running with CONFIG_PREEMPT=n, the spinlock fastpaths fit inside
64 bytes, which typically coincides with the L1 I-cache line size.

Inline the spinlock fastpaths, like we do already for rwlocks.

Signed-off-by: Will Deacon <[email protected]>

arm64: locking: Replace ticket lock implementation with qspinlock

It's fair to say that our ticket lock has served us well over time, but
it's time to bite the bullet and start using the generic qspinlock code
so we can make use of explicit MCS queuing and potentially better PV
performance in future.

Signed-off-by: Will Deacon <[email protected]>

arm64: barrier: Implement smp_cond_load_relaxed

We can provide an implementation of smp_cond_load_relaxed using READ_ONCE
and __cmpwait_relaxed.

Signed-off-by: Will Deacon <[email protected]>

x86/mm: Add TLB purge to free pmd/pte page interfaces

ioremap() calls pud_free_pmd_page() / pmd_free_pte_page() when it creates
a pud / pmd map.  The following preconditions are met at their entry.
- All pte entries for a target pud/pmd address range have been cleared.
- System-wide TLB purges have been peformed for a target pud/pmd address
   range.

The preconditions assure that there is no stale TLB entry for the range.
Speculation may not cache TLB entries since it requires all levels of page
entries, including ptes, to have P & A-bits set for an associated address.
However, speculation may cache pud/pmd entries (paging-structure caches)
when they have P-bit set.

Add a system-wide TLB purge (INVLPG) to a single page after clearing
pud/pmd entry's P-bit.

SDM 4.10.4.1, Operation that Invalidate TLBs and Paging-Structure Caches,
states that:
  INVLPG invalidates all paging-structure caches associated with the
  current PCID regardless of the liner addresses to which they correspond.

Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Joerg Roedel <[email protected]>
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

ioremap: Update pgtable free interfaces with addr

The following kernel panic was observed on ARM64 platform due to a stale
TLB entry.

1. ioremap with 4K size, a valid pte page table is set.
2. iounmap it, its pte entry is set to 0.
3. ioremap the same address with 2M size, update its pmd entry with
a new value.
4. CPU may hit an exception because the old pmd entry is still in TLB,
which leads to a kernel panic.

Commit b6bdb7517c3d ("mm/vmalloc: add interfaces to free unmapped page
table") has addressed this panic by falling to pte mappings in the above
case on ARM64.

To support pmd mappings in all cases, TLB purge needs to be performed
in this case on ARM64.

Add a new arg, 'addr', to pud_free_pmd_page() and pmd_free_pte_page()
so that TLB purge can be added later in seprate patches.

[[email protected]: merge changes, rewrite patch description]
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Chintan Pandya <[email protected]>
Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Will Deacon <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/mm: Disable ioremap free page handling on x86-PAE

ioremap() supports pmd mappings on x86-PAE.  However, kernel's pmd
tables are not shared among processes on x86-PAE.  Therefore, any
update to sync'd pmd entries need re-syncing.  Freeing a pte page
also leads to a vmalloc fault and hits the BUG_ON in vmalloc_sync_one().

Disable free page handling on x86-PAE.  pud_free_pmd_page() and
pmd_free_pte_page() simply return 0 if a given pud/pmd entry is present.
This assures that ioremap() does not update sync'd pmd entries at the
cost of falling back to pte mappings.

Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Reported-by: Joerg Roedel <[email protected]>
Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

arm64: kexec: always reset to EL2 if present

Currently machine_kexec() doesn't reset to EL2 in the case of a
crashdump kernel. This leaves potentially dodgy state active at EL2, and
means that if the crashdump kernel attempts to online secondary CPUs,
these will be booted as mismatched ELs.

Let's reset to EL2, as we do in all other cases, and simplify things. If
EL2 state is corrupt, things are already sufficiently bad that kdump is
unlikely to work, and it's best-effort regardless.

Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: fix infinite stacktrace

I've got this infinite stacktrace when debugging another problem:
[  908.795225] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  908.796176]  1-...!: (1 GPs behind) idle=952/1/4611686018427387904 softirq=1462/1462 fqs=355
[  908.797692]  2-...!: (1 GPs behind) idle=f42/1/4611686018427387904 softirq=1550/1551 fqs=355
[  908.799189]  (detected by 0, t=2109 jiffies, g=130, c=129, q=235)
[  908.800284] Task dump for CPU 1:
[  908.800871] kworker/1:1     R  running task        0    32      2 0x00000022
[  908.802127] Workqueue: writecache-writeabck writecache_writeback [dm_writecache]
[  908.820285] Call trace:
[  908.824785]  __switch_to+0x68/0x90
[  908.837661]  0xfffffe00603afd90
[  908.844119]  0xfffffe00603afd90
[  908.850091]  0xfffffe00603afd90
[  908.854285]  0xfffffe00603afd90
[  908.863538]  0xfffffe00603afd90
[  908.865523]  0xfffffe00603afd90

The machine just locked up and kept on printing the same line over and
over again. This patch fixes it.

Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

ARM64: dump: Convert to use DEFINE_SHOW_ATTRIBUTE macro

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Peng Donglin <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

Linux 4.18-rc3

Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"We have a few regression fixes for qgroup rescan status tracking and
  the vm_fault_t conversion that mixed up the error values"

* tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix mount failure when qgroup rescan is in progress
  Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
  btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fix from Al Viro:
"Followup to procfs-seq_file series this window"

This fixes a memory leak by making sure that proc seq files release any
private data on close. The 'proc_seq_open' has to be properly paired
with 'proc_seq_release' that releases the extra private data.

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc: add proc_seq_release

Merge tag 'staging-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO fixes from Greg KH:
"Here are a few small staging and IIO driver fixes for 4.18-rc3.

  Nothing major or big, all just fixes for reported problems since
  4.18-rc1. All of these have been in linux-next this week with no
  reported problems"

* tag 'staging-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: android: ion: Return an ERR_PTR in ion_map_kernel
  staging: comedi: quatech_daqp_cs: fix no-op loop daqp_ao_insn_write()
  iio: imu: inv_mpu6050: Fix probe() failure on older ACPI based machines
  iio: buffer: fix the function signature to match implementation
  iio: mma8452: Fix ignoring MMA8452_INT_DRDY
  iio: tsl2x7x/tsl2772: avoid potential division by zero
  iio: pressure: bmp280: fix relative humidity unit

Merge tag 'tty-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
"Here are five fixes for the tty core and some serial drivers.

  The tty core ones fix some security and other issues reported by the
  syzbot that I have taken too long in responding to (sorry Tetsuo!).

  The 8350 serial driver fix resolves an issue of devices that used to
  work properly stopping working as they shouldn't have been added to a
  blacklist.

  All of these have been in linux-next for a few days with no reported
  issues"

* tag 'tty-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  vt: prevent leaking uninitialized data to userspace via /dev/vcs*
  serdev: fix memleak on module unload
  serial: 8250_pci: Remove stalled entries in blacklist
  n_tty: Access echo_* variables carefully.
  n_tty: Fix stall at n_tty_receive_char_special().

Merge tag 'usb-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here is a number of USB gadget and other driver fixes for 4.18-rc3.

  There's a bunch of them here, most of them being gadget driver and
  xhci host controller fixes for reported issues (as normal), but there
  are also some new device ids, and some fixes for the typec code.

  There is an acpi core patch in here that was acked by the acpi
  maintainer as it is needed for the typec fixes in order to properly
  solve a problem in that driver.

  All of these have been in linux-next this week with no reported
  issues"

* tag 'usb-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (33 commits)
  usb: chipidea: host: fix disconnection detect issue
  usb: typec: tcpm: fix logbuffer index is wrong if _tcpm_log is re-entered
  typec: tcpm: Fix a msecs vs jiffies bug
  NFC: pn533: Fix wrong GFP flag usage
  usb: cdc_acm: Add quirk for Uniden UBC125 scanner
  staging/typec: fix tcpci_rt1711h build errors
  usb: typec: ucsi: Fix for incorrect status data issue
  usb: typec: ucsi: acpi: Workaround for cache mode issue
  acpi: Add helper for deactivating memory region
  usb: xhci: increase CRS timeout value
  usb: xhci: tegra: fix runtime PM error handling
  usb: xhci: remove the code build warning
  xhci: Fix kernel oops in trace_xhci_free_virt_device
  xhci: Fix perceived dead host due to runtime suspend race with event handler
  dwc2: gadget: Fix ISOC IN DDMA PID bitfield value calculation
  usb: gadget: dwc2: fix memory leak in gadget_init()
  usb: gadget: composite: fix delayed_status race condition when set_interface
  usb: dwc2: fix isoc split in transfer with no data
  usb: dwc2: alloc dma aligned buffer for isoc split in
  usb: dwc2: fix the incorrect bitmaps for the ports of multi_tt hub
  ...

Merge tag 'dma-mapping-4.18-2' of git://git.infradead.org/users/hch/dma-mapping

Pull dma mapping fixlet from Christoph Hellwig:
"Add a missing export required by riscv and unicore"

* tag 'dma-mapping-4.18-2' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: export swiotlb_dma_ops

Merge branch 'parisc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc fixes and cleanups from Helge Deller:
"Nothing exiting in this patchset, just

   - small cleanups of header files

   - default to 4 CPUs when building a SMP kernel

   - mark 16kB and 64kB page sizes broken

   - addition of the new io_pgetevents syscall"

* 'parisc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Build kernel without -ffunction-sections
  parisc: Reduce debug output in unwind code
  parisc: Wire up io_pgetevents syscall
  parisc: Default to 4 SMP CPUs
  parisc: Convert printk(KERN_LEVEL) to pr_lvl()
  parisc: Mark 16kB and 64kB page sizes BROKEN
  parisc: Drop struct sigaction from not exported header file

Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Olof Johansson:
"A smaller batch for the end of the week (let's see if I can keep the
  weekly cadence going for once).

  All medium-grade fixes here, nothing worrisome:

   - Fixes for some fairly old bugs around SD card write-protect
     detection and GPIO interrupt assignments on Davinci.

   - Wifi module suspend fix for Hikey.

   - Minor DT tweaks to fix inaccuracies for Amlogic platforms, one
     of which solves booting with third-party u-boot"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  arm64: dts: hikey960: Define wl1837 power capabilities
  arm64: dts: hikey: Define wl1835 power capabilities
  ARM64: dts: meson-gxl: fix Mali GPU compatible string
  ARM64: dts: meson-axg: fix ethernet stability issue
  ARM64: dts: meson-gx: fix ATF reserved memory region
  ARM64: dts: meson-gxl-s905x-p212: Add phy-supply for usb0
  ARM64: dts: meson: fix register ranges for SD/eMMC
  ARM64: dts: meson: disable sd-uhs modes on the libretech-cc
  ARM: dts: da850: Fix interrups property for gpio
  ARM: davinci: board-da850-evm: fix WP pin polarity for MMC/SD

Merge tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- introduce __diag_* macros and suppress -Wattribute-alias warnings
   from GCC 8

- fix stack protector test script for x86_64

- fix line number handling in Kconfig

- document that '#' starts a comment in Kconfig

- handle P_SYMBOL property in dump debugging of Kconfig

- correct help message of LD_DEAD_CODE_DATA_ELIMINATION

- fix occasional segmentation faults in Kconfig

* tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: loop boundary condition fix
  kbuild: reword help of LD_DEAD_CODE_DATA_ELIMINATION
  kconfig: handle P_SYMBOL in print_symbol()
  kconfig: document Kconfig source file comments
  kconfig: fix line numbers for if-entries in menu tree
  stack-protector: Fix test with 32-bit userland and CONFIG_64BIT=y
  powerpc: Remove -Wattribute-alias pragmas
  disable -Wattribute-alias warning for SYSCALL_DEFINEx()
  kbuild: add macro for controlling warnings to linux/compiler.h

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"The biggest diffstat comes from self-test updates, plus there's entry
  code fixes, 5-level paging related fixes, console debug output fixes,
  and misc fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Clean up the printk()s in show_fault_oops()
  x86/mm: Drop unneeded __always_inline for p4d page table helpers
  x86/efi: Fix efi_call_phys_epilog() with CONFIG_X86_5LEVEL=y
  selftests/x86/sigreturn: Do minor cleanups
  selftests/x86/sigreturn/64: Fix spurious failures on AMD CPUs
  x86/entry/64/compat: Fix "x86/entry/64/compat: Preserve r8-r11 in int $0x80"
  x86/mm: Don't free P4D table when it is folded at runtime
  x86/entry/32: Add explicit 'l' instruction suffix
  x86/mm: Get rid of KERN_CONT in show_fault_oops()

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
"Tooling fixes mostly, plus a build warning fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf/core: Move inline keyword at the beginning of declaration
  tools/headers: Pick up latest kernel ABIs
  perf tools: Fix crash caused by accessing feat_ops[HEADER_LAST_FEATURE]
  perf script: Fix crash because of missing evsel->priv
  perf script: Add missing output fields in a hint
  perf bench: Fix numa report output code
  perf stat: Remove duplicate event counting
  perf alias: Rebuild alias expression string to make it comparable
  perf alias: Remove trailing newline when reading sysfs files
  perf tools: Fix a clang 7.0 compilation error
  tools include uapi: Synchronize bpf.h with the kernel
  tools include uapi: Update if_link.h to pick IFLA_{BRPORT_ISOLATED,VXLAN_TTL_INHERIT}
  tools include powerpc: Update arch/powerpc/include/uapi/asm/unistd.h copy to get 'rseq' syscall
  perf tools: Update x86's syscall_64.tbl, adding 'io_pgetevents' and 'rseq'
  tools headers uapi: Synchronize drm/drm.h
  perf intel-pt: Fix packet decoding of CYC packets
  perf tests: Add valid callback for parse-events test
  perf tests: Add event parsing error handling to parse events test
  perf report powerpc: Fix crash if callchain is empty
  perf test session topology: Fix test on s390
  ...

Merge tag 'selinux-pr-20180629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux fix from Paul Moore:
"One fairly straightforward patch to fix a longstanding issue where a
  process could stall while accessing files in selinuxfs and block
  everyone else due to a held mutex.

  The patch passes all our tests and looks to apply cleanly to your
  current tree"

* tag 'selinux-pr-20180629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: move user accesses in selinuxfs out of locked regions

Merge tag 'for-linus-20180629' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Small set of fixes for this series. Mostly just minor fixes, the only
  oddball in here is the sg change.

  The sg change came out of the stall fix for NVMe, where we added a
  mempool and limited us to a single page allocation. CONFIG_SG_DEBUG
  sort-of ruins that, since we'd need to account for that. That's
  actually a generic problem, since lots of drivers need to allocate SG
  lists. So this just removes support for CONFIG_SG_DEBUG, which I added
  back in 2007 and to my knowledge it was never useful.

  Anyway, outside of that, this pull contains:

   - clone of request with special payload fix (Bart)

   - drbd discard handling fix (Bart)

   - SATA blk-mq stall fix (me)

   - chunk size fix (Keith)

   - double free nvme rdma fix (Sagi)"

* tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
  sg: remove ->sg_magic member
  drbd: Fix drbd_request_prepare() discard handling
  blk-mq: don't queue more if we get a busy return
  block: Fix cloning of requests with a special payload
  nvme-rdma: fix possible double free of controller async event buffer
  block: Fix transfer when chunk sectors exceeds max

Merge tag 'powerpc-4.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
"Two regression fixes, and a new syscall wire-up:

   - A fix for the recent conversion to time64_t in the powermac RTC
     routines, which caused time to go backward.

   - Another fix for fallout from the split PMD PTL conversion.

   - Wire up the new io_pgetevents() syscall.

  Thanks to: Aneesh Kumar K.V, Arnd Bergmann, Breno Leitao, Mathieu
  Malaterre"

* tag 'powerpc-4.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powermac: Fix rtc read/write functions
  powerpc/mm/32: Fix pgtable_page_dtor call
  powerpc: Wire up io_pgetevents

Merge tag 'davinci-fixes-for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nsekhar/linux-davinci into fixes

This fixes polarity of SD card write-protect pin on DA850 EVM
and fixes interrupt property for DA850 SoC GPIO as defined in
device-tree.

Both of these are not introduced with v4.18 merge but have
existed prior.

* tag 'davinci-fixes-for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nsekhar/linux-davinci:
ARM: dts: da850: Fix interrups property for gpio
ARM: davinci: board-da850-evm: fix WP pin polarity for MMC/SD

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'hisi-fixes-for-4.18' of git://github.com/hisilicon/linux-hisi into fixes

ARM64: hisi fixes for 4.18

- Added power capabilities for the mmc host controller on the
  hikey and hikey960 boards to avoid broken wifi.

* tag 'hisi-fixes-for-4.18' of git://github.com/hisilicon/linux-hisi:
  arm64: dts: hikey960: Define wl1837 power capabilities
  arm64: dts: hikey: Define wl1835 power capabilities

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'amlogic-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-amlogic into fixes

Amlogic fixes for v4.18-rc
- minor 64-bit DT fixes

* tag 'amlogic-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-amlogic:
  ARM64: dts: meson-gxl: fix Mali GPU compatible string
  ARM64: dts: meson-axg: fix ethernet stability issue
  ARM64: dts: meson-gx: fix ATF reserved memory region
  ARM64: dts: meson-gxl-s905x-p212: Add phy-supply for usb0
  ARM64: dts: meson: fix register ranges for SD/eMMC
  ARM64: dts: meson: disable sd-uhs modes on the libretech-cc

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

- The alternatives patching code uses flush_icache_range() which itself
   uses alternatives. Change the code to use an unpatched variant of
   cache maintenance

- Remove unnecessary ISBs from set_{pte,pmd,pud}

- perf: xgene_pmu: Fix IOB SLOW PMU parser error

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Remove unnecessary ISBs from set_{pte,pmd,pud}
  arm64: Avoid flush_icache_range() in alternatives patching code
  drivers/perf: xgene_pmu: Fix IOB SLOW PMU parser error

Merge branch 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

- a revert because of bugzilla #200045 (and some documentation about
   it)

- another regression fix in the i2c-gpio driver

- a leak fix for the i2c core

* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: gpio: initialize SCL to HIGH again
  i2c: smbus: kill memory leak on emulated and failed DMA SMBus xfers
  i2c: algos: bit: mention our experience about initial states
  Revert "i2c: algo-bit: init the bus to a known state"

Merge tag 'ceph-for-4.18-rc3' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
"A trivial dentry leak fix from Zheng"

* tag 'ceph-for-4.18-rc3' of git://github.com/ceph/ceph-client:
ceph: fix dentry leak in splice_dentry()

parisc: Build kernel without -ffunction-sections

As suggested by Nick Piggin it seems we can drop the -ffunction-sections
compile flag, now that the kernel uses thin archives. Testing with 32-
and 64-bit kernel showed no difference in kernel size.

Suggested-by: Nicholas Piggin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

sg: remove ->sg_magic member

This was introduced more than a decade ago when sg chaining was
added, but we never really caught anything with it. The scatterlist
entry size can be critical, since drivers allocate it, so remove
the magic member. Recently it's been triggering allocation stalls
and failures in NVMe.

Tested-by: Jordan Glover <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'pci-v4.18-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI fixes from Bjorn Helgaas:

- Fix crash caused by endpoint library initialization order change
   (Alan Douglas)

- Fix shpchp NULL pointer dereference regression on non-ACPI platforms
   (Bjorn Helgaas)

- Move PCI_DOMAINS selection to fix build regression (Lorenzo
   Pieralisi)

* tag 'pci-v4.18-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  PCI: controller: Move PCI_DOMAINS selection to arch Kconfig
  PCI: Initialize endpoint library before controllers
  PCI: shpchp: Manage SHPC unconditionally on non-ACPI systems

Merge tag 'pm-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix up recently added features (the Kryo cpufreq driver and
  performance states coverage in the generic power domains framework),
  add missing documentation for a recently added sysfs knob in the
  intel_pstate driver and fix an error in its documentation.

  Specifics:

   - Fix the initialization time error handling in the recently added
     Kryo cpufreq driver (Dan Carpenter).

   - Fix up the recently added coverage of performance states in the
     generic power domains (genpd) framework (Viresh Kumar).

   - Add missing documentation of the new hwp_dynamic_boost sysfs knob
     in the intel_pstate driver (Rafael Wysocki).

   - Fix incorrect sysfs path in the intel_pstate driver documentation
     (Rafael Wysocki)"

* tag 'pm-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Documentation: intel_pstate: Describe hwp_dynamic_boost sysfs knob
  Documentation: admin-guide: intel_pstate: Fix sysfs path
  PM / Domains: Rename opp_node to np
  PM / Domains: Fix return value of of_genpd_opp_to_performance_state()
  cpufreq: qcom-kryo: Fix error handling in probe()

Merge tag 'drm-fixes-2018-06-29' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Nothing too major this round:

   - small set of mali-dp fixes

   - single meson fix

   - a bunch of amdgpu fixes (one makes non-4k page sizes not be a bad
     experience)"

* tag 'drm-fixes-2018-06-29' of git://anongit.freedesktop.org/drm/drm:
  drm/amd/display: release spinlock before committing updates to stream
  drm/amdgpu:Support new VCN FW version naming convention
  drm/amdgpu: fix UBSAN: Undefined behaviour for amdgpu_fence.c
  drm/meson: Fix an un-handled error path in 'meson_drv_bind_master()'
  drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
  drm/amdgpu: Count disabled CRTCs in commit tail earlier
  drm/mali-dp: Rectify the width and height passed to rotmem_required()
  drm/arm/malidp: Preserve LAYER_FORMAT contents when setting format
  drm: mali-dp: Enable Global SE interrupts mask for DP500
  drm/arm/malidp: Ensure that the crtcs are shutdown before removing any encoder/connector

Merge tag 'for-4.18/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

- Fix dm core to use more efficient bio_split() instead of
   bio_clone_bioset(). Also fixes splitting bio that has integrity
   payload.

- Three fixes related to properly validating DAX capabilities of a
   stacked DM device that will advertise DAX support.

- Update DM writecache target to use 2-factor allocator arguments. Kees
   says this is the last related change for 4.18.

- Fix DM zoned target to use GFP_NOIO to avoid triggering reclaim
   during IO submission (caught by lockdep).

- Fix DM thinp to gracefully recover from running out of data space
   while a previous async discard completes (whereby freeing space).

- Fix DM thinp's metadata transaction commit to avoid needless work.

* tag 'for-4.18/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: prevent DAX mounts if not supported
  dax: check for QUEUE_FLAG_DAX in bdev_dax_supported()
  pmem: only set QUEUE_FLAG_DAX for fsdax mode
  dm thin: handle running out of data space vs concurrent discard
  dm raid: don't use 'const' in function return
  dm zoned: avoid triggering reclaim from inside dmz_map()
  dm writecache: use 2-factor allocator arguments
  dm thin metadata: remove needless work from __commit_transaction
  dm: use bio_split() when splitting out the already processed bio

Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-linus

Pull single NVMe fix from Christoph.

* 'nvme-4.18' of git://git.infradead.org/nvme:
nvme-rdma: fix possible double free of controller async event buffer

drbd: Fix drbd_request_prepare() discard handling

Fix the test that verifies whether bio_op(bio) represents a discard
or write zeroes operation. Compile-tested only.

Cc: Philipp Reisner <[email protected]>
Cc: Lars Ellenberg <[email protected]>
Fixes: 7435e9018f91 ("drbd: zero-out partial unaligned discards on local backend")
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blk-mq: don't queue more if we get a busy return

Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.

Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.

Reviewed-by: Omar Sandoval <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

aio: mark __aio_sigset::sigmask const

io_pgetevents() will not change the signal mask. Mark it const to make
it clear and to reduce the need for casts in user code.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Al Viro <[email protected]>
[hch: reapply the patch that got incorrectly reverted]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

net: handle NULL ->poll gracefully

The big aio poll revert broke various network protocols that don't
implement ->poll as a patch in the aio poll serie removed sock_no_poll
and made the common code handle this case.

Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: Tetsuo Handa <[email protected]>
Fixes: a11e1d432b51 ("Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL")
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'pm-domains'

Merge fixups for the recent extenstion of the generic power domains
(genpd) framework covering performance states.

* pm-domains:
PM / Domains: Rename opp_node to np
PM / Domains: Fix return value of of_genpd_opp_to_performance_state()

i2c: gpio: initialize SCL to HIGH again

It seems that during the conversion from gpio* to gpiod*, the initial
state of SCL was wrongly switched to LOW. Fix it to be HIGH again.

Fixes: 7bb75029ef34 ("i2c: gpio: Enforce open drain through gpiolib")
Signed-off-by: Wolfram Sang <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Cc: [email protected]

i2c: smbus: kill memory leak on emulated and failed DMA SMBus xfers

If DMA safe memory was allocated, but the subsequent I2C transfer
fails the memory is leaked. Plug this leak.

Fixes: 8a77821e74d6 ("i2c: smbus: use DMA safe buffers for emulated SMBus transactions")
Signed-off-by: Peter Rosin <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Cc: [email protected]

i2c: algos: bit: mention our experience about initial states

So, if somebody wants to re-implement this in the future, we pinpoint to
a problem case.

Signed-off-by: Wolfram Sang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

Revert "i2c: algo-bit: init the bus to a known state"

This reverts commit 3e5f06bed72fe72166a6778f630241a893f67799. As per
bugzilla #200045, this caused a regression. I don't really see a way to
fix it without having the hardware. So, revert the patch and I will fix
the issue I was seeing originally in the i2c-gpio driver itself. I
couldn't find new users of this algorithm since, so there should be no
one depending on the new behaviour.

Reported-by: Sergey Larin <[email protected]>
Fixes: 3e5f06bed72f ("i2c: algo-bit: init the bus to a known state")
Signed-off-by: Wolfram Sang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Tested-by: Sergey Larin <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Cc: [email protected]

selinux: move user accesses in selinuxfs out of locked regions

If a user is accessing a file in selinuxfs with a pointer to a userspace
buffer that is backed by e.g. a userfaultfd, the userspace access can
stall indefinitely, which can block fsi->mutex if it is held.

For sel_read_policy(), remove the locking, since this method doesn't seem
to access anything that requires locking.

For sel_read_bool(), move the user access below the locked region.

For sel_write_bool() and sel_commit_bools_write(), move the user access
up above the locked region.

Cc: [email protected]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn <[email protected]>
Acked-by: Stephen Smalley <[email protected]>
[PM: removed an unused variable in sel_read_policy()]
Signed-off-by: Paul Moore <[email protected]>

parisc: Reduce debug output in unwind code

Signed-off-by: Helge Deller <[email protected]>

Merge tag 'drm-misc-fixes-2018-06-28' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

drm-misc-fixes for v4.18-rc3:
- A single fix in meson for an unhandled error path in meson_drv_bind_master().

Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge branch 'drm-fixes-4.18' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

A few fixes for 4.18:
- fix a read past the end of an array due to vega20 changes
- fix driver on systems with non-4K pages
- fix locking with pageflipping in DC that could lead to a sleep while atomic
- fix VCN firmware version reporting for upcoming firmware

Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

dm: prevent DAX mounts if not supported

Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
flag is set on the device's request queue to decide whether or not the
device supports filesystem DAX.  Really we should be using
bdev_dax_supported() like filesystems do at mount time.  This performs
other tests like checking to make sure the dax_direct_access() path works.

We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
any of the underlying devices do not support DAX.  This makes the handling
of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
in dm_table_set_restrictions().

Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
will ensure that filesystems built upon DM devices will only be able to
mount with DAX if all underlying devices also support DAX.

Signed-off-by: Ross Zwisler <[email protected]>
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: [email protected]
Acked-by: Dan Williams <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dax: check for QUEUE_FLAG_DAX in bdev_dax_supported()

Add an explicit check for QUEUE_FLAG_DAX to __bdev_dax_supported(). This
is needed for DM configurations where the first element in the dm-linear or
dm-stripe target supports DAX, but other elements do not. Without this
check __bdev_dax_supported() will pass for such devices, letting a
filesystem on that device mount with the DAX option.

Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Mike Snitzer <[email protected]>
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: [email protected]
Acked-by: Dan Williams <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

pmem: only set QUEUE_FLAG_DAX for fsdax mode

QUEUE_FLAG_DAX is an indication that a given block device supports
filesystem DAX and should not be set for PMEM namespaces which are in "raw"
mode. These namespaces lack struct page and are prevented from
participating in filesystem DAX as of commit 569d0365f571 ("dax: require
'struct page' by default for filesystem dax").

Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Mike Snitzer <[email protected]>
Fixes: 569d0365f571 ("dax: require 'struct page' by default for filesystem dax")
Cc: [email protected]
Acked-by: Dan Williams <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

Merge tag 'printk-for-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk fix from Petr Mladek:
"Revert a commit that went in by mistake. I already have a better fix
in the queue for 4.19"

* tag 'printk-for-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
Revert "lib/test_printf.c: call wait_for_random_bytes() before plain %p tests"

Merge tag 'sound-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Over a dozen changes, but all small and clear fixes.

  Half of them are the regression fixes for CA0132 HD-audio codec, and
  the rest are, again, a few more fixups for HD-audio, two UBSAN fixes
  in the core ioctls, and a trivial fix in the error path handling in
  lx6464es driver"

* tag 'sound-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: seq: Fix UBSAN warning at SNDRV_SEQ_IOCTL_QUERY_NEXT_CLIENT ioctl
  ALSA: timer: Fix UBSAN warning at SNDRV_TIMER_IOCTL_NEXT_DEVICE ioctl
  ALSA: hda/realtek - Fix the problem of two front mics on more machines
  ALSA: hda/realtek - Add a quirk for FSC ESPRIMO U9210
  ALSA: hda/ca0132: make array ca0132_alt_chmaps static
  ALSA: hda - Force to link down at runtime suspend on ATI/AMD HDMI
  ALSA: lx6464es: Missing error code in snd_lx6464es_create()
  ALSA: hda/ca0132: Fix DMic data rate for Alienware M17x R4
  ALSA: hda/ca0132: Restore PCM Analog Mic-In2
  ALSA: hda/ca0132: Don't test for QUIRK_NONE
  ALSA: hda/ca0132: Restore behavior of QUIRK_ALIENWARE
  ALSA: hda/ca0132: Delete redundant UNSOL event requests
  ALSA: hda/ca0132: Delete pointless assignments to struct auto_pin_cfg fields
  ALSA: hda/realtek - Fix pop noise on Lenovo P50 & co

Merge tag 'mtd/fixes-for-4.18-rc3' of git://git.infradead.org/linux-mtd

Pull mtd fixes from Boris Brezillon:
"NAND fixes:

   - add a quirk for a bunch of broken Macronix chips

   - fix nand_block_bad() when chip->ecc.read_oob() returns a positive
     value encoding the number of bitflips

   - fix OOB handling in the MXC driver fo V2.1 controllers

   - flag the ONFI_FEATURE_ON_DIE_ECC as supported in the Micron driver

   - hardcode clk rate in the denali_dt driver to address a bad DT
     representation (the proper fix will be queued for 4.19)

  SPI NOR fixes:

   - add an ULL constant to some ID definitions so that the ID is not
     truncated on 32-bit platforms

  MTD fixes:

   - fix the sector unlocking logic in the CFI driver"

* tag 'mtd/fixes-for-4.18-rc3' of git://git.infradead.org/linux-mtd:
  mtd: rawnand: denali_dt: set clk_x_rate to 200 MHz unconditionally
  mtd: dataflash: Use ULL suffix for 64-bit constants
  mtd: cfi_cmdset_0002: Avoid walking all chips when unlocking.
  mtd: cfi_cmdset_0002: Fix unlocking requests crossing a chip boudary
  mtd: cfi_cmdset_0002: fix SEGV unlocking multiple chips
  mtd: cfi_cmdset_0002: Use right chip in do_ppb_xxlock()
  mtd: rawnand: All AC chips have a broken GET_FEATURES(TIMINGS).
  mtd: rawnand: fix return value check for bad block status
  mtd: rawnand: mxc: set spare area size register explicitly
  mtd: rawnand: micron: add ONFI_FEATURE_ON_DIE_ECC to supported features

Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
"7 fixes"

* emailed patches from Andrew Morton <[email protected]>:
  proc: add Alexey to MAINTAINERS
  kasan: depend on CONFIG_SLUB_DEBUG
  include/linux/dax.h: dax_iomap_fault() returns vm_fault_t
  x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
  slub: fix failure when we delete and create a slab cache
  Revert mm/vmstat.c: fix vmstat_update() preemption BUG
  lib/percpu_ida.c: don't do alloc from per-CPU list if there is none

proc: add Alexey to MAINTAINERS

I know I'll regret it.

Link: http://lkml.kernel.org/r/20180627194840.GA18113@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: depend on CONFIG_SLUB_DEBUG

KASAN depends on having access to some of the accounting that SLUB_DEBUG
does; without it, there are immediate crashes [1]. So, the natural
thing to do is to make KASAN select SLUB_DEBUG.

[1] http://lkml.kernel.org/r/CAHmME9rtoPwxUSnktxzKso14iuVCWT7BE_-_8PAC=pGw1iJnQg@mail.gmail.com

Link: http://lkml.kernel.org/r/[email protected]
Fixes: f9e13c0a5a33 ("slab, slub: skip unnecessary kasan_cache_shutdown()")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/dax.h: dax_iomap_fault() returns vm_fault_t

Commit 1c8f422059ae ("mm: change return type to vm_fault_t") missed a
conversion. It's not a big problem at present because mainline is still
using

typedef int vm_fault_t;

Fixes: 1c8f422059ae ("mm: change return type to vm_fault_t")
Link: http://lkml.kernel.org/r/20180620172046.GA27894@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

There is a kernel panic that is triggered when reading /proc/kpageflags
on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

  BUG: unable to handle kernel paging request at fffffffffffffffe
  PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
  Oops: 0000 [#1] SMP PTI
  CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
  RIP: 0010:stable_page_flags+0x27/0x3c0
  Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
  RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
  RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
  RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
  R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
  R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
  FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
  Call Trace:
   kpageflags_read+0xc7/0x120
   proc_reg_read+0x3c/0x60
   __vfs_read+0x36/0x170
   vfs_read+0x89/0x130
   ksys_pread64+0x71/0x90
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7efc42e75e23
  Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
which changes how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone.  Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
the default (no memmap= given) memblock layout is like below:

  MEMBLOCK configuration:
   memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x4
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
   memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

  MEMBLOCK configuration:
   memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x3
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

This causes shrinking node 0's pfn range because it is calculated by the
address range of memblock.memory.  So some of struct pages in the gap
range are left uninitialized.

We have a function zero_resv_unavail() which does zeroing the struct pages
within the reserved unavailable range (i.e.  memblock.memory &&
!memblock.reserved).  This patch utilizes it to cover all unavailable
ranges by putting them into memblock.reserved.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <[email protected]>
Tested-by: Oscar Salvador <[email protected]>
Tested-by: "Herton R. Krzesinski" <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>