Git Repo - linux.git/log

s390/perf: obtain sie_block from the right address

Since commit 1179f170b6f0 ("s390: fix fpu restore in entry.S"), the
sie_block pointer is located at empty1[1], but in sie_block() it was
taken from empty1[0].

This leads to a random pointer being dereferenced, possibly causing
system crash.

This problem can be observed when running a simple guest with an endless
loop and recording the cpu-clock event:

sudo perf kvm --guestvmlinux=<guestkernel> --guest top -e cpu-clock

With this fix, the correct guest address is shown.

Fixes: 1179f170b6f0 ("s390: fix fpu restore in entry.S")
Cc: [email protected]
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Nico Boehr <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390: generate register offsets into pt_regs automatically

Use asm offsets method to generate register offsets into pt_regs,
instead of open-coding at several places.

Reviewed-by: Alexander Gordeev <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390: simplify early program check handler

Due to historic reasons the base program check handler calls a
configurable function. Given that there is only the early program
check handler left, simplify the code by directly calling that
function.

The only other user was removed with commit d485235b0054 ("s390:
assume diag308 set always works").

Also rename all functions and the asm file to reflect this.

Reviewed-by: Sven Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/crypto: fix scatterwalk_unmap() callers in AES-GCM

The argument of scatterwalk_unmap() is supposed to be the void* that was
returned by the previous scatterwalk_map() call.
The s390 AES-GCM implementation was instead passing the pointer to the
struct scatter_walk.

This doesn't actually break anything because scatterwalk_unmap() only uses
its argument under CONFIG_HIGHMEM and ARCH_HAS_FLUSH_ON_KUNMAP.

Fixes: bf7fa038707c ("s390/crypto: add s390 platform specific aes gcm support.")
Signed-off-by: Jann Horn <[email protected]>
Acked-by: Harald Freudenberger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/head: get rid of 31 bit leftovers

Get rid of old 31 bit leftovers within ipl code:

- convert everything to pc relative code
- use 64 bit addressing mode as early as possible
- use 64 bit arithmetics wherever possible

This way the code doesn't look as odd as before anymore.

Reviewed-by: Sven Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

scripts/min-tool-version.sh: raise minimum clang version to 14.0.0 for s390

Before version 14.0.0 llvm's integrated assembler fails to handle some
displacement variants:

arch/s390/purgatory/head.S:108:10: error: invalid operand for instruction
lg %r11,kernel_type-.base_crash(%r13)

Instead of working around this and given that this is already fixed
raise the minimum clang version from 13.0.0 to 14.0.0.

Acked-by: Nick Desaulniers <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Link: https://reviews.llvm.org/D113341
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/boot: do not emit debug info for assembly with llvm's IAS

Commit ee6d777d3e93 ("s390/decompressor: support extra debug flags")
added extra debug flags, in particular debug info is created,
depending on config options.

With llvm's IAS this causes this compile warning:

arch/s390/boot/head.S:38:1: warning: DWARF2 only supports one section per compilation unit
.section ".head.text","ax"
^

This is a known problem and was addressed with commit b8a9092330da
("Kbuild: do not emit debug info for assembly with LLVM_IAS=1").
Just do the same for s390 to get rid of this warning.

Tested-by: Nathan Chancellor <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/boot: workaround llvm IAS bug

For at least the mvc and clc instructions llvm's integrated assembler can
generate incorrect code. In particular this happens with decompressor boot
code. The reason seems to be that relocations for the second displacement
of each instruction are at incorrect locations (-/+: gas vs llvm IAS):

mvc     __LC_IO_NEW_PSW(16),.Lnewpsw

results in

        4:      d2 0f 01 f0 00 00       mvc     496(16,%r0),0
-                       8: R_390_12     .head.text+0x10
+        6: R_390_12     .head.text+0x10

and
clc     0(3,%r4),.L_hdr
results in

      258:      d5 02 40 00 00 00       clc     0(3,%r4),0
-                       25c: R_390_12   .head.text+0x324
+        25a: R_390_12   .head.text+0x324

Workaround this by writing the code in a different way.

Tested-by: Nathan Chancellor <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Link: https://github.com/llvm/llvm-project/issues/55411
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/purgatory: workaround llvm's IAS limitations

llvm's integrated assembler cannot handle immediate values which are
calculated with two local labels:

arch/s390/purgatory/head.S:139:11: error: invalid operand for instruction
aghi %r8,-(.base_crash-purgatory_start)

Workaround this by partially rewriting the code.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/entry: workaround llvm's IAS limitations

llvm's integrated assembler cannot handle immediate values which are
calculated with two local labels:

<instantiation>:3:13: error: invalid operand for instruction
clgfi %r14,.Lsie_done - .Lsie_gmap

Workaround this by adding clang specific code which reads the specific
value from memory. Since this code is within the hot paths of the kernel
and adds an additional memory reference, keep the original code, and add
ifdef'ed code.

Acked-by: Alexander Gordeev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/alternatives: remove padding generation code

clang fails to handle ".if" statements in inline assembly which are heavily
used in the alternatives code.

To work around this remove this code, and enforce that users of
alternatives must specify original and alternative instruction sequences
which have identical sizes. Add a compile time check with two ".org"
statements similar to arm64.

In result not only clang can handle this, but also quite a lot of code can
be removed.

Acked-by: Vasily Gorbik <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Link: https://github.com/ClangBuiltLinux/linux/issues/1356
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/alternatives: provide identical sized orginal/alternative sequences

Explicitly provide identical sized original/alternative instruction
sequences. This way there is no need for the s390 specific alternatives
infrastructure to generate padding sequences.
The code which generates such sequences will be removed with a follow on
patch.

Acked-by: Vasily Gorbik <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/cpumf: add new extended counter set for IBM z16

Export the extended counter set counters of the IBM z16 via sysfs.

Signed-off-by: Thomas Richter <[email protected]>
Acked-by: Sumanth Korikkar <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/preempt: disable __preempt_count_add() optimization for PROFILE_ALL_BRANCHES

gcc 12 does not (always) optimize away code that should only be generated
if parameters are constant and within in a certain range. This depends on
various obscure kernel config options, however in particular
PROFILE_ALL_BRANCHES can trigger this compile error:

In function ‘__atomic_add_const’,
    inlined from ‘__preempt_count_add.part.0’ at ./arch/s390/include/asm/preempt.h:50:3:
./arch/s390/include/asm/atomic_ops.h:80:9: error: impossible constraint in ‘asm’
   80 |         asm volatile(                                                   \
      |         ^~~

Workaround this by simply disabling the optimization for
PROFILE_ALL_BRANCHES, since the kernel will be so slow, that this
optimization won't matter at all.

Reported-by: Thomas Richter <[email protected]>
Reviewed-by: Sven Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/stp: clock_delta should be signed

clock_delta is declared as unsigned long in various places. However,
the clock sync delta can be negative. This would add a huge positive
offset in clock_sync_global where clock_delta is added to clk.eitod
which is a 72 bit integer. Declare it as signed long to fix this.

Cc: [email protected]
Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/stp: fix todoff size

The size of the TOD offset field in the stp info response is 64 bits.

Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/pai: add support for cryptography counters

PMU device driver perf_pai_crypto supports Processor Activity
Instrumentation (PAI), available with IBM z16:
- maps a full page to lowcore address 0x1500.
- uses CR0 bit 13 to turn PAI crypto counting on and off.
- creates a sample with raw data on each context switch out when
  at context switch some mapped counters have a value of nonzero.
This device driver only supports CPU wide context, no task context
is allowed.

Support for counting:
- one or more counters can be specified using
  perf stat -e pai_crypto/xxx/
  where xxx stands for the counter event name. Multiple invocation
  of this command is possible. The counter names are listed in
  /sys/devices/pai_crypto/events directory.
- one special counters can be specified using
  perf stat -e pai_crypto/CRYPTO_ALL/
  which returns the sum of all incremented crypto counters.
- one event pai_crypto/CRYPTO_ALL/ is reserved for sampling.
  No multiple invocations are possible. The event collects data at
  context switch out and saves them in the ring buffer.

Add qpaci assembly instruction to query supported memory mapped crypto
counters. It returns the number of counters (no holes allowed in that
range).

The PAI crypto counter events are system wide and can not be executed
in parallel. Therefore some restrictions documented in function
paicrypt_busy apply.
In particular event CRYPTO_ALL for sampling must run exclusive.
Only counting events can run in parallel.

PAI crypto counter events can not be created when a CPU hot plug
add is processed. This means a CPU hot plug add does not get
the necessary PAI event to record PAI cryptography counter increments
on the newly added CPU. CPU hot plug remove removes the event and
terminates the counting of PAI counters immediately.

Co-developed-by: Sven Schnelle <[email protected]>
Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Juergen Christ <[email protected]>
Signed-off-by: Thomas Richter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

entry: Rename arch_check_user_regs() to arch_enter_from_user_mode()

arch_check_user_regs() is used at the moment to verify that struct pt_regs
contains valid values when entering the kernel from userspace. s390 needs
a place in the generic entry code to modify a cpu data structure when
switching from userspace to kernel mode. As arch_check_user_regs() is
exactly this, rename it to arch_enter_from_user_mode().

When entering the kernel from userspace, arch_check_user_regs() is
used to verify that struct pt_regs contains valid values. Note that
the NMI codepath doesn't call this function. s390 needs a place in the
generic entry code to modify a cpu data structure when switching from
userspace to kernel mode. As arch_check_user_regs() is exactly this,
rename it to arch_enter_from_user_mode().

Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/compat: cleanup compat_linux.h header file

Remove various declarations from former s390 specific compat system
calls which have been removed with commit fef747bab3c0 ("s390: use
generic UID16 implementation"). While at it clean up the whole small
header file.

Signed-off-by: Heiko Carstens <[email protected]>

s390/entry: remove broken and not needed code

LLVM's integrated assembler reports the following error when compiling
entry.S:

<instantiation>:38:5: error: unknown token in expression
tm %r8,0x0001 # coming from user space?

The correct instruction would have been tmhh instead of tm.
The current code is doing nothing, since (with gas) it get's
translated to a tm instruction which reads from real address 8, which
again contains always zero, and therefore the conditional code is
never executed.
Note that due to the missing displacement gas translates "%r8" into
"8(%r0)".

Also code inspection reveals that this conditional code is not needed.
Therefore remove it.

Reviewed-by: Sven Schnelle <[email protected]>
Reviewed-by: Alexander Gordeev <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/boot: convert parmarea to C

Convert parmarea to C, which makes it much easier to initialize it. No need
to keep offsets in assembler code in sync with struct parmarea anymore.

Reviewed-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/boot: convert initial lowcore to C

Convert initial lowcore to C and use proper defines and structures to
initialize it. This should make the z/VM ipl procedure a bit less magic.

Acked-by: Peter Oberparleiter <[email protected]>
Reviewed-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/ptrace: move short psw definitions to ptrace header file

The short psw definitions are contained in compat header files, however
short psws are not compat specific. Therefore move the definitions to
ptrace header file. This also gets rid of a compat header include in kvm
code.

Acked-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/head: initialize all new psws

Initialize all new psws with disabled wait psws, except for the restart new
psw. This way every unexpected exception, svc, machine check, or interrupt
is handled properly.

Reviewed-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/boot: change initial program check handler to disabled wait psw

The program check handler of the kernel image points to
startup_pgm_check_handler. However an early program check which happens
while loading the kernel image will jump to potentially random code, since
the code of the program check handler is not yet loaded; leading to a
program check loop.

Therefore initialize it to a disabled wait psw and let the startup code set
the proper psw when everything is in memory.

Reviewed-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/head: adjust iplstart entry point

Move iplstart entry point to 0x200 again, instead of the middle of the ipl
code. This way even the comment describing the ccw program is correct
again.

Acked-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/boot: get rid of startup archive

The final kernel image is created by linking decompressor object files with
a startup archive. The startup archive file however does not contain only
optional code and data which can be discarded if not referenced. It also
contains mandatory object data like head.o which must never be discarded,
even if not referenced.

Move the decompresser code and linker script to the boot directory and get
rid of the startup archive so everything is kept during link time.

Acked-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/vx: remove comments from macros which break LLVM's IAS

LLVM's integrated assembler does not like comments within macros:

<instantiation>:3:19: error: too many positional arguments
GR_NUM b2, 1 /* Base register */
^
Remove them, since they are obvious anyway.

Signed-off-by: Heiko Carstens <[email protected]>

s390/extable: prefer local labels in .set directives

Use local labels in .set directives to avoid potential compile errors
with LTO + clang. See commit 334865b2915c ("x86/extable: Prefer local
labels in .set directives") for further details.

Since s390 doesn't support LTO currently this doesn't fix a real bug
for now, but helps to avoid problems as soon as required pieces have
been added to llvm.

Signed-off-by: Heiko Carstens <[email protected]>

s390/nospec: prefer local labels in .set directives

Use local labels in .set directives to avoid potential compile errors
with LTO + clang. See commit 334865b2915c ("x86/extable: Prefer local
labels in .set directives") for further details.

Since s390 doesn't support LTO currently this doesn't fix a real bug
for now, but helps to avoid problems as soon as required pieces have
been added to llvm.

Signed-off-by: Heiko Carstens <[email protected]>

s390/hypfs: fix typos in comments

Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/crypto: fix typos in comments

Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/consoles: improve panic notifiers reliability

Currently many console drivers for s390 rely on panic/reboot notifiers
to invoke callbacks on these events. The panic() function disables local
IRQs, secondary CPUs and preemption, so callbacks invoked on panic are
effectively running in atomic context.

Happens that most of these console callbacks from s390 doesn't take the
proper care with regards to atomic context, like taking spinlocks that
might be taken in other function/CPU and hence will cause a lockup
situation.

The goal for this patch is to improve the notifiers reliability, acting
on 4 console drivers, as detailed below:

(1) con3215: changed a regular spinlock to the trylock alternative.

(2) con3270: also changed a regular spinlock to its trylock counterpart,
but here we also have another problem: raw3270_activate_view() takes a
different spinlock. So, we worked a helper to validate if this other lock
is safe to acquire, and if so, raw3270_activate_view() should be safe.

Notice though that there is a functional change here: it's now possible
to continue the notifier code [reaching con3270_wait_write() and
con3270_rebuild_update()] without executing raw3270_activate_view().

(3) sclp: a global lock is used heavily in the functions called from
the notifier, so we added a check here - if the lock is taken already,
we just bail-out, preventing the lockup.

(4) sclp_vt220: same as (3), a lock validation was added to prevent the
potential lockup problem.

Besides (1)-(4), we also removed useless void functions, adding the
code called from the notifier inside its own body, and changed the
priority of such notifiers to execute late, since they are "heavyweight"
for the panic environment, so we aim to reduce risks here.
Changed return values to NOTIFY_DONE as well, the standard one.

Signed-off-by: Guilherme G. Piccoli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/irq: utilize RCU instead of irq_lock_sparse() in show_msi_interrupt()

As demonstrated by commit 74bdf7815dfb ("genirq: Speedup
show_interrupts()"), irq_desc can be accessed safely in RCU read section.

Hence here resorting to rcu read lock to get rid of irq_lock_sparse().

Signed-off-by: Pingfan Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390: add KCSAN instrumentation to barriers and spinlocks

test_barrier fails on s390 because of the missing KCSAN instrumentation
for several synchronization primitives.

Add it to barriers by defining __mb(), __rmb(), __wmb(), __dma_rmb()
and __dma_wmb(), and letting the common code in asm-generic/barrier.h
do the rest.

Spinlocks require instrumentation only on the unlock path; notify KCSAN
that the CPU cannot move memory accesses outside of the spin lock. In
reality it also cannot move stores inside of it, but this is not
important and can be omitted.

Reported-by: Tobias Huschle <[email protected]>
Signed-off-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/pci: add error record for CC 2 retries

Currently it is not detectable from within Linux when PCI instructions
are retried because of a busy condition. Detecting such conditions and
especially how long they lasted can however be quite useful in problem
determination. This patch enables this by adding an s390dbf error log
when a CC 2 is first encountered as well as after the retried
instruction.

Despite being unlikely it may be possible that these added debug
messages drown out important other messages so allow setting the debug
level in zpci_err_insn*() and set their level to 1 so they can be
filtered out if need be.

Reviewed-by: Matthew Rosato <[email protected]>
Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/pci: add PCI access type and length to error records

Currently when a PCI instruction returns a non-zero condition code it
can be very hard to tell from the s390dbf logs what kind of instruction
was executed. In case of PCI memory I/O (MIO) instructions it is even
impossible to tell if we attempted a load, store or block store or how
large the access was because only the address is logged.

Improve this by adding an indicator byte for the instruction type to the
error record and also store the length of the access for MIO
instructions where this can not be deduced from the request.

We use the following indicator values:
- 'l': PCI load
- 's': PCI store
- 'b': PCI store block
- 'L': PCI load (MIO)
- 'S': PCI store (MIO)
- 'B': PCI store block (MIO)
- 'M': MPCIFC
- 'R': RPCIT

Reviewed-by: Matthew Rosato <[email protected]>
Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/pci: don't log availability events as errors

Availability events are logged in s390dbf in s390dbf/pci_error/hex_ascii
even though they don't indicate an error condition.

They have also become redundant as commit 6526a597a2e85 ("s390/pci: add
simpler s390dbf traces for events") added an s390dbf/pci_msg/sprintf log
entry for availability events which contains all non reserved fields of
struct zpci_ccdf_avail. On the other hand the availability entries in
the error log make it easy to miss actual errors and may even overwrite
error entries if the message buffer wraps.

Thus simply remove the availability events from the error log thereby
establishing the rule that any content in s390dbf/pci_error indicates
some kind of error.

Reviewed-by: Matthew Rosato <[email protected]>
Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/pci: make better use of zpci_dbg() levels

While the zpci_dbg() macro offers a level parameter this is currently
largely unused. The only instance with higher importance than 3 is the
UID checking change debug message which is not actually more important
as the UID uniqueness guarantee is already exposed in sysfs so this
should rather be 3 as well.

On the other hand the "add ..." message which shows what devices are
visible at the lowest level is essential during problem determination.
By setting its level to 1, lowering the debug level can act as a filter
to only show the available functions.

On the error side the default level is set to 6 while all existing
messages are printed at level 0. This is inconsistent and means there is
no room for having messages be invisible on the default level so instead
set the default level to 3 like for errors matching the default for
debug messages.

Reviewed-by: Matthew Rosato <[email protected]>
Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/vfio-ap: remove superfluous MODULE_DEVICE_TABLE declaration

The vfio_ap module tries to register for the vfio_ap bus - but that's
the interface that it provides itself, so this does not make much sense,
thus let's simply drop this statement now.

Signed-off-by: Thomas Huth <[email protected]>
Reviewed-by: Tony Krowiak <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>

s390/vdso: add vdso randomization

Randomize the address of vdso if randomize_va_space is enabled.
Note that this keeps the vdso address on the same PMD as the stack
to avoid allocating an extra page table just for vdso.

Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/vdso: map vdso above stack

In the current code vdso is mapped below the stack. This is
problematic when programs mapped to the top of the address space
are allocating a lot of memory, because the heap will clash with
the vdso. To avoid this map the vdso above the stack and move
STACK_TOP so that it all fits into three level paging.

Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/vdso: move vdso mapping to its own function

This is a preparation patch for adding vdso randomization to s390.
It adds a function vdso_size(), which will be used later in calculating
the STACK_TOP value. It also moves the vdso mapping into a new function
vdso_map(), to keep the code similar to other architectures.

Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/mmap: increase stack/mmap gap to 128MB

This basically reverts commit 9e78a13bfb16 ("[S390] reduce miminum
gap between stack and mmap_base"). 32MB is not enough space
between stack and mmap for some programs. Given that compat
task aren't common these days, lets revert back to 128MB.

Signed-off-by: Sven Schnelle <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/zcrypt: code cleanup

This patch tries to fix as much as possible of the
checkpatch.pl --strict findings:
  CHECK: Logical continuations should be on the previous line
  CHECK: No space is necessary after a cast
  CHECK: Alignment should match open parenthesis
  CHECK: 'useable' may be misspelled - perhaps 'usable'?
  WARNING: Possible repeated word: 'is'
  CHECK: spaces preferred around that '*' (ctx:VxV)
  CHECK: Comparison to NULL could be written "!msg"
  CHECK: Prefer kzalloc(sizeof(*zc)...) over kzalloc(sizeof(struct...)...)
  CHECK: Unnecessary parentheses around resp_type->work
  CHECK: Avoid CamelCase: <xcRB>

There is no functional change comming with this patch, only
code cleanup, renaming, whitespaces, indenting, ... but no
semantic change in any way. Also the API (zcrypt and pkey
header file) is semantically unchanged.

Signed-off-by: Harald Freudenberger <[email protected]>
Reviewed-by: Jürgen Christ <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/zcrypt: cleanup CPRB struct definitions

This patch does a little cleanup on the CPRBX struct
in zcrypt.h and the redundant CPRB struct definition in
zcrypt_msgtype6.c. Especially some of the misleading
fields from the CPRBX struct have been removed.

There is no semantic change coming with this patch.
The field names changed in the XCRB struct are only related
to reserved fields which should never been used.

Signed-off-by: Harald Freudenberger <[email protected]>
Reviewed-by: Jürgen Christ <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/ap: uevent on apmask/aqpmask change

This patch introduces user space notifications for changes
on the apmask or aqmask attributes. So it could be possible
to write a udev rule to load/unload the vfio_ap kernel module
based on changes of these masks.

On chance of the apmask or aqmask an AP change event will
be produced with an uevent environment variable showing
the new APMASK or AQMASK mask.

So a change on the apmask triggers an uvevent like this:

  KERNEL[490.160396] change   /devices/ap (ap)
  ACTION=change
  DEVPATH=/devices/ap
  SUBSYSTEM=ap
  APMASK=0xffffffdfffffffffffffffffffffffffffffffffffffffffffffffffffffffff
  SEQNUM=13367

and a change on the aqmask looks like this:

  KERNEL[283.217642] change   /devices/ap (ap)
  ACTION=change
  DEVPATH=/devices/ap
  SUBSYSTEM=ap
  AQMASK=0xfbffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
  SEQNUM=13348

Only real changes to the masks are processed - the old and
new masks are compared and no action is done if the values
are equal (and thus no uevent). The emit of the uevent is
the very last action done when a mask change is processed.
However, there is no guarantee that all unbind/bind actions
caused by the apmask/aqmask changes are completed when the
apmask/aqmask change uevent is received in userspace.

Signed-off-by: Harald Freudenberger <[email protected]>
Tested-by: Thomas Huth <[email protected]>
Reviewed-by: Jürgen Christ <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/cio: simplify the calculation of variables

Fix the following coccicheck warnings:
./arch/s390/include/asm/scsw.h:695:47-49: WARNING
!A || A && B is equivalent to !A || B

I apply a readable version just to get rid of a warning.

Signed-off-by: Haowen Bai <[email protected]>
Reviewed-by: Peter Oberparleiter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Sven Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/smp: sort out physical vs virtual CPU0 lowcore pointer

SPX instruction called from set_prefix() expects physical
address of the lowcore to be installed, but instead the
virtual address is passed.

Note: this does not fix a bug currently, since virtual and
physical addresses are identical.

Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/zcrypt: add display of ASYM master key verification pattern

This patch extends the sysfs attribute mkvps for CCA cards
to show the states and master key verification patterns for
the old, current and new ASYM master key registers.

With this patch now all relevant master key verification
patterns related to a CCA HSM are available with the mkvps
sysfs attribute. This is a requirement for some exploiters
like the kubernetes cex plugin or initrd code needing to
verify the master key verification patterns on HSMs before
use.

A sample output:
  cat /sys/devices/ap/card04/04.0005/mkvps
  AES NEW: empty 0x0000000000000000
  AES CUR: valid 0xe9a49a58cd039bed
  AES OLD: valid 0x7d10d17bc8a409c4
  APKA NEW: empty 0x0000000000000000
  APKA CUR: valid 0x5f2f27aaa2d59b4a
  APKA OLD: valid 0x82a5e2cd5030d5ec
  ASYM NEW: empty 0x00000000000000000000000000000000
  ASYM CUR: valid 0x650c25a89c27e716d0e692b6c83f10e5
  ASYM OLD: valid 0xf8ae2acf8bfc57f0a0957c732c16078b

Signed-off-by: Harald Freudenberger <[email protected]>
Reviewed-by: Jörg Schmidbauer <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/kexec: set end-of-ipl flag in last diag308 call

If the facility IPL-complete-control is present then the last diag308
call made by kexec shall set the end-of-ipl flag in the subcode register
to signal the hypervisor that this is the last diag308 call made by Linux.
Only the diag308 calls made during a regular kexec need to set
the end-of-ipl flag, in all other cases the hypervisor will ignore it.

Signed-off-by: Alexander Egorenkov <[email protected]>
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

s390/sclp: add detection of IPL-complete-control facility

The presence of the IPL-complete-control facility can be derived
from the hypervisor's SCLP info response.

Signed-off-by: Alexander Egorenkov <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>

Linux 5.18-rc4

Merge tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

- Fix a corner case when calculating sched runqueue variables

That fix also removes a check for a zero divisor in the code, without
mentioning it.  Vincent clarified that it's ok after I whined about it:

  https://lore.kernel.org/all/CAKfTPtD2QEyZ6ADd5WrwETMOX0XOwJGnVddt7VHgfURdqgOS-Q@mail.gmail.com/

* tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/pelt: Fix attach_entity_load_avg() corner case

Merge tag 'powerpc-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

- Partly revert a change to our timer_interrupt() that caused lockups
   with high res timers disabled.

- Fix a bug in KVM TCE handling that could corrupt kernel memory.

- Two commits fixing Power9/Power10 perf alternative event selection.

Thanks to Alexey Kardashevskiy, Athira Rajeev, David Gibson, Frederic
Barrat, Madhavan Srinivasan, Miguel Ojeda, and Nicholas Piggin.

* tag 'powerpc-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/perf: Fix 32bit compile
  powerpc/perf: Fix power10 event alternatives
  powerpc/perf: Fix power9 event alternatives
  KVM: PPC: Fix TCE handling for VFIO
  powerpc/time: Always set decrementer in timer_interrupt()

Merge tag 'perf_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

- Add Sapphire Rapids CPU support

- Fix a perf vmalloc-ed buffer mapping error (PERF_USE_VMALLOC in use)

* tag 'perf_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/cstate: Add SAPPHIRERAPIDS_X CPU support
perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled

Merge tag 'edac_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:

- Read the reported error count from the proper register on
synopsys_edac

* tag 'edac_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/synopsys: Read the error count from the correct register

kvmalloc: use vmalloc_huge for vmalloc allocations

Since commit 559089e0a93d ("vmalloc: replace VM_NO_HUGE_VMAP with
VM_ALLOW_HUGE_VMAP"), the use of hugepage mappings for vmalloc is an
opt-in strategy, because it caused a number of problems that weren't
noticed until x86 enabled it too.

One of the issues was fixed by Nick Piggin in commit 3b8000ae185c
("mm/vmalloc: huge vmalloc backing pages should be split rather than
compound"), but I'm still worried about page protection issues, and
VM_FLUSH_RESET_PERMS in particular.

However, like the hash table allocation case (commit f2edd118d02d:
"page_alloc: use vmalloc_huge for large system hash"), the use of
kvmalloc() should be safe from any such games, since the returned
pointer might be a SLUB allocation, and as such no user should
reasonably be using it in any odd ways.

We also know that the allocations are fairly large, since it falls back
to the vmalloc case only when a kmalloc() fails. So using a hugepage
mapping seems both safe and relevant.

This patch does show a weakness in the opt-in strategy: since the opt-in
flag is in the 'vm_flags', not the usual gfp_t allocation flags, very
few of the usual interfaces actually expose it.

That's not much of an issue in this case that already used one of the
fairly specialized low-level vmalloc interfaces for the allocation, but
for a lot of other vmalloc() users that might want to opt in, it's going
to be very inconvenient.

We'll either have to fix any compatibility problems, or expose it in the
gfp flags (__GFP_COMP would have made a lot of sense) to allow normal
vmalloc() users to use hugepage mappings. That said, the cases that
really matter were probably already taken care of by the hash tabel
allocation.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/all/CAHk-=whao=iosX1s5Z4SF-ZGa-ebAukJoAdUJFk5SPwnofV+Vg@mail.gmail.com/
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Menzel <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

page_alloc: use vmalloc_huge for large system hash

Use vmalloc_huge() in alloc_large_system_hash() so that large system
hash (>= PMD_SIZE) could benefit from huge pages.

Note that vmalloc_huge only allocates huge pages for systems with
HAVE_ARCH_HUGE_VMALLOC.

Signed-off-by: Song Liu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:

- cap maximum sector size reported to avoid mount problems

- reference count fix

- fix filename rename race

* tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION
  ksmbd: increment reference count of parent fp
  ksmbd: remove filename in ksmbd_file

Merge tag 'arc-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:

- Assorted fixes

* tag 'arc-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: remove redundant READ_ONCE() in cmpxchg loop
  ARC: atomic: cleanup atomic-llsc definitions
  arc: drop definitions of pgd_index() and pgd_offset{, _k}() entirely
  ARC: dts: align SPI NOR node name with dtschema
  ARC: Remove a redundant memset()
  ARC: fix typos in comments
  ARC: entry: fix syscall_trace_exit argument

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"One fix for an information leak caused by copying a buffer to
userspace without checking for error first in the sr driver"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sr: Do not leak information in ioctl

Merge tag 'for-linus-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
"A simple cleanup patch and a refcount fix for Xen on Arm"

* tag 'for-linus-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
arm/xen: Fix some refcount leaks
xen: Convert kmap() to kmap_local_page()

Merge tag 'drm-fixes-2022-04-23' of git://anongit.freedesktop.org/drm/drm

Pull more drm fixes from Dave Airlie:
"Maarten was away, so Maxine stepped up and sent me the drm-fixes
  merge, so no point leaving it for another week.

  The big change is an OF revert around bridge/panels, it may have some
  driver fallout, but hopefully this revert gets them shook out in the
  next week easier.

  Otherwise it's a bunch of locking/refcounts across drivers, a radeon
  dma_resv logic fix and some raspberry pi panel fixes.

  panel:
   - revert of patch that broke panel/bridge issues

  dma-buf:
   - remove unused header file.

  amdgpu:
   - partial revert of locking change

  radeon:
   - fix dma_resv logic inversion

  panel:
   - pi touchscreen panel init fixes

  vc4:
   - build fix
   - runtime pm refcount fix

  vmwgfx:
   - refcounting fix"

* tag 'drm-fixes-2022-04-23' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: partial revert "remove ctx->lock" v2
  Revert "drm: of: Lookup if child node has panel or bridge"
  Revert "drm: of: Properly try all possible cases for bridge/panel detection"
  drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
  drm/vmwgfx: Fix gem refcounting and memory evictions
  drm/vc4: Fix build error when CONFIG_DRM_VC4=y && CONFIG_RASPBERRYPI_FIRMWARE=m
  drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare
  drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised
  dma-buf-map: remove renamed header file
  drm/radeon: fix logic inversion in radeon_sync_resv

Merge tag 'input-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

- a new set of keycodes to be used by marine navigation systems

- minor fixes to omap4-keypad and cypress-sf drivers

* tag 'input-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: add Marine Navigation Keycodes
  Input: omap4-keypad - fix pm_runtime_get_sync() error checking
  Input: cypress-sf - register a callback to disable the regulators

Merge tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Just two small regression fixes for bcache"

* tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block:
bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook()
bcache: put bch_bio_map() back to correct location in journal_write_unlocked()

Merge tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"Just two small fixes - one fixing a potential leak for the iovec for
  larger requests added in this cycle, and one fixing a theoretical leak
  with CQE_SKIP and IOPOLL"

* tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block:
  io_uring: fix leaks on IOPOLL and CQE_SKIP
  io_uring: free iovec if file assignment fails

Merge tag 'perf-tools-fixes-for-v5.18-2022-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

- Fix header include for LLVM >= 14 when building with libclang.

- Allow access to 'data_src' for auxtrace in 'perf script' with ARM SPE
   perf.data files, fixing processing data with such attributes.

- Fix error message for test case 71 ("Convert perf time to TSC") on
   s390, where it is not supported.

* tag 'perf-tools-fixes-for-v5.18-2022-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf test: Fix error message for test case 71 on s390, where it is not supported
  perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
  perf script: Always allow field 'data_src' for auxtrace
  perf clang: Fix header include for LLVM >= 14

sparc: cacheflush_32.h needs struct page

Add a struct page forward declaration to cacheflush_32.h.
Fixes this build warning:

    CC      drivers/crypto/xilinx/zynqmp-sha.o
  In file included from arch/sparc/include/asm/cacheflush.h:11,
                   from include/linux/cacheflush.h:5,
                   from drivers/crypto/xilinx/zynqmp-sha.c:6:
  arch/sparc/include/asm/cacheflush_32.h:38:37: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
     38 | void sparc_flush_page_to_ram(struct page *page);

Exposed by commit 0e03b8fd2936 ("crypto: xilinx - Turn SHA into a
tristate and allow COMPILE_TEST") but not Fixes: that commit because the
underlying problem is older.

Signed-off-by: Randy Dunlap <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: [email protected]
Acked-by: Sam Ravnborg <[email protected]>
Acked-by: Herbert Xu <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'drm-misc-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Two fixes for the raspberrypi panel initialisation, one fix for a logic
inversion in radeon, a build and pm refcounting fix for vc4, two reverts
for drm_of_get_bridge that caused a number of regression and a locking
regression for amdgpu.

Signed-off-by: Dave Airlie <[email protected]>
From: Maxime Ripard <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20220422084403.2xrhf3jusdej5yo4@houat

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
"Fix some syzbot-detected bugs, as well as other bugs found by I/O
  injection testing.

  Change ext4's fallocate to consistently drop set[ug]id bits when an
  fallocate operation might possibly change the user-visible contents of
  a file.

  Also, improve handling of potentially invalid values in the the
  s_overhead_cluster superblock field to avoid ext4 returning a negative
  number of free blocks"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix a potential race while discarding reserved buffers after an abort
  ext4: update the cached overhead value in the superblock
  ext4: force overhead calculation if the s_overhead_cluster makes no sense
  ext4: fix overhead calculation to account for the reserved gdt blocks
  ext4, doc: fix incorrect h_reserved size
  ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
  ext4: fix use-after-free in ext4_search_dir
  ext4: fix bug_on in start_this_handle during umount filesystem
  ext4: fix symlink file size not match to file content
  ext4: fix fallocate to use file_modified to update permissions consistently

Merge tag 'ata-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

Pull ATA fix from Damien Le Moal:
"A single fix to avoid a NULL pointer dereference in the pata_marvell
driver with adapters not supporting DMA, from Zheyu"

* tag 'ata-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: pata_marvell: Check the 'bmdma_addr' beforing reading

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"The main and larger change here is a workaround for AMD's lack of
  cache coherency for encrypted-memory guests.

  I have another patch pending, but it's waiting for review from the
  architecture maintainers.

  RISC-V:

   - Remove 's' & 'u' as valid ISA extension

   - Do not allow disabling the base extensions 'i'/'m'/'a'/'c'

  x86:

   - Fix NMI watchdog in guests on AMD

   - Fix for SEV cache incoherency issues

   - Don't re-acquire SRCU lock in complete_emulated_io()

   - Avoid NULL pointer deref if VM creation fails

   - Fix race conditions between APICv disabling and vCPU creation

   - Bugfixes for disabling of APICv

   - Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume

  selftests:

   - Do not use bitfields larger than 32-bits, they differ between GCC
     and clang"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: selftests: introduce and use more page size-related constants
  kvm: selftests: do not use bitfields larger than 32-bits for PTEs
  KVM: SEV: add cache flush to solve SEV cache incoherency issues
  KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
  KVM: SVM: Simplify and harden helper to flush SEV guest page(s)
  KVM: selftests: Silence compiler warning in the kvm_page_table_test
  KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
  x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
  KVM: SPDX style and spelling fixes
  KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
  KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
  KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
  KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
  KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused
  KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy
  KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()
  RISC-V: KVM: Restrict the extensions that can be disabled
  RISC-V: KVM: Remove 's' & 'u' as valid ISA extension

perf test: Fix error message for test case 71 on s390, where it is not supported

Test case 71 'Convert perf time to TSC' is not supported on s390.

Subtest 71.1 is skipped with the correct message, but subtest 71.2 is
not skipped and fails.

The root cause is function evlist__open() called from
test__perf_time_to_tsc().  evlist__open() returns -ENOENT because the
event cycles:u is not supported by the selected PMU, for example
platform s390 on z/VM or an x86_64 virtual machine.

The PMU driver returns -ENOENT in this case. This error is leads to the
failure.

Fix this by returning TEST_SKIP on -ENOENT.

Output before:
71: Convert perf time to TSC:
71.1: TSC support:             Skip (This architecture does not support)
71.2: Perf time to TSC:        FAILED!

Output after:
71: Convert perf time to TSC:
71.1: TSC support:             Skip (This architecture does not support)
71.2: Perf time to TSC:        Skip (perf_read_tsc_conversion is not supported)

This also happens on an x86_64 virtual machine:
   # uname -m
   x86_64
   $ ./perf test -F 71
    71: Convert perf time to TSC  :
    71.1: TSC support             : Ok
    71.2: Perf time to TSC        : FAILED!
   $

Committer testing:

Continues to work on x86_64:

  $ perf test 71
   71: Convert perf time to TSC    :
   71.1: TSC support               : Ok
   71.2: Perf time to TSC          : Ok
  $

Fixes: 290fa68bdc458863 ("perf test tsc: Fix error message when not supported")
Signed-off-by: Thomas Richter <[email protected]>
Acked-by: Sumanth Korikkar <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Chengdong Li <[email protected]>
Cc: [email protected]
Cc: Heiko Carstens <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event

Since commit bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem
info is not available") "perf mem report" and "perf report --mem-mode"
don't report result if the PERF_SAMPLE_DATA_SRC bit is missed in sample
type.

The commit ffab487052054162 ("perf: arm-spe: Fix perf report
--mem-mode") partially fixes the issue. It adds PERF_SAMPLE_DATA_SRC
bit for Arm SPE event, this allows the perf data file generated by
kernel v5.18-rc1 or later version can be reported properly.

On the other hand, perf tool still fails to be backward compatibility
for a data file recorded by an older version's perf which contains Arm
SPE trace data. This patch is a workaround in reporting phase, when
detects ARM SPE PMU event and without PERF_SAMPLE_DATA_SRC bit, it will
force to set the bit in the sample type and give a warning info.

Fixes: bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem info is not available")
Reviewed-by: James Clark <[email protected]>
Signed-off-by: Leo Yan <[email protected]>
Tested-by: German Gomez <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf script: Always allow field 'data_src' for auxtrace

If use command 'perf script -F,+data_src' to dump memory samples with
Arm SPE trace data, it reports error:

# perf script -F,+data_src
Samples for 'dummy:u' event do not have DATA_SRC attribute set. Cannot print 'data_src' field.

This is because the 'dummy:u' event is absent DATA_SRC bit in its sample
type, so if a file contains AUX area tracing data then always allow
field 'data_src' to be selected as an option for perf script.

Fixes: e55ed3423c1bb29f ("perf arm-spe: Synthesize memory event")
Signed-off-by: Leo Yan <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf clang: Fix header include for LLVM >= 14

The header TargetRegistry.h has moved in LLVM/clang 14.

Committer notes:

The problem as noticed when building in ubuntu:22.04:

    90    98.61 ubuntu:22.04                  : FAIL gcc version 11.2.0 (Ubuntu 11.2.0-19ubuntu1)
      util/c++/clang.cpp:23:10: fatal error: llvm/Support/TargetRegistry.h: No such file or directory
         23 | #include "llvm/Support/TargetRegistry.h"
            |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      compilation terminated.

Fixed after applying this patch.

Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Guilherme Amadio <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://twitter.com/GuilhermeAmadio/status/1514970524232921088
Link: http://lore.kernel.org/lkml/Ylp0M/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

gpio: Request interrupts after IRQ is initialized

Commit 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members
before initialization") attempted to fix a race condition that lead to a
NULL pointer, but in the process caused a regression for _AEI/_EVT
declared GPIOs.

This manifests in messages showing deferred probing while trying to
allocate IRQs like so:

  amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0000 to IRQ, err -517
  amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x002C to IRQ, err -517
  amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003D to IRQ, err -517
  [ .. more of the same .. ]

The code for walking _AEI doesn't handle deferred probing and so this
leads to non-functional GPIO interrupts.

Fix this issue by moving the call to `acpi_gpiochip_request_interrupts`
to occur after gc->irc.initialized is set.

Fixes: 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members before initialization")
Link: https://lore.kernel.org/linux-gpio/BL1PR12MB51577A77F000A008AA694675E2EF9@BL1PR12MB5157.namprd12.prod.outlook.com/
Link: https://bugzilla.suse.com/show_bug.cgi?id=1198697
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215850
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1979
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1976
Reported-by: Mario Limonciello <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Shreeya Patel <[email protected]>
Tested-By: Samuel Čavoj <[email protected]>
Tested-By: [email protected] Link:
Reviewed-by: Andy Shevchenko <[email protected]>
Acked-by: Linus Walleij <[email protected]>
Reviewed-and-tested-by: Takashi Iwai <[email protected]>
Cc: Shreeya Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes Palmer Dabbelt:

- A pair of build fixes for the recent cpuidle driver

- A fix for systems without sv57 that manifests as a crash
   early in boot

* tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  RISC-V: cpuidle: fix Kconfig select for RISCV_SBI_CPUIDLE
  RISC-V: mm: Fix set_satp_mode() for platform not having Sv57
  cpuidle: riscv: support non-SMP config

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
"There's no real pattern to the fixes, but the main one fixes our
  pmd_leaf() definition to resolve a NULL dereference on the migration
  path.

   - Fix PMU event validation in the absence of any event counters

   - Fix allmodconfig build using clang in conjunction with binutils

   - Fix definitions of pXd_leaf() to handle PROT_NONE entries

   - More typo fixes"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: mm: fix p?d_leaf()
  arm64: fix typos in comments
  arm64: Improve HAVE_DYNAMIC_FTRACE_WITH_REGS selection for clang
  arm_pmu: Validate single/group leader events

arm/xen: Fix some refcount leaks

The of_find_compatible_node() function returns a node pointer with
refcount incremented, We should use of_node_put() on it when done
Add the missing of_node_put() to release the refcount.

Fixes: 9b08aaa3199a ("ARM: XEN: Move xen_early_init() before efi_init()")
Fixes: b2371587fe0c ("arm/xen: Read extended regions from DT and init Xen resource")
Signed-off-by: Miaoqian Lin <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
Signed-off-by: Stefano Stabellini <[email protected]>

Merge tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray

Pull xarray fixes from Matthew Wilcox:
"Syzbot found a nasty race between large page splitting and page
  lookup. Details in the commit log, but fortunately it has a reliable
  reproducer. I thought it better to send this one to you straight away.

  Also fix the test suite build for kmem_cache_alloc_lru()"

* tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray:
  XArray: Disallow sibling entries of nodes
  tools: Add kmem_cache_alloc_lru()

Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
"Four fixes, two of them for stable:

   - fcollapse fix

   - reconnect lock fix

   - DFS oops fix

   - minor cleanup patch"

* tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: destage any unwritten data to the server before calling copychunk_write
  cifs: use correct lock type in cifs_reconnect()
  cifs: fix NULL ptr dereference in refresh_mounts()
  cifs: Use kzalloc instead of kmalloc/memset

Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull mount_setattr fix from Christian Brauner:
"The recent cleanup in e257039f0fc7 ("mount_setattr(): clean the
  control flow and calling conventions") switched the mount attribute
  codepaths from do-while to for loops as they are more idiomatic when
  walking mounts.

  However, we did originally choose do-while constructs because if we
  request a mount or mount tree to be made read-only we need to hold
  writers in the following way: The mount attribute code will grab
  lock_mount_hash() and then call mnt_hold_writers() which will
  _unconditionally_ set MNT_WRITE_HOLD on the mount.

  Any callers that need write access have to call mnt_want_write(). They
  will immediately see that MNT_WRITE_HOLD is set on the mount and the
  caller will then either spin (on non-preempt-rt) or wait on
  lock_mount_hash() (on preempt-rt).

  The fact that MNT_WRITE_HOLD is set unconditionally means that once
  mnt_hold_writers() returns we need to _always_ pair it with
  mnt_unhold_writers() in both the failure and success paths.

  The do-while constructs did take care of this. But Al's change to a
  for loop in the failure path stops on the first mount we failed to
  change mount attributes _without_ going into the loop to call
  mnt_unhold_writers().

  This in turn means that once we failed to make a mount read-only via
  mount_setattr() - i.e. there are already writers on that mount - we
  will block any writers indefinitely. Fix this by ensuring that the for
  loop always unsets MNT_WRITE_HOLD including the first mount we failed
  to change to read-only. Also sprinkle a few comments into the cleanup
  code to remind people about what is happening including myself. After
  all, I didn't catch it during review.

  This is only relevant on mainline and was reported by syzbot. Details
  about the syzbot reports are all in the commit message"

* tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: unset MNT_WRITE_HOLD on failure

Merge tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"At this time, the majority of changes are for pending ASoC fixes while
  a few usual HD-audio and USB-audio quirks are found.

  Almost all patches are small device-specific fixes, and nothing
  worrisome stands out, so far"

* tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits)
  ALSA: hda/realtek: Add quirk for Clevo NP70PNP
  ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDs
  ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook 845/865 G9
  ALSA: usb-audio: Clear MIDI port active flag after draining
  ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
  ALSA: hda/i915: Fix one too many pci_dev_put()
  ALSA: hda/hdmi: add HDMI codec VID for Raptorlake-P
  ALSA: hda/hdmi: fix warning about PCM count when used with SOF
  sound/oss/dmasound: fix 'dmasound_setup' defined but not used
  firmware: cs_dsp: Fix overrun of unterminated control name string
  ASoC: codecs: Fix an error handling path in (rx|tx|va)_macro_probe()
  ASoC: Intel: sof_es8336: Add a quirk for Huawei Matebook D15
  ASoC: Intel: sof_es8336: add a quirk for headset at mic1 port
  ASoC: Intel: sof_es8336: support a separate gpio to control headphone
  ASoC: Intel: sof_es8336: simplify speaker gpio naming
  ASoC: wm8731: Disable the regulator when probing fails
  ASoC: Intel: soc-acpi: correct device endpoints for max98373
  ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
  ASoC: SOF: topology: Fix memory leak in sof_control_load()
  ASoC: SOF: topology: cleanup dailinks on widget unload
  ...

XArray: Disallow sibling entries of nodes

There is a race between xas_split() and xas_load() which can result in
the wrong page being returned, and thus data corruption.  Fortunately,
it's hard to hit (syzbot took three months to find it) and often guarded
with VM_BUG_ON().

The anatomy of this race is:

thread A thread B
order-9 page is stored at index 0x200
lookup of page at index 0x274
page split starts
load of sibling entry at offset 9
stores nodes at offsets 8-15
load of entry at offset 8

The entry at offset 8 turns out to be a node, and so we descend into it,
and load the page at index 0x234 instead of 0x274.  This is hard to fix
on the split side; we could replace the entire node that contains the
order-9 page instead of replacing the eight entries.  Fixing it on
the lookup side is easier; just disallow sibling entries that point
to nodes.  This cannot ever be a useful thing as the descent would not
know the correct offset to use within the new node.

The test suite continues to pass, but I have not added a new test for
this bug.

Reported-by: [email protected]
Tested-by: [email protected]
Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

tools: Add kmem_cache_alloc_lru()

Turn kmem_cache_alloc() into a wrapper around kmem_cache_alloc_lru().

Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reported-by: Liam R. Howlett <[email protected]>
Reported-by: Li Wang <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"13 patches.

  Subsystems affected by this patch series: mm (memory-failure, memcg,
  userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov"

* emailed patches from Andrew Morton <[email protected]>:
  mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
  kcov: don't generate a warning on vm_insert_page()'s failure
  MAINTAINERS: add Vincenzo Frascino to KASAN reviewers
  oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
  selftest/vm: add skip support to mremap_test
  selftest/vm: support xfail in mremap_test
  selftest/vm: verify remap destination address in mremap_test
  selftest/vm: verify mmap addr in mremap_test
  mm, hugetlb: allow for "high" userspace addresses
  userfaultfd: mark uffd_wp regardless of VM_WRITE flag
  memcg: sync flush only if periodic flush is delayed
  mm/memory-failure.c: skip huge_zero_page in memory_failure()
  mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

mm/vmalloc: huge vmalloc backing pages should be split rather than compound

Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).

However a similar problem exists for other struct page fields callers
use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
not only refcounts it but uses ->lru, ->mapping, ->index.

This is not compatible with compound sub-pages, and can cause bad page
state issues like

  BUG: Bad page state in process swapper/0  pfn:00743
  page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
  flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
  raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: corrupted mapping in tail page
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
  Call Trace:
    dump_stack_lvl+0x74/0xa8 (unreliable)
    bad_page+0x12c/0x170
    free_tail_pages_check+0xe8/0x190
    free_pcp_prepare+0x31c/0x4e0
    free_unref_page+0x40/0x1b0
    __vunmap+0x1d8/0x420
    ...

The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Nicholas Piggin <[email protected]>
Cc: Paul Menzel <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: mm: fix p?d_leaf()

The pmd_leaf() is used to test a leaf mapped PMD, however, it misses
the PROT_NONE mapped PMD on arm64. Fix it. A real world issue [1]
caused by this was reported by Qian Cai. Also fix pud_leaf().

Link: https://patchwork.kernel.org/comment/24798260/
Fixes: 8aa82df3c123 ("arm64: mm: add p?d_leaf() definitions")
Reported-by: Qian Cai <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>

Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Extra quiet after Easter, only have minor i915 and msm pulls. However
  I haven't seen a PR from our misc tree in a little while, I've cc'ed
  all the suspects. Once that unblocks I expect a bit larger bunch of
  patches to arrive.

  Otherwise as I said, one msm revert and two i915 fixes.

  msm:

   - revert iommu change that broke some platforms.

  i915:

   - Unset enable_psr2_sel_fetch if PSR2 detection fails

   - Fix to detect when VRR is turned off from panel settings"

* tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm:
  drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
  drm/msm: Revert "drm/msm: Stop using iommu_present()"
  drm/i915/display/vrr: Reset VRR capable property on a long hpd

mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()

In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use.  Consider the following sequence:

  CPU0 - mn_tree_inv_end()            CPU1 - mmu_interval_notifier_remove()
  ----------------------------------- ------------------------------------
                                      spin_lock(subscriptions->lock);
                                      seq = subscriptions->invalidate_seq;
  spin_lock(subscriptions->lock);     spin_unlock(subscriptions->lock);
  subscriptions->invalidate_seq++;
                                      wait_event(invalidate_seq != seq);
                                      return;
  interval_tree_remove(interval_sub); kfree(interval_sub);
  spin_unlock(subscriptions->lock);
  wake_up_all();

As the wait_event() condition is true it will return immediately.  This
can lead to use-after-free type errors if the caller frees the data
structure containing the interval notifier subscription while it is
still on a deferred list.  Fix this by taking the appropriate lock when
reading invalidate_seq to ensure proper synchronisation.

I observed this whilst running stress testing during some development.
You do have to be pretty unlucky, but it leads to the usual problems of
use-after-free (memory corruption, kernel crash, difficult to diagnose
WARN_ON, etc).

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier")
Signed-off-by: Alistair Popple <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Cc: Christian König <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kcov: don't generate a warning on vm_insert_page()'s failure

vm_insert_page()'s failure is not an unexpected condition, so don't do
WARN_ONCE() in such a case.

Instead, print a kernel message and just return an error code.

This flaw has been reported under an OOM condition by sysbot [1].

The message is mainly for the benefit of the test log, in this case the
fuzzer's log so that humans inspecting the log can figure out what was
going on. KCOV is a testing tool, so I think being a little more chatty
when KCOV unexpectedly is about to fail will save someone debugging
time.

We don't want the WARN, because it's not a kernel bug that syzbot should
report, and failure can happen if the fuzzer tries hard enough (as
above).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b3d7fe86fbd0 ("kcov: properly handle subsequent mmap calls"),
Signed-off-by: Aleksandr Nogikh <[email protected]>
Acked-by: Marco Elver <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Taras Madan <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: add Vincenzo Frascino to KASAN reviewers

Add my email address to KASAN reviewers list to make sure that I am
Cc'ed in all the KASAN changes that may affect arm64 MTE.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vincenzo Frascino <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup

The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper.  This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.

A race can occur between exit_mm and the oom reaper that allows the oom
reaper to free the memory of the futex robust list before the exit path
has handled the futex death:

    CPU1                               CPU2
    --------------------------------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper
                                        oom_reaper
                                        oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

If the get_user EFAULT's, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.

Delay the OOM reaper, allowing more time for the exit path to perform
the futex cleanup.

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

Based on a patch by Michal Hocko.

Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <[email protected]>
Signed-off-by: Nico Pache <[email protected]>
Co-developed-by: Joel Savitz <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Herton R. Krzesinski <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joel Savitz <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftest/vm: add skip support to mremap_test

Allow the mremap test to be skipped due to errors such as failing to
parse the mmap_min_addr sysctl.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftest/vm: support xfail in mremap_test

Use ksft_test_result_xfail for the tests which are expected to fail.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftest/vm: verify remap destination address in mremap_test

Because mremap does not have a MAP_FIXED_NOREPLACE flag, it can destroy
existing mappings. This causes a segfault when regions such as text are
remapped and the permissions are changed.

Verify the requested mremap destination address does not overlap any
existing mappings by using mmap's MAP_FIXED_NOREPLACE flag. Keep
incrementing the destination address until a valid mapping is found or
fail the current test once the max address is reached.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftest/vm: verify mmap addr in mremap_test

Avoid calling mmap with requested addresses that are less than the
system's mmap_min_addr. When run as root, mmap returns EACCES when
trying to map addresses < mmap_min_addr. This is not one of the error
codes for the condition to retry the mmap in the test.

Rather than arbitrarily retrying on EACCES, don't attempt an mmap until
addr > vm.mmap_min_addr.

Add a munmap call after an alignment check as the mappings are retained
after the retry and can reach the vm.max_map_count sysctl.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, hugetlb: allow for "high" userspace addresses

This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.

This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).

Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.

So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area().  To allow that, move those two macros out
of mm/mmap.c into include/linux/sched/mm.h

If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.

For the time being, only ARM64 is affected by this change.

Catalin (ARM64) said
"We should have fixed hugetlb_get_unmapped_area() as well when we added
  support for 52-bit VA. The reason for commit f6795053dac8 was to
  prevent normal mmap() from returning addresses above 48-bit by default
  as some user-space had hard assumptions about this.

  It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
  but I doubt anyone would notice. It's more likely that the current
  behaviour would cause issues, so I'd rather have them consistent.

  Basically when arm64 gained support for 52-bit addresses we did not
  want user-space calling mmap() to suddenly get such high addresses,
  otherwise we could have inadvertently broken some programs (similar
  behaviour to x86 here). Hence we added commit f6795053dac8. But we
  missed hugetlbfs which could still get such high mmap() addresses. So
  in theory that's a potential regression that should have bee addressed
  at the same time as commit f6795053dac8 (and before arm64 enabled
  52-bit addresses)"

Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: <[email protected]> [5.0.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>