Git Repo - linux.git/log

Merge tag 'generic-ticket-spinlocks-v6' into for-next

asm-generic: New generic ticket-based spinlock

This contains a new ticket-based spinlock that uses only generic
atomics and doesn't require as much from the memory system as qspinlock
does in order to be fair.  It also includes a bit of documentation about
the qspinlock and qrwlock fairness requirements.

This will soon be used by a handful of architectures that don't meet the
qspinlock requirements.

* tag 'generic-ticket-spinlocks-v6':
  csky: Move to generic ticket-spinlock
  RISC-V: Move to queued RW locks
  RISC-V: Move to generic spinlocks
  openrisc: Move to ticket-spinlock
  asm-generic: qrwlock: Document the spinlock fairness requirements
  asm-generic: qspinlock: Indicate the use of mixed-size atomics
  asm-generic: ticket-lock: New generic ticket-based spinlock

riscv: kexec: add kexec_file_load() support

This patch set implements kexec_file_load() for RISC-V, which is
currently only allowed on rv64 due to some minor build issues on 32-bit
platforms in the generic code.  This allows users to kexec() using an FD
as opposed to a buffer.

Link: https://lore.kernel.org/all/[email protected]/
* palmer/riscv-kexec_file:
  RISC-V: Load purgatory in kexec_file
  RISC-V: Add purgatory
  RISC-V: Support for kexec_file on panic
  RISC-V: Add kexec_file support
  RISC-V: use memcpy for kexec_file mode
  kexec_file: Fix kexec_file.c build error for riscv platform

RISC-V: Load purgatory in kexec_file

This patch supports kexec_file to load and relocate purgatory.
It works well on riscv64 QEMU, being tested with devmem.

Signed-off-by: Li Zhengyu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Add purgatory

This patch adds purgatory, the name and concept have been taken
from kexec-tools. Purgatory runs between two kernels, and do
verify sha256 hash to ensure the kernel to jump to is fine and
has not been corrupted after loading. Makefile is modified based
on x86 platform.

Signed-off-by: Li Zhengyu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Support for kexec_file on panic

This patch adds support for loading a kexec on panic (kdump) kernel.
It has been tested with vmcore-dmesg on riscv64 QEMU on both an smp
and a non-smp system.

Signed-off-by: Li Zhengyu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Add kexec_file support

This patch adds support for kexec_file on RISC-V. I tested it on riscv64
QEMU with busybear-linux and single core along with the OpenSBI firmware
fw_jump.bin for generic platform.

On SMP system, it depends on CONFIG_{HOTPLUG_CPU, RISCV_SBI} to
resume/stop hart through OpenSBI firmware, it also needs a OpenSBI that
support the HSM extension.

Signed-off-by: Liao Chang <[email protected]>
Signed-off-by: Li Zhengyu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[Palmer: Make 64-bit only]
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: use memcpy for kexec_file mode

The pointer to buffer loading kernel binaries is in kernel space for
kexec_fil mode, When copy_from_user copies data from pointer to a block
of memory, it checkes that the pointer is in the user space range, on
RISCV-V that is:

static inline bool __access_ok(unsigned long addr, unsigned long size)
{
return size <= TASK_SIZE && addr <= TASK_SIZE - size;
}

and TASK_SIZE is 0x4000000000 for 64-bits, which now causes
copy_from_user to reject the access of the field 'buf' of struct
kexec_segment that is in range [CONFIG_PAGE_OFFSET - VMALLOC_SIZE,
CONFIG_PAGE_OFFSET), is invalid user space pointer.

This patch fixes this issue by skipping access_ok(), use mempcy() instead.

Signed-off-by: Liao Chang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

kexec_file: Fix kexec_file.c build error for riscv platform

When CONFIG_KEXEC_FILE is set for riscv platform, the compilation of
kernel/kexec_file.c generate build error:

kernel/kexec_file.c: In function 'crash_prepare_elf64_headers':
./arch/riscv/include/asm/page.h:110:71: error: request for member 'virt_addr' in something not a structure or union
  110 |  ((x) >= PAGE_OFFSET && (!IS_ENABLED(CONFIG_64BIT) || (x) < kernel_map.virt_addr))
      |                                                                       ^
./arch/riscv/include/asm/page.h:131:2: note: in expansion of macro 'is_linear_mapping'
  131 |  is_linear_mapping(_x) ?       \
      |  ^~~~~~~~~~~~~~~~~
./arch/riscv/include/asm/page.h:140:31: note: in expansion of macro '__va_to_pa_nodebug'
  140 | #define __phys_addr_symbol(x) __va_to_pa_nodebug(x)
      |                               ^~~~~~~~~~~~~~~~~~
./arch/riscv/include/asm/page.h:143:24: note: in expansion of macro '__phys_addr_symbol'
  143 | #define __pa_symbol(x) __phys_addr_symbol(RELOC_HIDE((unsigned long)(x), 0))
      |                        ^~~~~~~~~~~~~~~~~~
kernel/kexec_file.c:1327:36: note: in expansion of macro '__pa_symbol'
1327 |   phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);

This occurs is because the "kernel_map" referenced in macro
is_linear_mapping()  is suppose to be the one of struct kernel_mapping
defined in arch/riscv/mm/init.c, but the 2nd argument of
crash_prepare_elf64_header() has same symbol name, in expansion of macro
is_linear_mapping in function crash_prepare_elf64_header(), "kernel_map"
actually is the local variable.

Signed-off-by: Liao Chang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Add support for rv32 userspace via COMPAT

The RISC-V port supports the rv32i and rv64i base ISAs, but provides no
mechanism to run 32-bit userspace on 64-bit systems.  This adds that
support, via the COMPAT framework.  As the RISC-V ISAs (and uABIs) were
developed concurrently, the resulting compat support is mostly generic.

This includes a handful of cleanups to the generic compat infrastructure
to more cleanly support RISC-V, followed by the RISC-V implementation.

* palmer/riscv-compat:
  riscv: compat: Add COMPAT Kbuild skeletal support
  riscv: compat: ptrace: Add compat_arch_ptrace implement
  riscv: compat: signal: Add rt_frame implementation
  riscv: compat: vdso: Add setup additional pages implementation
  riscv: compat: vdso: Add COMPAT_VDSO base code implementation
  riscv: compat: Add hw capability check for elf
  riscv: compat: Add elf.h implementation
  riscv: compat: process: Add UXL_32 support in start_thread
  riscv: compat: syscall: Add entry.S implementation
  riscv: compat: syscall: Add compat_sys_call_table implementation
  riscv: compat: Support TASK_SIZE for compat mode
  riscv: compat: Add basic compat data type implementation
  riscv: Fixup difference with defconfig
  syscalls: compat: Fix the missing part for __SYSCALL_COMPAT
  asm-generic: compat: Cleanup duplicate definitions
  fs: stat: compat: Add __ARCH_WANT_COMPAT_STAT
  arch: Add SYSVIPC_COMPAT for all architectures
  compat: consolidate the compat_flock{,64} definition
  uapi: always define F_GETLK64/F_SETLK64/F_SETLKW64 in fcntl.h
  uapi: simplify __ARCH_FLOCK{,64}_PAD a little

riscv: compat: Add COMPAT Kbuild skeletal support

Adds initial skeletal COMPAT Kbuild (Running 32bit U-mode on
64bit S-mode) support.
- Setup kconfig & dummy functions for compiling.
- Implement compat_start_thread by the way.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: ptrace: Add compat_arch_ptrace implement

Now, you can use native gdb on riscv64 for rv32 app debugging.

$ uname -a
Linux buildroot 5.16.0-rc4-00036-gbef6b82fdf23-dirty #53 SMP Mon Dec 20 23:06:53 CST 2021 riscv64 GNU/Linux
$ cat /proc/cpuinfo
processor       : 0
hart            : 0
isa             : rv64imafdcsuh
mmu             : sv48

$ file /bin/busybox
/bin/busybox: setuid ELF 32-bit LSB shared object, UCB RISC-V, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-riscv32-ilp32d.so.1, for GNU/Linux 5.15.0, stripped
$ file /usr/bin/gdb
/usr/bin/gdb: ELF 32-bit LSB shared object, UCB RISC-V, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-riscv32-ilp32d.so.1, for GNU/Linux 5.15.0, stripped
$ /usr/bin/gdb /bin/busybox
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...
Reading symbols from /bin/busybox...
(No debugging symbols found in /bin/busybox)
(gdb) b main
Breakpoint 1 at 0x8ddc
(gdb) r
Starting program: /bin/busybox
Failed to read a valid object file image from memory.

Breakpoint 1, 0x555a8ddc in main ()
(gdb) i r
ra             0x77df0b74       0x77df0b74
sp             0x7fdd3d10       0x7fdd3d10
gp             0x5567e800       0x5567e800 <bb_common_bufsiz1+160>
tp             0x77f64280       0x77f64280
t0             0x0      0
t1             0x555a6fac       1431990188
t2             0x77dd8db4       2011008436
fp             0x7fdd3e34       0x7fdd3e34
s1             0x7fdd3e34       2145205812
a0             0xffffffff       -1
a1             0x2000   8192
a2             0x7fdd3e3c       2145205820
a3             0x0      0
a4             0x7fdd3d30       2145205552
a5             0x555a8dc0       1431997888
a6             0x77f2c170       2012397936
a7             0x6a7c7a2f       1786542639
s2             0x0      0
s3             0x0      0
s4             0x555a8dc0       1431997888
s5             0x77f8a3a8       2012783528
s6             0x7fdd3e3c       2145205820
s7             0x5567cecc       1432866508
--Type <RET> for more, q to quit, c to continue without paging--
s8             0x1      1
s9             0x0      0
s10            0x55634448       1432568904
s11            0x0      0
t3             0x77df0bb8       2011106232
t4             0x42fc   17148
t5             0x0      0
t6             0x40     64
pc             0x555a8ddc       0x555a8ddc <main+28>
(gdb) si
0x555a78f0 in mallopt@plt ()
(gdb) c
Continuing.
BusyBox v1.34.1 (2021-12-19 22:39:48 CST) multi-call binary.
BusyBox is copyrighted by many authors between 1998-2015.
Licensed under GPLv2. See source distribution for detailed
copyright notices.

Usage: busybox [function [arguments]...]
   or: busybox --list[-full]
...
[Inferior 1 (process 107) exited normally]
(gdb) q

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: signal: Add rt_frame implementation

Implement compat_setup_rt_frame for sigcontext save & restore. The
main process is the same with signal, but the rv32 pt_regs' size
is different from rv64's, so we needs convert them.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: support for Svpbmt and D1 memory types

Adds support for Svpbmt, the "Supervisor-mode: page-based memory types"
extension, which allows pages to be marked as non-cacheable and/or I/O.
This also includes support for the Allwinner D1's page table attributes
via the alternatives framework, which differ from Svpbmt in various ways
but are necessary to make the D1 function.

* palmer/riscv-d1:
  riscv: add memory-type errata for T-Head
  riscv: don't use global static vars to store alternative data
  riscv: remove FIXMAP_PAGE_IO and fall back to its default value
  riscv: add RISC-V Svpbmt extension support
  riscv: Fix accessing pfn bits in PTEs for non-32bit variants
  riscv: move boot alternatives to after fill_hwcap
  riscv: prevent compressed instructions in alternatives
  riscv: extend concatenated alternatives-lines to the same length
  riscv: implement ALTERNATIVE_2 macro
  riscv: implement module alternatives
  riscv: allow different stages with alternatives
  riscv: integrate alternatives better into the main architecture

riscv: add memory-type errata for T-Head

Some current cpus based on T-Head cores implement memory-types
way different than described in the svpbmt spec even going
so far as using PTE bits marked as reserved.

Add the T-Head vendor-id and necessary errata code to
replace the affected instructions.

Signed-off-by: Heiko Stuebner <[email protected]>
Tested-by: Samuel Holland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: don't use global static vars to store alternative data

Right now the code uses a global struct to store vendor-ids
and another global variable to store the vendor-patch-function.

There exist specific cases where we'll need to patch the kernel
at an even earlier stage, where trying to write to a static
variable might actually result in hangs.

Also collecting the vendor-information consists of 3 sbi-ecalls
(or csr-reads) which is pretty negligible in the context of
booting a kernel.

So rework the code to not rely on static variables and instead
collect the vendor-information when a round of alternatives is
to be applied.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Guo Ren <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: remove FIXMAP_PAGE_IO and fall back to its default value

If not defined in the arch, FIXMAP_PAGE_IO defaults to PAGE_KERNEL_IO,
which we defined when adding the svpbmt implementation.

So drop the FIXMAP_PAGE_IO riscv define.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Reviewed-by: Guo Ren <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: add RISC-V Svpbmt extension support

Svpbmt (the S should be capitalized) is the
"Supervisor-mode: page-based memory types" extension
that specifies attributes for cacheability, idempotency
and ordering.

The relevant settings are done in special bits in PTEs:

Here is the svpbmt PTE format:
| 63 | 62-61 | 60-8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
  N     MT     RSW    D   A   G   U   X   W   R   V
        ^

Of the Reserved bits [63:54] in a leaf PTE, the high bit is already
allocated (as the N bit), so bits [62:61] are used as the MT (aka
MemType) field. This field specifies one of three memory types that
are close equivalents (or equivalent in effect) to the three main x86
and ARMv8 memory types - as shown in the following table.

RISC-V
Encoding &
MemType     RISC-V Description
----------  ------------------------------------------------
00 - PMA    Normal Cacheable, No change to implied PMA memory type
01 - NC     Non-cacheable, idempotent, weakly-ordered Main Memory
10 - IO     Non-cacheable, non-idempotent, strongly-ordered I/O memory
11 - Rsvd   Reserved for future standard use

As the extension will not be present on all implementations,
implement a method to handle cpufeatures via alternatives
to not incur runtime penalties on cpu variants not supporting
specific extensions and patch relevant code parts at runtime.

Co-developed-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
Co-developed-by: Liu Shaohua <[email protected]>
Signed-off-by: Liu Shaohua <[email protected]>
Co-developed-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
[moved to use the alternatives mechanism]
Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Fix accessing pfn bits in PTEs for non-32bit variants

On rv32 the PFN part of PTEs is defined to use bits [xlen-1:10]
while on rv64 it is defined to use bits [53:10], leaving [63:54]
as reserved.

With upcoming optional extensions like svpbmt these previously
reserved bits will get used so simply right-shifting the PTE
to get the PFN won't be enough.

So introduce a _PAGE_PFN_MASK constant to mask the correct bits
for both rv32 and rv64 before shifting.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: move boot alternatives to after fill_hwcap

Move the application of boot alternatives to after the hw-capabilities
are populated. This allows to check for available extensions when
determining which alternatives to apply and also makes it actually
work if CONFIG_SMP is disabled for whatever reason.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Reviewed-by: Guo Ren <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: prevent compressed instructions in alternatives

Instructions are opportunistically compressed by the RISC-V assembler
when possible, but in alternatives-blocks both the old and new content
need to be the same size, so having the toolchain do somewhat random
optimizations will cause strange side-effects like
"attempt to move .org backwards" compile-time errors.

Already a simple "and" used in alternatives assembly will cause these
mismatched code sizes.

So prevent compressed instructions to be generated in alternatives-
code and use option-push and -pop to only limit this to the relevant
code blocks

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: extend concatenated alternatives-lines to the same length

ALT_NEW_CONTENT already uses same-length assembler lines, so
extend this to the other elements as well.

This makes it more readable when these elements need to be extended
in the future.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: implement ALTERNATIVE_2 macro

When the alternatives were added the commit already provided a template
on how to implement 2 different alternatives for one piece of code.

Make this usable.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: implement module alternatives

This allows alternatives to also be applied when loading modules
and follows the implementation of other architectures (e.g. arm64).

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: allow different stages with alternatives

Future features may need to be applied at a different
time during boot, so allow defining stages for alternatives
and handling them differently depending on the stage.

Also make the alternatives-location more flexible so that
future stages may provide their own location.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: integrate alternatives better into the main architecture

Right now the alternatives need to be explicitly enabled and
erratas are limited to SiFive ones.

We want to use alternatives not only for patching soc erratas,
but in the future also for handling different behaviour depending
on the existence of future extensions.

So move the core alternatives over to the kernel subdirectory
and move the CONFIG_RISCV_ALTERNATIVE to be a hidden symbol
which we expect relevant erratas and extensions to just select
if needed.

Signed-off-by: Heiko Stuebner <[email protected]>
Reviewed-by: Philipp Tomsich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

csky: Move to generic ticket-spinlock

There is no benefit from custom implementation for ticket-spinlock,
so move to generic ticket-spinlock for easy maintenance.

Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Move to queued RW locks

Now that we have fair spinlocks we can use the generic queued rwlocks,
so we might as well do so.

Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Move to generic spinlocks

Our existing spinlocks aren't fair and replacing them has been on the
TODO list for a long time. This moves to the recently-introduced ticket
spinlocks, which are simple enough that they are likely to be correct
and fast on the vast majority of extant implementations.

This introduces a horrible hack that allows us to split out the spinlock
conversion from the rwlock conversion. We have to do the spinlocks
first because qrwlock needs fair spinlocks, but we don't want to pollute
the asm-generic code to support the generic spinlocks without qrwlocks.
Thus we pollute the RISC-V code, but just until the next commit as it's
all going away.

Reviewed-by: Arnd Bergmann <[email protected]>
Reviewed-by: Guo Ren <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Tested-by: Conor Dooley <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

openrisc: Move to ticket-spinlock

We have no indications that openrisc meets the qspinlock requirements,
so move to ticket-spinlock as that is more likey to be correct.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Stafford Horne <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

asm-generic: qrwlock: Document the spinlock fairness requirements

I could only find the fairness requirements documented as the C code,
this calls them out in a comment just to be a bit more explicit.

Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

asm-generic: qspinlock: Indicate the use of mixed-size atomics

The qspinlock implementation depends on having well behaved mixed-size
atomics. This is true on the more widely-used platforms, but these
requirements are somewhat subtle and may not be satisfied by all the
platforms that qspinlock is used on.

Document these requirements, so ports that use qspinlock can more easily
determine if they meet these requirements.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Waiman Long <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

asm-generic: ticket-lock: New generic ticket-based spinlock

This is a simple, fair spinlock. Specifically it doesn't have all the
subtle memory model dependencies that qspinlock has, which makes it more
suitable for simple systems as it is more likely to be correct. It is
implemented entirely in terms of standard atomics and thus works fine
without any arch-specific code.

This replaces the existing asm-generic/spinlock.h, which just errored
out on SMP systems.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Reviewed-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: vdso: Add setup additional pages implementation

Reconstruct __setup_additional_pages() by appending vdso info
pointer argument to meet compat_vdso_info requirement. And change
vm_special_mapping *dm, *cm initialization into static.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: vdso: Add COMPAT_VDSO base code implementation

There is no vgettimeofday supported in rv32 that makes simple to
generate rv32 vdso code which only needs riscv64 compiler. Other
architectures need change compiler or -m (machine parameter) to
support vdso32 compiling. If rv32 support vgettimeofday (which
cause C compile) in future, we would add CROSS_COMPILE to support
that makes more requirement on compiler enviornment.

linux-rv64/arch/riscv/kernel/compat_vdso/compat_vdso.so.dbg:
file format elf64-littleriscv

Disassembly of section .text:

0000000000000800 <__vdso_rt_sigreturn>:
800:   08b00893                li      a7,139
804:   00000073                ecall
808:   0000                    unimp
        ...

000000000000080c <__vdso_getcpu>:
80c:   0a800893                li      a7,168
810:   00000073                ecall
814:   8082                    ret
        ...

0000000000000818 <__vdso_flush_icache>:
818:   10300893                li      a7,259
81c:   00000073                ecall
820:   8082                    ret

linux-rv32/arch/riscv/kernel/vdso/vdso.so.dbg:
file format elf32-littleriscv

Disassembly of section .text:

00000800 <__vdso_rt_sigreturn>:
800:   08b00893                li      a7,139
804:   00000073                ecall
808:   0000                    unimp
        ...

0000080c <__vdso_getcpu>:
80c:   0a800893                li      a7,168
810:   00000073                ecall
814:   8082                    ret
        ...

00000818 <__vdso_flush_icache>:
818:   10300893                li      a7,259
81c:   00000073                ecall
820:   8082                    ret

Finally, reuse all *.S from vdso in compat_vdso that makes
implementation clear and readable.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: Add hw capability check for elf

Detect hardware COMPAT (32bit U-mode) capability in rv64. If not
support COMPAT mode in hw, compat_elf_check_arch would return
false by compat_binfmt_elf.c

Add CLASS to enhance (compat_)elf_check_arch to distinguish
32BIT/64BIT elf.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: Add elf.h implementation

Implement necessary type and macro for compat elf. See the code
comment for detail.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: process: Add UXL_32 support in start_thread

If the current task is in COMPAT mode, set SR_UXL_32 in status for
returning userspace. We need CONFIG _COMPAT to prevent compiling
errors with rv32 defconfig.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: syscall: Add entry.S implementation

Implement the entry of compat_sys_call_table[] in asm. Ref to
riscv-privileged spec 4.1.1 Supervisor Status Register (sstatus):

BIT[32:33] = UXL[1:0]:
- 1:32
- 2:64
- 3:128

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: syscall: Add compat_sys_call_table implementation

Implement compat sys_call_table and some system call functions:
truncate64, ftruncate64, fallocate, pread64, pwrite64,
sync_file_range, readahead, fadvise64_64 which need argument
translation.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: Support TASK_SIZE for compat mode

Make TASK_SIZE from const to dynamic detect TIF_32BIT flag
function. Refer to arm64 to implement DEFAULT_MAP_WINDOW_64 for
efi-stub.

Limit 32-bit compatible process in 0-2GB virtual address range
(which is enough for real scenarios), because it could avoid
address sign extend problem when 32-bit enter 64-bit and ease
software design.

The standard 32-bit TASK_SIZE is 0x9dc00000:FIXADDR_START, and
compared to a compatible 32-bit, it increases 476MB for the
application's virtual address.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: compat: Add basic compat data type implementation

Implement riscv asm/compat.h for struct compat_xxx,
is_compat_task, compat_user_regset, regset convert.

The rv64 compat.h has inherited most of the structs
from the generic one.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Fixup difference with defconfig

Let's follow the origin patch's spirit:

The only difference between rv32_defconfig and defconfig is that
rv32_defconfig has CONFIG_ARCH_RV32I=y.

This is helpful to compare rv64-compat-rv32 v.s. rv32-linux.

Fixes: 1b937e8faa87ccfb ("RISC-V: Add separate defconfig for 32bit systems")
Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

syscalls: compat: Fix the missing part for __SYSCALL_COMPAT

Make "uapi asm unistd.h" could be used for architectures' COMPAT
mode. The __SYSCALL_COMPAT is first used in riscv.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

asm-generic: compat: Cleanup duplicate definitions

There are 7 64bit architectures that support Linux COMPAT mode to
run 32bit applications. A lot of definitions are duplicate:
- COMPAT_USER_HZ
- COMPAT_RLIM_INFINITY
- COMPAT_OFF_T_MAX
- __compat_uid_t, __compat_uid_t
- compat_dev_t
- compat_ipc_pid_t
- struct compat_flock
- struct compat_flock64
- struct compat_statfs
- struct compat_ipc64_perm, compat_semid64_ds,
compat_msqid64_ds, compat_shmid64_ds

Cleanup duplicate definitions and merge them into asm-generic.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Acked-by: Helge Deller <[email protected]> # parisc
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

fs: stat: compat: Add __ARCH_WANT_COMPAT_STAT

RISC-V doesn't neeed compat_stat, so using __ARCH_WANT_COMPAT_STAT
to exclude unnecessary SYSCALL functions.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Acked-by: Helge Deller <[email protected]> # parisc
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

arch: Add SYSVIPC_COMPAT for all architectures

The existing per-arch definitions are pretty much historic cruft.
Move SYSVIPC_COMPAT into init/Kconfig.

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Acked-by: Helge Deller <[email protected]> # parisc
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

compat: consolidate the compat_flock{,64} definition

Provide a single common definition for the compat_flock and
compat_flock64 structures using the same tricks as for the native
variants. Another extra define is added for the packing required on
x86.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Acked-by: Helge Deller <[email protected]> # parisc
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

uapi: always define F_GETLK64/F_SETLK64/F_SETLKW64 in fcntl.h

The F_GETLK64/F_SETLK64/F_SETLKW64 fcntl opcodes are only implemented
for the 32-bit syscall APIs, but are also needed for compat handling
on 64-bit kernels.

Consolidate them in unistd.h instead of definining the internal compat
definitions in compat.h, which is rather error prone (e.g. parisc
gets the values wrong currently).

Note that before this change they were never visible to userspace due
to the fact that CONFIG_64BIT is only set for kernel builds.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

uapi: simplify __ARCH_FLOCK{,64}_PAD a little

Don't bother to define the symbols empty, just don't use them.
That makes the intent a little more clear.

Remove the unused HAVE_ARCH_STRUCT_FLOCK64 define and merge the
32-bit mips struct flock into the generic one.

Add a new __ARCH_FLOCK_EXTRA_SYSID macro following the style of
__ARCH_FLOCK_PAD to avoid having a separate definition just for
one architecture.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Tested-by: Heiko Stuebner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: dts: rename the node name of dma

Rename the node name by the generic DMA naming

Signed-off-by: Zong Li <[email protected]>
Reviewed-by: Bin Meng <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: dts: Add dma-channels property and modify compatible

Add dma-channels property, then we can determine how many channels there
by device tree, in addition, we add the pdma versioning scheme for
compatible.

Signed-off-by: Zong Li <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Acked-by: Palmer Dabbelt <[email protected]>
Acked-by: Conor Dooley <[email protected]>
Reviewed-by: Bin Meng <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: mm: Remove the copy operation of pmd

Since all processes share the kernel address space,
we only need to copy pgd in case of a vmalloc page
fault exception, the other levels of page tables are
shared, so the operation of copying pmd is unnecessary.

Signed-off-by: Chuanhua Han <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

Linux 5.18-rc1

Merge tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull more tracing updates from Steven Rostedt:

- Rename the staging files to give them some meaning. Just
   stage1,stag2,etc, does not show what they are for

- Check for NULL from allocation in bootconfig

- Hold event mutex for dyn_event call in user events

- Mark user events to broken (to work on the API)

- Remove eBPF updates from user events

- Remove user events from uapi header to keep it from being installed.

- Move ftrace_graph_is_dead() into inline as it is called from hot
   paths and also convert it into a static branch.

* tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Move user_events.h temporarily out of include/uapi
  ftrace: Make ftrace_graph_is_dead() a static branch
  tracing: Set user_events to BROKEN
  tracing/user_events: Remove eBPF interfaces
  tracing/user_events: Hold event_mutex during dyn_event_add
  proc: bootconfig: Add null pointer check
  tracing: Rename the staging files for trace_events

Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fix from Stephen Boyd:
"A single revert to fix a boot regression seen when clk_put() started
  dropping rate range requests. It's best to keep various systems
  booting so we'll kick this out and try again next time"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  Revert "clk: Drop the rate range on clk_put()"

Merge tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"A set of x86 fixes and updates:

   - Make the prctl() for enabling dynamic XSTATE components correct so
     it adds the newly requested feature to the permission bitmap
     instead of overwriting it. Add a selftest which validates that.

   - Unroll string MMIO for encrypted SEV guests as the hypervisor
     cannot emulate it.

   - Handle supervisor states correctly in the FPU/XSTATE code so it
     takes the feature set of the fpstate buffer into account. The
     feature sets can differ between host and guest buffers. Guest
     buffers do not contain supervisor states. So far this was not an
     issue, but with enabling PASID it needs to be handled in the buffer
     offset calculation and in the permission bitmaps.

   - Avoid a gazillion of repeated CPUID invocations in by caching the
     values early in the FPU/XSTATE code.

   - Enable CONFIG_WERROR in x86 defconfig.

   - Make the X86 defconfigs more useful by adapting them to Y2022
     reality"

* tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu/xstate: Consolidate size calculations
  x86/fpu/xstate: Handle supervisor states in XSTATE permissions
  x86/fpu/xsave: Handle compacted offsets correctly with supervisor states
  x86/fpu: Cache xfeature flags from CPUID
  x86/fpu/xsave: Initialize offset/size cache early
  x86/fpu: Remove unused supervisor only offsets
  x86/fpu: Remove redundant XCOMP_BV initialization
  x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IO
  x86/config: Make the x86 defconfigs a bit more usable
  x86/defconfig: Enable WERROR
  selftests/x86/amx: Update the ARCH_REQ_XCOMP_PERM test
  x86/fpu/xstate: Fix the ARCH_REQ_XCOMP_PERM implementation

Merge tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RT signal fix from Thomas Gleixner:
"Revert the RT related signal changes. They need to be reworked and
generalized"

* tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"

Merge tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping

Pull more dma-mapping updates from Christoph Hellwig:

- fix a regression in dma remap handling vs AMD memory encryption (me)

- finally kill off the legacy PCI DMA API (Christophe JAILLET)

* tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: move pgprot_decrypted out of dma_pgprot
  PCI/doc: cleanup references to the legacy PCI DMA API
  PCI: Remove the deprecated "pci-dma-compat.h" API

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM fixes from Russell King:

- avoid unnecessary rebuilds for library objects

- fix return value of __setup handlers

- fix invalid input check for "crashkernel=" kernel option

- silence KASAN warnings in unwind_frame

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 9191/1: arm/stacktrace, kasan: Silence KASAN warnings in unwind_frame()
  ARM: 9190/1: kdump: add invalid input check for 'crashkernel=0'
  ARM: 9187/1: JIVE: fix return value of __setup handler
  ARM: 9189/1: decompressor: fix unneeded rebuilds of library objects

Revert "clk: Drop the rate range on clk_put()"

This reverts commit 7dabfa2bc4803eed83d6f22bd6f045495f40636b. There are
multiple reports that this breaks boot on various systems. The common
theme is that orphan clks are having rates set on them when that isn't
expected. Let's revert it out for now so that -rc1 boots.

Reported-by: Marek Szyprowski <[email protected]>
Reported-by: Tony Lindgren <[email protected]>
Reported-by: Alexander Stein <[email protected]>
Reported-by: Naresh Kamboju <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: Maxime Ripard <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge tag 'perf-tools-for-v5.18-2022-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull more perf tools updates from Arnaldo Carvalho de Melo:

- Avoid SEGV if core.cpus isn't set in 'perf stat'.

- Stop depending on .git files for building PERF-VERSION-FILE, used in
   'perf --version', fixing some perf tools build scenarios.

- Convert tracepoint.py example to python3.

- Update UAPI header copies from the kernel sources: socket,
   mman-common, msr-index, KVM, i915 and cpufeatures.

- Update copy of libbpf's hashmap.c.

- Directly return instead of using local ret variable in
   evlist__create_syswide_maps(), found by coccinelle.

* tag 'perf-tools-for-v5.18-2022-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf python: Convert tracepoint.py example to python3
  perf evlist: Directly return instead of using local ret variable
  perf cpumap: More cpu map reuse by merge.
  perf cpumap: Add is_subset function
  perf evlist: Rename cpus to user_requested_cpus
  perf tools: Stop depending on .git files for building PERF-VERSION-FILE
  tools headers cpufeatures: Sync with the kernel sources
  tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
  tools headers UAPI: Sync linux/kvm.h with the kernel sources
  tools kvm headers arm64: Update KVM headers from the kernel sources
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  tools headers UAPI: Sync asm-generic/mman-common.h with the kernel
  perf beauty: Update copy of linux/socket.h with the kernel sources
  perf tools: Update copy of libbpf's hashmap.c
  perf stat: Avoid SEGV if core.cpus isn't set

Merge tag 'kbuild-fixes-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- Fix empty $(PYTHON) expansion.

- Fix UML, which got broken by the attempt to suppress Clang warnings.

- Fix warning message in modpost.

* tag 'kbuild-fixes-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  modpost: restore the warning message for missing symbol versions
  Revert "um: clang: Strip out -mno-global-merge from USER_CFLAGS"
  kbuild: Remove '-mno-global-merge'
  kbuild: fix empty ${PYTHON} in scripts/link-vmlinux.sh
  kconfig: remove stale comment about removed kconfig_print_symbol()

Merge tag 'mips_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:

- build fix for gpio

- fix crc32 build problems

- check for failed memory allocations

* tag 'mips_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: crypto: Fix CRC32 code
  MIPS: rb532: move GPIOD definition into C-files
  MIPS: lantiq: check the return value of kzalloc()
  mips: sgi-ip22: add a check for the return of kzalloc()

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr

- Documentation improvements

- Prevent module exit until all VMs are freed

- PMU Virtualization fixes

- Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences

- Other miscellaneous bugfixes

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
  KVM: x86: fix sending PV IPI
  KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
  KVM: x86: Remove redundant vm_entry_controls_clearbit() call
  KVM: x86: cleanup enter_rmode()
  KVM: x86: SVM: fix tsc scaling when the host doesn't support it
  kvm: x86: SVM: remove unused defines
  KVM: x86: SVM: move tsc ratio definitions to svm.h
  KVM: x86: SVM: fix avic spec based definitions again
  KVM: MIPS: remove reference to trap&emulate virtualization
  KVM: x86: document limitations of MSR filtering
  KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
  KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
  KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
  KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
  KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
  KVM: x86: Trace all APICv inhibit changes and capture overall status
  KVM: x86: Add wrappers for setting/clearing APICv inhibits
  KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
  KVM: X86: Handle implicit supervisor access with SMAP
  KVM: X86: Rename variable smap to not_smap in permission_fault()
  ...

modpost: restore the warning message for missing symbol versions

This log message was accidentally chopped off.

I was wondering why this happened, but checking the ML log, Mark
precisely followed my suggestion [1].

I just used "..." because I was too lazy to type the sentence fully.
Sorry for the confusion.

[1]: https://lore.kernel.org/all/CAK7LNAR6bXXk9-ZzZYpTqzFqdYbQsZHmiWspu27rtsFxvfRuVA@mail.gmail.com/

Fixes: 4a6795933a89 ("kbuild: modpost: Explicitly warn about unprototyped symbols")
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Mark Brown <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>

Merge tag 'for-5.18/drivers-2022-04-02' of git://git.kernel.dk/linux-block

Pull block driver fix from Jens Axboe:
"Got two reports on nbd spewing warnings on load now, which is a
  regression from a commit that went into your tree yesterday.

  Revert the problematic change for now"

* tag 'for-5.18/drivers-2022-04-02' of git://git.kernel.dk/linux-block:
  Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"

Merge tag 'pci-v5.18-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull pci fix from Bjorn Helgaas:

- Fix Hyper-V "defined but not used" build issue added during merge
window (YueHaibing)

* tag 'pci-v5.18-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: hv: Remove unused hv_set_msi_entry_from_desc()

Merge tag 'tag-chrome-platform-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux

Pull chrome platform updates from Benson Leung:
"cros_ec_typec:

   - Check for EC device - Fix a crash when using the cros_ec_typec
     driver on older hardware not capable of typec commands

   - Make try power role optional

   - Mux configuration reorganization series from Prashant

  cros_ec_debugfs:

   - Fix use after free. Thanks Tzung-bi

  sensorhub:

   - cros_ec_sensorhub fixup - Split trace include file

  misc:

   - Add new mailing list for chrome-platform development:

[email protected]

     Now with patchwork!"

* tag 'tag-chrome-platform-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
  platform/chrome: cros_ec_debugfs: detach log reader wq from devm
  platform: chrome: Split trace include file
  platform/chrome: cros_ec_typec: Update mux flags during partner removal
  platform/chrome: cros_ec_typec: Configure muxes at start of port update
  platform/chrome: cros_ec_typec: Get mux state inside configure_mux
  platform/chrome: cros_ec_typec: Move mux flag checks
  platform/chrome: cros_ec_typec: Check for EC device
  platform/chrome: cros_ec_typec: Make try power role optional
  MAINTAINERS: platform-chrome: Add new [email protected] list

Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"

This reverts commit 6d35d04a9e18990040e87d2bbf72689252669d54.

Both Gabriel and Borislav report that this commit casues a regression
with nbd:

sysfs: cannot create duplicate filename '/dev/block/43:0'

Revert it before 5.18-rc1 and we'll investigage this separately in
due time.

Link: https://lore.kernel.org/all/[email protected]/
Reported-by: Gabriel L. Somlo <[email protected]>
Reported-by: Borislav Petkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

watch_queue: Free the page array when watch_queue is dismantled

Commit 7ea1a0124b6d ("watch_queue: Free the alloc bitmap when the
watch_queue is torn down") took care of the bitmap, but not the page
array.

  BUG: memory leak
  unreferenced object 0xffff88810d9bc140 (size 32):
  comm "syz-executor335", pid 3603, jiffies 4294946994 (age 12.840s)
  hex dump (first 32 bytes):
    40 a7 40 04 00 ea ff ff 00 00 00 00 00 00 00 00  @.@.............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     kmalloc_array include/linux/slab.h:621 [inline]
     kcalloc include/linux/slab.h:652 [inline]
     watch_queue_set_size+0x12f/0x2e0 kernel/watch_queue.c:251
     pipe_ioctl+0x82/0x140 fs/pipe.c:632
     vfs_ioctl fs/ioctl.c:51 [inline]
     __do_sys_ioctl fs/ioctl.c:874 [inline]
     __se_sys_ioctl fs/ioctl.c:860 [inline]
     __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]

Reported-by: [email protected]
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Linus Torvalds <[email protected]>

tracing: mark user_events as BROKEN

After being merged, user_events become more visible to a wider audience
that have concerns with the current API.

It is too late to fix this for this release, but instead of a full
revert, just mark it as BROKEN (which prevents it from being selected in
make config). Then we can work finding a better API. If that fails,
then it will need to be completely reverted.

To not have the code silently bitrot, still allow building it with
COMPILE_TEST.

And to prevent the uapi header from being installed, then later changed,
and then have an old distro user space see the old version, move the
header file out of the uapi directory.

Surround the include with CONFIG_COMPILE_TEST to the current location,
but when the BROKEN tag is taken off, it will use the uapi directory,
and fail to compile. This is a good way to remind us to move the header
back.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tracing: Move user_events.h temporarily out of include/uapi

While user_events API is under development and has been marked for broken
to not let the API become fixed, move the header file out of the uapi
directory. This is to prevent it from being installed, then later changed,
and then have an old distro user space update with a new kernel, where
applications see the user_events being available, but the old header is in
place, and then they get compiled incorrectly.

Also, surround the include with CONFIG_COMPILE_TEST to the current
location, but when the BROKEN tag is taken off, it will use the uapi
directory, and fail to compile. This is a good way to remind us to move
the header back.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

ftrace: Make ftrace_graph_is_dead() a static branch

ftrace_graph_is_dead() is used on hot paths, it just reads a variable
in memory and is not worth suffering function call constraints.

For instance, at entry of prepare_ftrace_return(), inlining it avoids
saving prepare_ftrace_return() parameters to stack and restoring them
after calling ftrace_graph_is_dead().

While at it using a static branch is even more performant and is
rather well adapted considering that the returned value will almost
never change.

Inline ftrace_graph_is_dead() and replace 'kill_ftrace_graph' bool
by a static branch.

The performance improvement is noticeable.

Link: https://lkml.kernel.org/r/e0411a6a0ed3eafff0ad2bc9cd4b0e202b4617df.1648623570.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing: Set user_events to BROKEN

After being merged, user_events become more visible to a wider audience
that have concerns with the current API. It is too late to fix this for
this release, but instead of a full revert, just mark it as BROKEN (which
prevents it from being selected in make config). Then we can work finding
a better API. If that fails, then it will need to be completely reverted.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing/user_events: Remove eBPF interfaces

Remove eBPF interfaces within user_events to ensure they are fully
reviewed.

Link: https://lore.kernel.org/all/20220329165718.GA10381@kbox/
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing/user_events: Hold event_mutex during dyn_event_add

Make sure the event_mutex is properly held during dyn_event_add call.
This is required when adding dynamic events.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

proc: bootconfig: Add null pointer check

kzalloc is a memory allocation function which can return NULL when some
internal memory errors happen. It is safer to add null pointer check.

Link: https://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Acked-by: Masami Hiramatsu <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: Lv Ruyi <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing: Rename the staging files for trace_events

When looking for implementation of different phases of the creation of the
TRACE_EVENT() macro, it is pretty useless when all helper macro
redefinitions are in files labeled "stageX_defines.h". Rename them to
state which phase the files are for. For instance, when looking for the
defines that are used to create the event fields, seeing
"stage4_event_fields.h" gives the developer a good idea that the defines
are in that file.

Signed-off-by: Steven Rostedt (Google) <[email protected]>

KVM: x86: fix sending PV IPI

If apic_id is less than min, and (max - apic_id) is greater than
KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
the new apic_id does not fit the bitmask. In this case __send_ipi_mask
should send the IPI.

This is mostly theoretical, but it can happen if the apic_ids on three
iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.

Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest")
Signed-off-by: Li RongQing <[email protected]>
Message-Id: <1646814944 [email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: do compare-and-exchange of gPTE via the user address

FNAME(cmpxchg_gpte) is an inefficient mess.  It is at least decent if it
can go through get_user_pages_fast(), but if it cannot then it tries to
use memremap(); that is not just terribly slow, it is also wrong because
it assumes that the VM_PFNMAP VMA is contiguous.

The right way to do it would be to do the same thing as
hva_to_pfn_remapped() does since commit add6a0cd1c5b ("KVM: MMU: try to
fix up page faults before giving up", 2016-07-05), using follow_pte()
and fixup_user_fault() to determine the correct address to use for
memremap().  To do this, one could for example extract hva_to_pfn()
for use outside virt/kvm/kvm_main.c.  But really there is no reason to
do that either, because there is already a perfectly valid address to
do the cmpxchg() on, only it is a userspace address.  That means doing
user_access_begin()/user_access_end() and writing the code in assembly
to handle exceptions correctly.  Worse, the guest PTE can be 8-byte
even on i686 so there is the extra complication of using cmpxchg8b to
account for.  But at least it is an efficient mess.

(Thanks to Linus for suggesting improvement on the inline assembly).

Reported-by: Qiuhao Li <[email protected]>
Reported-by: Gaoning Pan <[email protected]>
Reported-by: Yongkang Jia <[email protected]>
Reported-by: [email protected]
Debugged-by: Tadeusz Struk <[email protected]>
Tested-by: Maxim Levitsky <[email protected]>
Cc: [email protected]
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Remove redundant vm_entry_controls_clearbit() call

When emulating exit from long mode, EFER_LMA is cleared with
vmx_set_efer(). This will already unset the VM_ENTRY_IA32E_MODE control
bit as requested by SDM, so there is no need to unset VM_ENTRY_IA32E_MODE
again in exit_lmode() explicitly. In case EFER isn't supported by
hardware, long mode isn't supported, so exit_lmode() cannot be reached.

Note that, thanks to the shadow controls mechanism, this change doesn't
eliminate vmread or vmwrite.

Signed-off-by: Zhenzhong Duan <[email protected]>
Message-Id: <20220311102643 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: cleanup enter_rmode()

vmx_set_efer() sets uret->data but, in fact if the value of uret->data
will be used vmx_setup_uret_msrs() will have rewritten it with the value
returned by update_transition_efer(). uret->data is consumed if and only
if uret->load_into_hardware is true, and vmx_setup_uret_msrs() takes care
of (a) updating uret->data before setting uret->load_into_hardware to true
(b) setting uret->load_into_hardware to false if uret->data isn't updated.

Opportunistically use "vmx" directly instead of redoing to_vmx().

Signed-off-by: Zhenzhong Duan <[email protected]>
Message-Id: <20220311102643 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: SVM: fix tsc scaling when the host doesn't support it

It was decided that when TSC scaling is not supported,
the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
value.

However in this case kvm_max_tsc_scaling_ratio is not set,
which breaks various assumptions.

Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
host support. For consistency, do the same for VMX.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20220322172449 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: SVM: remove unused defines

Remove some unused #defines from svm.c

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20220322172449 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: SVM: move tsc ratio definitions to svm.h

Another piece of SVM spec which should be in the header file

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20220322172449 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: SVM: fix avic spec based definitions again

Due to wrong rebase, commit
4a204f7895878 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255")

moved avic spec #defines back to avic.c.

Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12
bits as well (it will be used in nested avic)

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20220322172449 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MIPS: remove reference to trap&emulate virtualization

Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220313140522.1307751 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: document limitations of MSR filtering

MSR filtering requires an exit to userspace that is hard to implement and
would be very slow in the case of nested VMX vmexit and vmentry MSR
accesses. Document the limitation.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr

If MSR access is rejected by MSR filtering,
kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED,
and the return value is only handled well for rdmsr/wrmsr.
However, some instruction emulation and state transition also
use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger
some unexpected results if MSR access is rejected, E.g. RDPID
emulation would inject a #UD but RDPID wouldn't cause a exit
when RDPID is supported in hardware and ENABLE_RDTSCP is set.
And it would also cause failure when load MSR at nested entry/exit.
Since msr filtering is based on MSR bitmap, it is better to only
do MSR filtering for rdmsr/wrmsr.

Signed-off-by: Hou Wenlong <[email protected]>
Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/emulator: Emulate RDPID only if it is enabled in guest

When RDTSCP is supported but RDPID is not supported in host,
RDPID emulation is available. However, __kvm_get_msr() would
only fail when RDTSCP/RDPID both are disabled in guest, so
the emulator wouldn't inject a #UD when RDPID is disabled but
RDTSCP is enabled in guest.

Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID")
Signed-off-by: Hou Wenlong <[email protected]>
Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/pmu: Fix and isolate TSX-specific performance event logic

HSW_IN_TX* bits are used in generic code which are not supported on
AMD. Worse, these bits overlap with AMD EventSelect[11:8] and hence
using HSW_IN_TX* bits unconditionally in generic code is resulting in
unintentional pmu behavior on AMD. For example, if EventSelect[11:8]
is 0x2, pmc_reprogram_counter() wrongly assumes that
HSW_IN_TX_CHECKPOINTED is set and thus forces sampling period to be 0.

Also per the SDM, both bits 32 and 33 "may only be set if the processor
supports HLE or RTM" and for "IN_TXCP (bit 33): this bit may only be set
for IA32_PERFEVTSEL2."

Opportunistically eliminate code redundancy, because if the HSW_IN_TX*
bit is set in pmc->eventsel, it is already set in attr.config.

Reported-by: Ravi Bangoria <[email protected]>
Reported-by: Jim Mattson <[email protected]>
Fixes: 103af0a98788 ("perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5")
Co-developed-by: Ravi Bangoria <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <20220309084257 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set

It makes more sense to print new SPTE value than the
old value.

Signed-off-by: Maxim Levitsky <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <20220302102457 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs

AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some
reserved bits are cleared, and some are not. Specifically, on
Zen3/Milan, bits 19 and 42 are not cleared.

When emulating such a WRMSR, KVM should not synthesize a #GP,
regardless of which bits are set. However, undocumented bits should
not be passed through to the hardware MSR. So, rather than checking
for reserved bits and synthesizing a #GP, just clear the reserved
bits.

This may seem pedantic, but since KVM currently does not support the
"Host/Guest Only" bits (41:40), it is necessary to clear these bits
rather than synthesizing #GP, because some popular guests (e.g Linux)
will set the "Host Only" bit even on CPUs that don't support
EFER.SVME, and they don't expect a #GP.

For example,

root@Ubuntu1804:~# perf stat -e r26 -a sleep 1

Performance counter stats for 'system wide':

                 0      r26

       1.001070977 seconds time elapsed

Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30)
Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379958] Call Trace:
Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379963]  amd_pmu_disable_event+0x27/0x90

Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Lotus Fenn <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Like Xu <[email protected]>
Reviewed-by: David Dunn <[email protected]>
Message-Id: <20220226234131.2167175 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Trace all APICv inhibit changes and capture overall status

Trace all APICv inhibit changes instead of just those that result in
APICv being (un)inhibited, and log the current state. Debugging why
APICv isn't working is frustrating as it's hard to see why APICv is still
inhibited, and logging only the first inhibition means unnecessary onion
peeling.

Opportunistically drop the export of the tracepoint, it is not and should
not be used by vendor code due to the need to serialize toggling via
apicv_update_lock.

Note, using the common flow means kvm_apicv_init() switched from atomic
to non-atomic bitwise operations. The VM is unreachable at init, so
non-atomic is perfectly ok.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220311043517 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Add wrappers for setting/clearing APICv inhibits

Add set/clear wrappers for toggling APICv inhibits to make the call sites
more readable, and opportunistically rename the inner helpers to align
with the new wrappers and to make them more readable as well. Invert the
flag from "activate" to "set"; activate is painfully ambiguous as it's
not obvious if the inhibit is being activated, or if APICv is being
activated, in which case the inhibit is being deactivated.

For the functions that take @set, swap the order of the inhibit reason
and @set so that the call sites are visually similar to those that bounce
through the wrapper.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220311043517 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Make APICv inhibit reasons an enum and cleanup naming

Use an enum for the APICv inhibit reasons, there is no meaning behind
their values and they most definitely are not "unsigned longs". Rename
the various params to "reason" for consistency and clarity (inhibit may
be confused as a command, i.e. inhibit APICv, instead of the reason that
is getting toggled/checked).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220311043517 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Handle implicit supervisor access with SMAP

There are two kinds of implicit supervisor access
implicit supervisor access when CPL = 3
implicit supervisor access when CPL < 3

Current permission_fault() handles only the first kind for SMAP.

But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.

So the second kind should be also supported.

The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.

But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.

The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.

This patch removes the call to ->get_cpl() for access mode is determined
by @access.  Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP.  The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it.  The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20220311070346 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Rename variable smap to not_smap in permission_fault()

Comments above the variable says the bit is set when SMAP is overridden
or the same meaning in update_permission_bitmask(): it is not subjected
to SMAP restriction.

Renaming it to reflect the negative implication and make the code better
readability.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20220311070346 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Fix comments in update_permission_bitmask

The commit 09f037aa48f3 ("KVM: MMU: speedup update_permission_bitmask")
refactored the code of update_permission_bitmask() and change the
comments. It added a condition into a list to match the new code,
so the number/order for conditions in the comments should be updated
too.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20220311070346 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Change the type of access u32 to u64

Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().

The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.

And SMAP relies a factor for supervisor access: explicit or implicit.

So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.

Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
o Extra bits will be in the higher 32 bits, so that we can
  easily obtain the traditional access mode (UWX) by converting
  it to u32.
o Reuse the value for the access kind defined by SVM's nested
  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
  @error_code in kvm_handle_page_fault().

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20220311070346 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>