lib/Kconfig.debug: Restrict DEBUG_INFO_SPLIT for RISC-V
When building for ARCH=riscv using LLVM < 14, there is an error with
CONFIG_DEBUG_INFO_SPLIT=y:
error: A dwo section may not contain relocations
This was worked around in LLVM 15 by disallowing '-gsplit-dwarf' with
'-mrelax' (the default), so CONFIG_DEBUG_INFO_SPLIT is not selectable
with newer versions of LLVM:
$ clang --target=riscv64-linux-gnu -gsplit-dwarf -c -o /dev/null -x c /dev/null
clang: error: -gsplit-dwarf is unsupported with RISC-V linker relaxation (-mrelax)
GCC silently had a similar issue that was resolved with GCC 12.x.
Restrict CONFIG_DEBUG_INFO_SPLIT for RISC-V when using LLVM (any
version) or GCC older than 12.x, to avoid these known issues.
Make sv48 the default address space for mmap, as some applications
currently depend on this assumption. Users can now select a desired
address space by passing a non-zero hint address to mmap. Previously,
requesting the default address space from mmap by passing zero as the
hint address would result in using the largest address space possible.
Some applications, such as Go and Java, depend on empty upper bits in
the virtual address space, so this patch provides more flexibility for
application developers.
* b4-shazam-merge:
RISC-V: mm: Document mmap changes
RISC-V: mm: Update pgtable comment documentation
RISC-V: mm: Add tests for RISC-V mm
RISC-V: mm: Restrict address space for sv39,sv48,sv57
Currently, riscv defines ARCH_DMA_MINALIGN as L1_CACHE_BYTES, i.e.
64 bytes, if CONFIG_RISCV_DMA_NONCOHERENT=y. To support a unified
kernel image, we usually have to enable CONFIG_RISCV_DMA_NONCOHERENT,
which has some bad effects on coherent platforms:
Firstly, it wastes memory: the kmalloc-96, kmalloc-32, kmalloc-16 and
kmalloc-8 slab caches don't exist any more; they are replaced with
either kmalloc-128 or kmalloc-64.
Secondly, larger-than-necessary kmalloc alignment results in
unnecessary cache/TLB pressure.
This issue also exists on arm64 platforms. Since last year, Catalin
has been working to solve it by decoupling ARCH_KMALLOC_MINALIGN from
ARCH_DMA_MINALIGN, limiting the kmalloc() minimum alignment to
dma_get_cache_alignment() and replacing ARCH_KMALLOC_MINALIGN usage
in various drivers with ARCH_DMA_MINALIGN, etc. [1]
One fact we can make use of on riscv: if the CPU supports neither
ZICBOM nor T-HEAD CMO, we know the platform is coherent. Based on
Catalin's work and the above fact, we can easily solve the kmalloc
alignment issue for riscv: override dma_get_cache_alignment() so that
it returns ARCH_DMA_MINALIGN initially and 1 once we know the
underlying HW supports neither ZICBOM nor T-HEAD CMO.
What if the CPU supports ZICBOM or T-HEAD CMO, but all the devices
are dma coherent? Well, we then keep ARCH_DMA_MINALIGN as the kmalloc
minimum alignment; nothing changes in this case. This case can be
improved in the future once we see such platforms in mainline.
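A minimal sketch of the idea, assuming a dma_cache_alignment variable
maintained by the noncoherent-DMA setup code (names and placement here
are illustrative, not necessarily the final patch):

  extern int dma_cache_alignment; /* ARCH_DMA_MINALIGN until probing proves coherence */

  #define dma_get_cache_alignment dma_get_cache_alignment
  static inline int dma_get_cache_alignment(void)
  {
  	/* drops to 1 once we know the HW supports neither ZICBOM nor T-HEAD CMO */
  	return dma_cache_alignment;
  }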
After this patch, a simple test of booting to a small buildroot rootfs
on qemu shows:
So we save about 1268KB of memory. The savings will be much larger in
a normal OS environment on real HW platforms.
Patch 1 allows kmalloc() caches aligned to the smallest value.
Patch 2 enables DMA_BOUNCE_UNALIGNED_KMALLOC.
After this series:
As for coherent platforms, i.e. !ZICBOM and !THEAD_CMO, the
kmalloc-{8,16,32,96} caches come back on both RV32 and RV64.
As for noncoherent RV32 platforms, nothing changes.
As for noncoherent RV64 platforms, i.e. either ZICBOM or THEAD_CMO,
the above kmalloc caches also come back if there is more than 4GB of
memory, or if users pass "swiotlb=mmnn,force" to force swiotlb
creation on systems with 4GB or less. What mmnn should be depends on
the specific platform; it needs to be tried and tested across all
possible usage cases on the specific hardware. For example, I can use
the minimal number of I/O TLB slabs on the Sipeed M1S Dock.
* b4-shazam-merge:
riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent
riscv: allow kmalloc() caches aligned to the smallest value
Currently, each architecture can support PREEMPT_DYNAMIC through
either static calls or static keys. To support PREEMPT_DYNAMIC on
riscv, we face three choices:
1. only add static calls support to riscv
As Mark pointed out in commit 99cf983cc8bc ("sched/preempt: Add
PREEMPT_DYNAMIC using static keys"), static keys "...should have
slightly lower overhead than non-inline static calls, as this
effectively inlines each trampoline into the start of its callee. This
may avoid redundant work, and may integrate better with CFI schemes."
So even if we add static calls (without inline static calls) to riscv,
static keys are still a better choice.
2. add static calls and inline static calls to riscv
Per my understanding, inline static calls require objtool support,
which is not easy.
3. use static keys
While riscv doesn't have static calls support, it supports static keys
perfectly. So this patch selects HAVE_PREEMPT_DYNAMIC_KEY to enable
support for PREEMPT_DYNAMIC on riscv, so that the preemption model can
be chosen at boot time. It also patches asm-generic/preempt.h, mainly
to add __preempt_schedule() and __preempt_schedule_notrace() macros
for the PREEMPT_DYNAMIC case. Other architectures that use the generic
preempt.h can also benefit from this patch by simply selecting
HAVE_PREEMPT_DYNAMIC_KEY to enable PREEMPT_DYNAMIC, provided they
support static keys.
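A rough sketch of the asm-generic/preempt.h change (simplified):

  #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
  void dynamic_preempt_schedule(void);
  void dynamic_preempt_schedule_notrace(void);
  #define __preempt_schedule()		dynamic_preempt_schedule()
  #define __preempt_schedule_notrace()	dynamic_preempt_schedule_notrace()
  #else
  #define __preempt_schedule()		preempt_schedule()
  #define __preempt_schedule_notrace()	preempt_schedule_notrace()
  #endif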
The following changes add the ability to run ELF format binaries when
running RISC-V in nommu mode. That support is actually part of the
ELF-FDPIC loader, so these changes are all about making that work on
RISC-V.
The first issue to deal with is making the ELF-FDPIC loader capable of
handling 64-bit ELF files. As coded right now it only supports 32-bit
ELF files.
Secondly some changes are required to enable and compile the ELF-FDPIC
loader on RISC-V and to pass the ELF-FDPIC mapping addresses through to
user space when execing the new program.
These changes have not been used to run actual ELF-FDPIC binaries.
They have been used to load and run normal ELF binaries compiled as
-pie. The underlying changes are, however, expected to work with full
ELF-FDPIC binaries if or when that is supported on RISC-V in gcc.
To avoid needing changes to the C-library (tested with uClibc-ng
currently) there is a simple runtime dynamic loader (interpreter)
available to do the final relocations, https://github.com/gregungerer/uldso.
The nice thing about doing it this way is that the same program
binary can also be loaded with the usual ELF loader in MMU linux.
The motivation here is to provide an easy-to-use alternative to the
flat format binaries normally used for RISC-V nommu based systems.
* b4-shazam-merge:
riscv: support the elf-fdpic binfmt loader
binfmt_elf_fdpic: support 64-bit systems
This series adds KCFI support for RISC-V. KCFI is a fine-grained
forward-edge control-flow integrity scheme supported in Clang >= 16,
which ensures indirect calls in instrumented code can only branch to
functions whose type matches the function pointer type, thus making
code reuse attacks more difficult.
Patch 1 implements a pt_regs based syscall wrapper to address
function pointer type mismatches in syscall handling. Patches 2 and 3
annotate indirectly called assembly functions with CFI types. Patch 4
implements error handling for indirect call checks. Patch 5 disables
CFI for arch/riscv/purgatory. Patch 6 finally allows CONFIG_CFI_CLANG
to be enabled for RISC-V.
Note that Clang 16 has a generic architecture-agnostic KCFI
implementation, which does work with the kernel, but doesn't produce
a stable code sequence for indirect call checks, which means
potential failures just trap and won't result in informative error
messages. Clang 17 includes a RISC-V specific back-end implementation
for KCFI, which emits a predictable code sequence for the checks and a
.kcfi_traps section with the locations of the traps, which patch 4
uses to produce more useful errors.
The type mismatch fixes and annotations in the first three patches
also become necessary in the future if the kernel decides to support
fine-grained CFI implemented using the hardware landing pad
feature proposed in the in-progress Zicfisslp extension. Once the
specification is ratified and hardware support emerges, implementing
runtime patching support that replaces KCFI instrumentation with
Zicfisslp landing pads might also be feasible (similar to the KCFI to
FineIBT patching on x86_64), allowing distributions to ship a unified
kernel binary for all devices.
* b4-shazam-merge:
riscv: Allow CONFIG_CFI_CLANG to be selected
riscv/purgatory: Disable CFI
riscv: Add CFI error handling
riscv: Add ftrace_stub_graph
riscv: Add types to indirectly called assembly functions
riscv: Implement syscall wrappers
On riscv, the current crash kernel allocation logic tries to allocate
within the 32-bit addressable memory region by default, and if that
fails, it retries without the 4G restriction.
To save DMA zone memory when allocating a relatively large crash
kernel region, allocating the reserved memory top-down in high memory,
without overlapping the DMA zone, is a mature solution.
Hence this patchset introduces the parameter option crashkernel=X,[high,low].
One can reserve the crash kernel from high memory above the DMA zone
range by explicitly passing "crashkernel=X,high", or reserve a memory
range below 4G with "crashkernel=X,low". Besides, there are a few
rules to take note of:
1. "crashkernel=X,[high,low]" will be ignored if "crashkernel=size"
is specified.
2. "crashkernel=X,low" is valid only when "crashkernel=X,high" is passed
and there is enough memory to be allocated under 4G.
3. When allocating the crashkernel above 4G with no "crashkernel=X,low"
specified, 128M of low memory will be allocated automatically for the
swiotlb bounce buffer.
See Documentation/admin-guide/kernel-parameters.txt for more information.
To verify loading the crashkernel, an adapted kexec-tools is available below:
https://github.com/chenjh005/kexec-tools/tree/build-test-riscv-v2
The following test cases have been performed and behaved as expected:
1) crashkernel=256M //low=256M
2) crashkernel=1G //low=1G
3) crashkernel=4G //high=4G, low=128M(default)
4) crashkernel=4G crashkernel=256M,high //high=4G, low=128M(default), high is ignored
5) crashkernel=4G crashkernel=256M,low //high=4G, low=128M(default), low is ignored
6) crashkernel=4G,high //high=4G, low=128M(default)
7) crashkernel=256M,low //low=0M, invalid
8) crashkernel=4G,high crashkernel=256M,low //high=4G, low=256M
9) crashkernel=4G,high crashkernel=4G,low //high=0M, low=0M, invalid
10) crashkernel=512M@0xd0000000 //low=512M
11) crashkernel=1G,high crashkernel=0M,low //high=1G, low=0M
* b4-shazam-merge:
docs: kdump: Update the crashkernel description for riscv
riscv: kdump: Implement crashkernel=X,[high,low]
Allow forcing all function addresses to be 64B aligned, as is possible
on other architectures. This may be useful when verifying whether a
performance bump is caused by function alignment changes.
Before commit 1bf18da62106 ("lib/Kconfig.debug: add ARCH dependency
for FUNCTION_ALIGN option"), riscv supported enabling the
DEBUG_FORCE_FUNCTION_ALIGN_64B option, but after that commit, each
arch needs to claim the support explicitly.
Nam Cao [Tue, 25 Jul 2023 05:38:35 +0000 (07:38 +0200)]
riscv: remove redundant mv instructions
Some mv instructions were useful when first introduced, to preserve a0
and a1 before function calls. However, the code has changed and they
are now redundant. Remove them.
Charlie Jenkins [Wed, 9 Aug 2023 23:22:02 +0000 (16:22 -0700)]
RISC-V: mm: Add tests for RISC-V mm
Add tests that enforce mmap hint address behavior. mmap should default
to sv48. mmap will provide an address in the highest address space that
can fit the hint address, unless the hint address is below the sv39
limit and non-zero, in which case an sv39 address will be returned.
These tests are split into two files: mmap_default.c and mmap_bottomup.c
because a new process must be exec'd in order to change the mmap layout.
The run_mmap.sh script sets the stack to be unlimited for the
mmap_bottomup.c test which triggers a bottomup layout.
Charlie Jenkins [Wed, 9 Aug 2023 23:22:01 +0000 (16:22 -0700)]
RISC-V: mm: Restrict address space for sv39,sv48,sv57
Make sv48 the default address space for mmap, as some applications
currently depend on this assumption. A hint address passed to mmap will
cause the largest address space that fits entirely into the hint to be
used. If the hint is less than or equal to 1<<38, an sv39 address will
be used. As an exception, if the hint address is 0, an sv48 address
will be used. After an address space is completely full, the next
smallest address space will be used.
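For illustration, a userspace sketch of the resulting behaviour (the
constants mark the sv39 boundary described above; error handling
omitted):

  #include <sys/mman.h>

  void example(void)
  {
  	/* zero hint: the default sv48 address space is used */
  	void *def = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
  			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  	/* non-zero hint below 1UL << 38: an sv39 address is returned */
  	void *lo = mmap((void *)(1UL << 37), 4096, PROT_READ | PROT_WRITE,
  			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }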
riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent
With the DMA bouncing of unaligned kmalloc() buffers now in place,
enable it for riscv when RISCV_DMA_NONCOHERENT=y to allow the
kmalloc-{8,16,32,96} caches. Since RV32 doesn't enable SWIOTLB yet,
and I didn't see any DMA-noncoherent RV32 platforms in mainline, skip
RV32 for now by only enabling DMA_BOUNCE_UNALIGNED_KMALLOC when
SWIOTLB is available. Once we see such a requirement on RV32, we can
enable it then.
NOTE: we don't force creation of the swiotlb buffer even when the
end of RAM is within the 32-bit physical address range. That is to
say:
For RV64 with > 4GB memory, the feature is enabled.
For RV64 with <= 4GB memory, the feature isn't enabled by default. We
rely on users to pass "swiotlb=mmnn,force", where mmnn is the number
of I/O TLB slabs; see kernel-parameters.txt for details.
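For example, a command line entry along these lines (the slab count is
purely illustrative and must be tuned per platform) forces swiotlb
creation on a <= 4GB system:

  swiotlb=1024,force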
Tested on Sipeed Lichee Pi 4A with 8GB DDR and Sipeed M1S BL808 Dock
board.
riscv: allow kmalloc() caches aligned to the smallest value
Currently, riscv defines ARCH_DMA_MINALIGN as L1_CACHE_BYTES, i.e.
64 bytes, if CONFIG_RISCV_DMA_NONCOHERENT=y. To support a unified
kernel image, we usually have to enable CONFIG_RISCV_DMA_NONCOHERENT,
which has some bad effects on coherent platforms:
Firstly, it wastes memory: the kmalloc-96, kmalloc-32, kmalloc-16 and
kmalloc-8 slab caches don't exist any more; they are replaced with
either kmalloc-128 or kmalloc-64.
Secondly, larger-than-necessary kmalloc alignment results in
unnecessary cache/TLB pressure.
This issue also exists on arm64 platforms. Since last year, Catalin
has been working to solve it by decoupling ARCH_KMALLOC_MINALIGN from
ARCH_DMA_MINALIGN, limiting the kmalloc() minimum alignment to
dma_get_cache_alignment() and replacing ARCH_KMALLOC_MINALIGN usage
in various drivers with ARCH_DMA_MINALIGN, etc. [1]
One fact we can make use of on riscv: if the CPU supports neither
ZICBOM nor T-HEAD CMO, we know the platform is coherent. Based on
Catalin's work and the above fact, we can easily solve the kmalloc
alignment issue for riscv: override dma_get_cache_alignment() so that
it returns ARCH_DMA_MINALIGN initially and 1 once we know the
underlying HW supports neither ZICBOM nor T-HEAD CMO.
What if the CPU supports ZICBOM or T-HEAD CMO, but all the devices
are dma coherent? Well, we then keep ARCH_DMA_MINALIGN as the kmalloc
minimum alignment; nothing changes in this case. This case can be
improved in the future.
After this patch, a simple test of booting to a small buildroot rootfs
on qemu shows:
Add support for enabling and using the binfmt_elf_fdpic program loader
on RISC-V platforms. The most important change is to set up registers
during program load to pass the mapping addresses to the new process.
One of the interesting features of the elf-fdpic loader is that it
also allows appropriately compiled ELF format binaries to be loaded on
nommu systems; "appropriately" here means those compiled with -pie.
The binfmt_elf_fdpic code has a number of 32-bit specific data
structures associated with it. Extend it to be able to support and
be used on 64-bit systems as well.
The new code defines a number of key 64-bit variants of the core
elf-fdpic data structures, alongside the existing 32-bit sized ones.
A common set of generically named structures is defined to be either
the 32-bit or 64-bit ones as required at compile time. This is a
similar technique to that used in the ELF binfmt loader.
For example:
elf_fdpic_loadseg is either elf32_fdpic_loadseg or elf64_fdpic_loadseg
elf_fdpic_loadmap is either elf32_fdpic_loadmap or elf64_fdpic_loadmap
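A sketch of the technique (the 64-bit field types shown here are
assumptions; the actual layout follows the patch):

  struct elf32_fdpic_loadseg {
  	Elf32_Addr addr;	/* core address to which mapped */
  	Elf32_Addr p_vaddr;	/* VMA recorded in file */
  	Elf32_Word p_memsz;	/* allocation size recorded in file */
  };

  struct elf64_fdpic_loadseg {
  	Elf64_Addr addr;
  	Elf64_Addr p_vaddr;
  	Elf64_Xword p_memsz;
  };

  #ifdef CONFIG_64BIT
  #define elf_fdpic_loadseg elf64_fdpic_loadseg
  #else
  #define elf_fdpic_loadseg elf32_fdpic_loadseg
  #endif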
Sami Tolvanen [Mon, 10 Jul 2023 18:35:49 +0000 (18:35 +0000)]
riscv: Add CFI error handling
With CONFIG_CFI_CLANG, the compiler injects a type preamble immediately
before each function and a check to validate the target function type
before indirect calls:
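(The emitted assembly sequence is elided here; conceptually, in C
terms, each indirect call becomes something like the following, with
the hash value purely illustrative:)

  u32 expected = 0x7a935021;		/* hash of the function pointer's type */
  u32 actual = *((u32 *)target - 1);	/* type preamble emitted before the callee */

  if (actual != expected)
  	__builtin_trap();		/* trap location recorded in .kcfi_traps */
  target(args);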
Sami Tolvanen [Mon, 10 Jul 2023 18:35:48 +0000 (18:35 +0000)]
riscv: Add ftrace_stub_graph
Commit 883bbbffa5a4 ("ftrace,kcfi: Separate ftrace_stub() and
ftrace_stub_graph()") added a separate ftrace_stub_graph function for
CFI_CLANG. Add the stub to fix FUNCTION_GRAPH_TRACER compatibility
with CFI.
Sami Tolvanen [Mon, 10 Jul 2023 18:35:47 +0000 (18:35 +0000)]
riscv: Add types to indirectly called assembly functions
With CONFIG_CFI_CLANG, assembly functions indirectly called
from C code must be annotated with type identifiers to pass CFI
checking. Use the SYM_TYPED_START macro to add types to the
relevant functions.
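For example, in an assembly file (the function name is illustrative;
the macros come from linux/cfi_types.h):

  #include <linux/cfi_types.h>

  /* before: SYM_FUNC_START(handle_foo) */
  SYM_TYPED_FUNC_START(handle_foo)
  	/* ...function body unchanged... */
  SYM_FUNC_END(handle_foo)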
Sami Tolvanen [Mon, 10 Jul 2023 18:35:46 +0000 (18:35 +0000)]
riscv: Implement syscall wrappers
Commit f0bddf50586d ("riscv: entry: Convert to generic entry") moved
syscall handling to C code, which exposed function pointer type
mismatches that trip fine-grained forward-edge Control-Flow Integrity
(CFI) checks as syscall handlers are all called through the same
syscall_t pointer type. To fix the type mismatches, implement pt_regs
based syscall wrappers similarly to x86 and arm64.
This patch is based on the arm64 syscall wrappers added in commit
4378a7d4be30 ("arm64: implement syscall wrappers"), where the main
goal was to minimize the risk of userspace-controlled values being
used under speculation. This may be a concern for riscv in the future
as well.
Following other architectures, the syscall wrappers generate three
functions for each syscall; __riscv_<compat_>sys_<name> takes a pt_regs
pointer and extracts arguments from registers, __se_<compat_>sys_<name>
is a sign-extension wrapper that casts the long arguments to the
correct types for the real syscall implementation, which is named
__do_<compat_>sys_<name>.
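As a sketch, for a hypothetical two-argument syscall the generated
code looks roughly like this (simplified; the real code comes from the
SYSCALL_DEFINE macro expansion):

  static inline long __do_sys_dup2(unsigned int oldfd, unsigned int newfd)
  {
  	/* the actual syscall body */
  	return 0;
  }

  /* sign-extension wrapper: casts the longs to the declared argument types */
  static long __se_sys_dup2(unsigned long oldfd, unsigned long newfd)
  {
  	return __do_sys_dup2((unsigned int)oldfd, (unsigned int)newfd);
  }

  /* pt_regs wrapper: every handler now shares one function pointer type */
  asmlinkage long __riscv_sys_dup2(const struct pt_regs *regs)
  {
  	return __se_sys_dup2(regs->a0, regs->a1);
  }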
riscv used to allow direct access to the cycle/time/instret counters,
bypassing the perf framework. This patchset intends to allow the user
to mmap any counter when accessed through perf.
**Important**: The default mode is now user access through perf only,
not the legacy mode, so some applications will break. However, we
introduce a sysctl, perf_user_access, like arm64 does, which allows
switching back to the legacy mode described above.
Björn Töpel [Wed, 23 Aug 2023 08:28:45 +0000 (10:28 +0200)]
riscv: Require FRAME_POINTER for some configurations
Some V configurations implicitly turn on '-fno-omit-frame-pointer' but
leave FRAME_POINTER disabled. This makes it hard to reason about the
FRAME_POINTER config, and also triggers build failures introduced by
the commit in the Fixes: tag.
Select FRAME_POINTER explicitly for these configurations.
docs: kdump: Update the crashkernel description for riscv
Now "crashkernel=" parameter on riscv has been updated to support
crashkernel=X,[high,low]. Through which we can reserve memory region
above/within 32bit addressible DMA zone.
Here update the parameter description accordingly.
On riscv, the current crash kernel allocation logic tries to allocate
within the 32-bit addressable memory region by default, and if that
fails, it retries without the 4G restriction.
To save DMA zone memory when allocating a relatively large crash
kernel region, allocating the reserved memory top-down in high memory,
without overlapping the DMA zone, is a mature solution.
Hence, introduce the parameter option crashkernel=X,[high,low].
One can reserve the crash kernel from high memory above the DMA zone
range by explicitly passing "crashkernel=X,high", or reserve a memory
range below 4G with "crashkernel=X,low".
Alexandre Ghiti [Wed, 2 Aug 2023 08:03:25 +0000 (10:03 +0200)]
drivers: perf: Implement perf event mmap support in the SBI backend
We used to unconditionally expose the cycle and instret CSRs to
userspace, which gives rise to security concerns.
So now we only allow access to hw counters from userspace through the
perf framework, which will handle context switches, per-task events,
etc. A sysctl allows reverting the behaviour to the legacy mode so
that userspace applications which are not ready for this change do not
break.
But the default is to allow userspace access only through perf: this
will break userspace applications which rely on direct access to
rdcycle.
This choice was made for security reasons [1][2]: most of the applications
which use rdcycle can instead use rdtime to count the elapsed time.
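For example, something along these lines would re-enable the legacy
behaviour at runtime (the exact value is defined by the series'
documentation; shown here only as an illustration):

  $ echo 2 > /proc/sys/kernel/perf_user_access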
Alexandre Ghiti [Wed, 2 Aug 2023 08:03:23 +0000 (10:03 +0200)]
riscv: Prepare for user-space perf event mmap support
Provide all the necessary bits in the generic riscv pmu driver to be
able to mmap perf events in userspace: the heavy lifting lies in the
driver backend, namely the legacy and sbi implementations.
Note that arch_perf_update_userpage is almost a copy of arm64 code.
Alexandre Ghiti [Wed, 2 Aug 2023 08:03:21 +0000 (10:03 +0200)]
riscv: Make legacy counter enum match the HW numbering
RISCV_PMU_LEGACY_INSTRET used to be set to 1, whereas the offset of
this hardware counter from CSR_CYCLE is actually 2: make this offset
match the real hw offset so that we can directly expose those values
to userspace.
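In other words, roughly (sketch; CSR_CYCLE is 0xC00, CSR_TIME 0xC01
and CSR_INSTRET 0xC02):

  enum {
  	RISCV_PMU_LEGACY_CYCLE   = 0,	/* CSR_CYCLE   - CSR_CYCLE */
  	RISCV_PMU_LEGACY_INSTRET = 2,	/* CSR_INSTRET - CSR_CYCLE, was 1 */
  };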
Alexandre Ghiti [Wed, 2 Aug 2023 08:03:20 +0000 (10:03 +0200)]
include: riscv: Fix wrong include guard in riscv_pmu.h
The current include guard prevents the inclusion of asm/perf_event.h
which uses the same include guard: fix the one in riscv_pmu.h so that it
matches the file name.
Alexandre Ghiti [Wed, 2 Aug 2023 08:03:19 +0000 (10:03 +0200)]
perf: Fix wrong comment about default event_idx
Since commit c719f56092ad ("perf: Fix and clean up initialization of
pmu::event_idx"), the event_idx default implementation has returned 0,
not idx + 1, so fix the comment, which could be misleading.
Based on my latest iteration of deprecating riscv,isa [1], here's an
implementation of the new properties for Linux. The first few patches,
up to "RISC-V: split riscv_fill_hwcap() in 3", are all prep work that
further tames some of the extension related code, on top of my already
applied series that cleans up the ISA string parser.
Perhaps "RISC-V: shunt isa_ext_arr to cpufeature.c" is a bit gratuitous,
but I figured a bit of coalescing of extension related data structures
would be a good idea. Note that riscv,isa will still be used in the
absence of the new properties. Palmer suggested adding a Kconfig option
to turn off the fallback for DT, which I have gone and done. It's locked
behind the NONPORTABLE option for good reason.
* b4-shazam-merge:
RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"
RISC-V: try new extension properties in of_early_processor_hartid()
RISC-V: enable extension detection from dedicated properties
RISC-V: split riscv_fill_hwcap() in 3
RISC-V: add single letter extensions to riscv_isa_ext
RISC-V: add missing single letter extension definitions
RISC-V: repurpose riscv_isa_ext array in riscv_fill_hwcap()
RISC-V: shunt isa_ext_arr to cpufeature.c
RISC-V: drop a needless check in print_isa_ext()
RISC-V: don't parse dt/acpi isa string to get rv32/rv64
RISC-V: Provide a more helpful error message on invalid ISA strings
RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"
As it says on the tin, provide Kconfig and commandline options to
control parsing of the "riscv,isa" devicetree property. With either
option in effect, the kernel will fall back to parsing "riscv,isa"
where "riscv,isa-base" and "riscv,isa-extensions" are not present.
The Kconfig options are set up so that the default kernel
configuration will enable the fallback path, without needing the
commandline option.
RISC-V: try new extension properties in of_early_processor_hartid()
To fully deprecate the kernel's use of "riscv,isa",
of_early_processor_hartid() needs to first try using the new properties,
before falling back to "riscv,isa".
RISC-V: enable extension detection from dedicated properties
Add support for parsing the new riscv,isa-extensions property in
riscv_fill_hwcap(), by means of a new "property" member of the
riscv_isa_ext_data struct. For now, this shadows the name of the
extension for all users, however this may not be the case for all
extensions, based on how the dt-binding is written.
For the sake of backwards compatibility, fall back to the old scheme
if the new properties are not detected. For now, just inform, rather
than warn, when that happens.
Before adding more complexity to it, split riscv_fill_hwcap() into 3
distinct sections:
- riscv_fill_hwcap() still is the top level function, into which the
additional complexity will be added.
- riscv_fill_hwcap_from_isa_string() handles getting the information
from the riscv,isa/ACPI equivalent across harts & the various quirks
there
- riscv_parse_isa_string() does what it says on the tin.
RISC-V: add single letter extensions to riscv_isa_ext
So that riscv_fill_hwcap() can use riscv_isa_ext to probe for single
letter extensions, add them to it.
As a result, what gets spat out in /proc/cpuinfo would become borked,
as single letter extensions would be printed both as part of the base
extensions and while printing from riscv_isa_arr. Take the opportunity
to unify the printing of the isa string, using the new member of
riscv_isa_ext_data in the process.
RISC-V: add missing single letter extension definitions
To facilitate adding single letter extensions to riscv_isa_ext, add
definitions for the extensions present in base_riscv_exts that do not
already have them.
RISC-V: repurpose riscv_isa_ext array in riscv_fill_hwcap()
In riscv_fill_hwcap(), the riscv_isa_ext array can be looped over
rather than duplicating the list of extensions with individual
SET_ISA_EXT_MAP() usage. While at it, drop the statement-of-the-obvious
comments from the struct, rename uprop to something more suitable for
its new use, and constify the members.
To facilitate using one struct to define extensions, rather than
having several, shunt isa_ext_arr to cpufeature.c, where it will also
be used for probing extension presence.
As the scope of the array has widened, prefix it with riscv and drop
the type from the variable name.
Since the new array is const, print_isa() needs a wee bit of cleanup
to avoid complaints about losing the const qualifier.
isa_ext_arr cannot be empty, as some of the extensions within it are
always built into the kernel. When this code was first added, back in
commit a9b202606c69 ("RISC-V: Improve /proc/cpuinfo output for ISA
extensions"), the array was empty and needed a dummy item & thus there
could be no extensions present. When the first multi-letter ones did
get added, it was Sscofpmf - which didn't have a Kconfig symbol to
disable it.
Remove this check, as it has been redundant since Sscofpmf was added.
Guo Ren [Wed, 28 Jun 2023 09:12:13 +0000 (05:12 -0400)]
riscv: sigcontext: Correct the comment of sigreturn
The real-time signals enlarged the sigset_t type, and most
architectures have switched to using rt_sigreturn as the only way;
riscv is one of them, and there is no sys_sigreturn in it. Only some
old architectures preserve sys_sigreturn as part of their historical
burden.
We just sorted the entries and fields last release, so just out of a
perverse sense of curiosity, I decided to see if we can keep things
ordered for even just one release.
The answer is "No. No we cannot".
I suggest that all kernel developers will need weekly training sessions,
involving a lot of Big Bird and Sesame Street. And at the yearly
maintainer summit, we will all sing the alphabet song together.
I doubt I will keep doing this. At some point "perverse sense of
curiosity" turns into just a cold dark place filled with sadness and
despair.
Repeats: 80e62bc8487b ("MAINTAINERS: re-sort all entries and fields")
Signed-off-by: Linus Torvalds <[email protected]>
Merge tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- swiotlb area sizing fixes (Petr Tesarik)
* tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: reduce the number of areas to match actual memory pool size
swiotlb: always set the number of areas before allocating the pool
Merge tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for the mechanism to park CPUs with an INIT IPI.
On shutdown or kexec, the kernel tries to park the non-boot CPUs with
an INIT IPI. But the same code path is also used by the crash utility.
If the CPU which panics is not the boot CPU then it sends an INIT IPI
to the boot CPU which resets the machine.
Prevent this by validating that the CPU which runs the stop mechanism
is the boot CPU. If not, leave the other CPUs in HLT"
* tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/smp: Don't send INIT to boot CPU
Merge tag 'mips_6.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- fixes for KVM
- fix for loongson build and cpu probing
- DT fixes
* tag 'mips_6.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: kvm: Fix build error with KVM_MIPS_DEBUG_COP0_COUNTERS enabled
MIPS: dts: add missing space before {
MIPS: Loongson: Fix build error when make modules_install
MIPS: KVM: Fix NULL pointer dereference
MIPS: Loongson: Fix cpu_probe_loongson() again
Merge tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- fix potential use after free in unmount
- minor cleanup
- add worker to cleanup stale directory leases
* tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add a laundromat thread for cached directories
smb: client: remove redundant pointer 'server'
cifs: fix session state transition to avoid use-after-free issue
Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"16 hotfixes. Six are cc:stable and the remainder address post-6.4
issues"
The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since
it was all hopefully fixed in mainline.
* tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib: dhry: fix sleeping allocations inside non-preemptable section
kasan, slub: fix HW_TAGS zeroing with slub_debug
kasan: fix type cast in memory_is_poisoned_n
mailmap: add entries for Heiko Stuebner
mailmap: update manpage link
bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
MAINTAINERS: add linux-next info
mailmap: add Markus Schneider-Pargmann
writeback: account the number of pages written back
mm: call arch_swap_restore() from do_swap_page()
squashfs: fix cache race with migration
mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison
docs: update ocfs2-devel mailing list address
MAINTAINERS: update ocfs2-devel mailing list address
mm: disable CONFIG_PER_VMA_LOCK until its fixed
fork: lock VMAs of the parent process when forking
fork: lock VMAs of the parent process when forking
When forking a child process, the parent write-protects anonymous pages
and COW-shares them with the child being forked using copy_present_pte().
We must not take any concurrent page faults on the source vmas while
they are being processed, as we expect both the vma and the ptes
behind it to be stable. For example, anon_vma_fork() expects the
parent's vma->anon_vma to not change during the vma copy.
A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and an anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because
the parent vma originally didn't have an anon_vma, but we now might
end up copying a pte entry for a page that has one.
Before the per-vma lock based changes, the mmap_lock guaranteed
exclusion with concurrent page faults. But now we need to do a
vma_start_write() to make sure no concurrent faults happen on this vma
while it is being processed.
This fix can potentially regress some fork-heavy workloads. Kernel
build time did not show noticeable regression on a 56-core machine while
a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
mm: lock newly mapped VMA which can be modified after it becomes visible
mmap_region adds a newly created VMA into the VMA tree and might
modify it afterwards, before dropping the mmap_lock. This poses a
problem for page faults handled under per-VMA locks, because they
don't take the mmap_lock and can stumble on this VMA while it's still
being modified. Currently this does not pose a problem, since
post-addition modifications are done only for file-backed VMAs, which
are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"A few late arriving patches that missed the initial pull request. It's
mostly bug fixes (the dt-bindings is a fix for the initial pull)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Remove unused function declaration
scsi: target: docs: Remove tcm_mod_builder.py
scsi: target: iblock: Quiet bool conversion warning with pr_preempt use
scsi: dt-bindings: ufs: qcom: Fix ICE phandle
scsi: core: Simplify scsi_cdl_check_cmd()
scsi: isci: Fix comment typo
scsi: smartpqi: Replace one-element arrays with flexible-array members
scsi: target: tcmu: Replace strlcpy() with strscpy()
scsi: ncr53c8xx: Replace strlcpy() with strscpy()
scsi: lpfc: Fix lpfc_name struct packing
Merge tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull more i2c updates from Wolfram Sang:
- xiic patch should have been in the original pull but slipped through
- mpc patch fixes a build regression
- nomadik cleanup
* tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mpc: Drop unused variable
i2c: nomadik: Remove a useless call in the remove function
i2c: xiic: Don't try to handle more interrupt events after error
Merge tag 'hardening-v6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Check for NULL bdev in LoadPin (Matthias Kaehlcke)
- Revert unwanted KUnit FORTIFY build default
- Fix 1-element array causing boot warnings with xhci-hub
* tag 'hardening-v6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usb: ch9: Replace bmSublinkSpeedAttr 1-element array with flexible array
Revert "fortify: Allow KUnit test to build without FORTIFY"
dm: verity-loadpin: Add NULL pointer check for 'bdev' parameter
The debugfs_create_dir function returns ERR_PTR in case of error, and
the only correct way to check whether an error occurred is the IS_ERR
inline function. This patch replaces the null comparison with IS_ERR.
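That is, checks of the following form (a minimal sketch; "foo" is a
placeholder name):

  #include <linux/debugfs.h>
  #include <linux/err.h>

  static int example_init(void)
  {
  	struct dentry *dir = debugfs_create_dir("foo", NULL);

  	if (IS_ERR(dir))	/* not: if (!dir) */
  		return PTR_ERR(dir);
  	return 0;
  }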
- Update the AMD IBS event error message, since it now supports
per-process profiling but no privilege filters.
$ sudo perf record -e ibs_op//k -C 0
Error:
AMD IBS doesn't support privilege filtering. Try again without
the privilege modifiers (like 'k') at the end.
- Add --output option to save the data to a file so that it is not
interfered with by other debug messages.
Test:
- Fix event parsing test on ARM, where there's no raw PMU and
PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
- Update the lock contention test case for CSV output.
- Fix a segfault in the daemon command test.
Vendor events (JSON):
- Add has_event() to check if the given event is available on the
system at runtime. On Intel machines, some transaction events may not
be present when TSX extensions are disabled.
- Update Intel event metrics.
Misc:
- Sort symbols by name using an external array of pointers instead of
an rbtree node in the symbol. This saves 16 or 24 bytes per symbol,
whether or not the sorting is actually requested.
- Fix unwinding DWARF callstacks using libdw when --symfs option is
used"
* tag 'perf-tools-for-v6.5-2-2023-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next: (38 commits)
perf test: Fix event parsing test when PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
perf test: Fix event parsing test on Arm
perf evsel amd: Fix IBS error message
perf: unwind: Fix symfs with libdw
perf symbol: Fix uninitialized return value in symbols__find_by_name()
perf test: Test perf lock contention CSV output
perf lock contention: Add --output option
perf lock contention: Add -x option for CSV style output
perf lock: Remove stale comments
perf vendor events intel: Update tigerlake to 1.13
perf vendor events intel: Update skylakex to 1.31
perf vendor events intel: Update skylake to 57
perf vendor events intel: Update sapphirerapids to 1.14
perf vendor events intel: Update icelakex to 1.21
perf vendor events intel: Update icelake to 1.19
perf vendor events intel: Update cascadelakex to 1.19
perf vendor events intel: Update meteorlake to 1.03
perf vendor events intel: Add rocketlake events/metrics
perf vendor metrics intel: Make transaction metrics conditional
perf jevents: Support for has_event function
...
The tests that don't use the expect_eq() macro to determine that a
test has failed must increment failed_tests explicitly.
- lib/bitmap: drop optimization of bitmap_{from,to}_arr64
bitmap_{from,to}_arr64() optimization is overly optimistic
on 32-bit LE architectures when it's wired to
bitmap_copy_clear_tail().
- nodemask: Drop duplicate check in for_each_node_mask()
As the return value type of first_node() became unsigned, the
node >= 0 check became unnecessary.
- cpumask: fix function description kernel-doc notation
- MAINTAINERS: Add bits.h and bitfield.h to the BITMAP API record
Add linux/bits.h and linux/bitfield.h for visibility"
* tag 'bitmap-6.5-rc1' of https://github.com/norov/linux:
MAINTAINERS: Add bitfield.h to the BITMAP API record
MAINTAINERS: Add bits.h to the BITMAP API record
cpumask: fix function description kernel-doc notation
nodemask: Drop duplicate check in for_each_node_mask()
lib/bitmap: drop optimization of bitmap_{from,to}_arr64
lib/test_bitmap: increment failure counter properly
Commit 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated
kmalloc space than requested") added precise kmalloc redzone poisoning to
the slub_debug functionality.
However, this commit didn't account for HW_TAGS KASAN fully initializing
the object via its built-in memory initialization feature. Even though
HW_TAGS KASAN memory initialization contains special memory initialization
handling for when slub_debug is enabled, it does not account for in-object
slub_debug redzones. As a result, HW_TAGS KASAN can overwrite these
redzones and cause false-positive slub_debug reports.
To fix the issue, avoid HW_TAGS KASAN memory initialization when
slub_debug is enabled altogether. Implement this by moving the
__slub_debug_enabled check to slab_post_alloc_hook. Common slab code
seems like a more appropriate place for a slub_debug check anyway.
Commit bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13
builtins") introduced a bug into the memory_is_poisoned_n implementation:
it effectively removed the cast to a signed integer type after applying
KASAN_GRANULE_MASK.
As a result, KASAN started failing to properly check memset, memcpy, and
other similar functions.
Fix the bug by adding the cast back (through an additional signed integer
variable to make the code more readable).
I am going to lose my vrull.eu address at the end of July, and while
adding it to .mailmap I also realised that there are more old
addresses of mine dangling, so update .mailmap for all of them.
Liu Shixin [Tue, 4 Jul 2023 10:19:42 +0000 (18:19 +0800)]
bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
Commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak
in put_page_bootmem") fixed an existing kmemleak overlap problem. But
the problem still exists when HAVE_BOOTMEM_INFO_NODE is disabled,
because in this case free_bootmem_page() will call
free_reserved_page() directly.
Fix the problem by adding kmemleak_free_part() in free_bootmem_page()
when HAVE_BOOTMEM_INFO_NODE is disabled.
writeback: account the number of pages written back
nr_to_write is a count of pages, so we need to decrease it by the number
of pages in the folio we just wrote, not by 1. Most callers specify
either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes() might
end up writing 512x as many pages as it asked for.
Dave added:
: XFS is the only filesystem this would affect, right? AFAIA, nothing
: else enables large folios and uses writeback through
: write_cache_pages() at this point...
:
: In which case, I'd be surprised if much difference, if any, gets
: noticed by anyone.
Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance to
restore them. Fix it by adding a call to the arch_swap_restore() hook
before the call to swap_free().
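Schematically, the fixed ordering in do_swap_page() becomes (heavily
simplified):

  /* restore arch metadata (e.g. MTE tags) before the swap entry is freed */
  arch_swap_restore(entry, folio);
  swap_free(entry);
  /* ... */
  set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);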
Migration replaces the page in the mapping before copying the contents and
the flags over from the old page, so check that the page in the page cache
is really up to date before using it. Without this, stressing squashfs
reads with parallel compaction sometimes results in squashfs reporting
data corruption.
That happens because a BUG() statement in huge_pte_alloc() attempts to
check that a pte, if present, is a hugetlb pte, but it does so in a
non-lockless-safe manner that leads to a false BUG() report.
We got here due to a couple of bugs, each of which by itself was not quite
enough to cause a problem:
First of all, before commit c33c794828f2 ("mm: ptep_get() conversion"), the
BUG() statement in huge_pte_alloc() was itself fragile: it relied upon
compiler behavior to only read the pte once, despite using it twice in the
same conditional.
Next, commit c33c794828f2 ("mm: ptep_get() conversion") broke that
delicate situation, by causing all direct pte reads to be done via
READ_ONCE(). And so READ_ONCE() got called twice within the same BUG()
conditional, leading to comparing (potentially, occasionally) different
versions of the pte, and thus to false BUG() reports.
Fix this by taking a single snapshot of the pte before using it in the
BUG conditional.
Now, that commit is only partially to blame here but, people doing
bisections will invariably land there, so this will help them find a fix
for a real crash. And also, the previous behavior was unlikely to ever
expose this bug--it was fragile, yet not actually broken.
So that's why I chose this commit for the Fixes tag, rather than the
commit that created the original BUG() statement.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Disable per-VMA locks config to
prevent this issue until the fix is confirmed. This is expected to be a
temporary measure.
fork: lock VMAs of the parent process when forking
Patch series "Avoid memory corruption caused by per-VMA locks", v4.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking while
forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork. I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~7% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disables per-VMA locks until the fix is tested and verified.
This patch (of 2):
When forking a child process, the parent write-protects an anonymous
page and COW-shares it with the child being forked using
copy_present_pte().
The parent's TLB is flushed right before we drop the parent's
mmap_lock in dup_mmap(). If we get a write fault before that TLB flush
in the parent, and we end up replacing that anonymous page in the
parent process in do_wp_page() (because it is COW-shared with the
child), this might lead to some stale writable TLB entries targeting
the wrong (old) page. A similar issue happened in the past with
userfaultfd (see the flush_tlb_page() call inside do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue. This
fix can potentially regress some fork-heavy workloads. Kernel build time
did not show noticeable regression on a 56-core machine while a stress
test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7%
regression. If such fork time regression is unacceptable, disabling
CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations
are possible if this regression proves to be problematic.
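Schematically, the fix write-locks each parent VMA as dup_mmap() walks
them (a simplified sketch, not the literal diff):

  for_each_vma(vmi, mpnt) {
  	/* block concurrent per-VMA-lock faults on this parent VMA */
  	vma_start_write(mpnt);
  	/* ...anon_vma_fork(), copy_page_range(), etc... */
  }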
Geoff Levand [Thu, 29 Jun 2023 23:32:44 +0000 (23:32 +0000)]
ntb.rst: Fix copy and paste error
It seems the text for the NTB MSI Test Client section was copied from the
NTB Tool Test Client, but was not updated for the new section. Corrects
the NTB MSI Test Client section text.
Geoff Levand [Fri, 30 Jun 2023 21:58:46 +0000 (21:58 +0000)]
ntb_netdev: Fix module_init problem
With both the ntb_transport_init and ntb_netdev_init_module routines
in the module_init init group, the ntb_netdev_init_module routine can
be called before the ntb_transport_init routine that it depends on. To
assure the proper initialization order, put ntb_netdev_init_module in
the late_initcall group.
Fixes runtime errors where the ntb_netdev_init_module call fails with ENODEV.
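That is, roughly (sketch of the one-line change):

  /* was: module_init(ntb_netdev_init_module); */
  late_initcall(ntb_netdev_init_module);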
Cai Huoqing [Fri, 24 Mar 2023 01:32:20 +0000 (09:32 +0800)]
ntb: intel: Remove redundant pci_clear_master
Remove pci_clear_master to simplify the code; bus-mastering is also
cleared in do_pci_disable_device, like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;
Cai Huoqing [Fri, 24 Mar 2023 01:32:19 +0000 (09:32 +0800)]
ntb: epf: Remove redundant pci_clear_master
Remove pci_clear_master to simplify the code; bus-mastering is also
cleared in do_pci_disable_device, like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;