Revert "kvm: selftests: move base kvm_util.h declarations to kvm_util_base.h"
Effectively revert the movement of code from kvm_util.h => kvm_util_base.h,
as the TL;DR of the justification for the move was to avoid #idefs and/or
circular dependencies between what ended up being ucall_common.h and what
was (and now again, is), kvm_util.h.
But avoiding #ifdef and circular includes is trivial: don't do that. The
cost of removing kvm_util_base.h is a few extra includes of ucall_common.h,
but that cost is practically nothing. On the other hand, having a "base"
version of a header that is really just the header itself is confusing,
and makes it weird/hard to choose names for headers that actually are
"base" headers, e.g. to hold core KVM selftests typedefs.
KVM: selftests: Randomly force emulation on x86 writes from guest code
Override vcpu_arch_put_guest() to randomly force emulation on supported
accesses. Force emulation of LOCK CMPXCHG as well as a regular MOV to
stress KVM's emulation of atomic accesses, which has a unique path in
KVM's emulator.
Arbitrarily give all the decisions 50/50 odds; absent much, much more
sophisticated infrastructure for generating random numbers, it's highly
unlikely that doing more than a coin flip with affect selftests' ability
to find KVM bugs.
This is effectively a regression test for commit 910c57dfa4d1 ("KVM: x86:
Mark target gfn of emulated atomic instruction as dirty").
KVM: selftests: Add vcpu_arch_put_guest() to do writes from guest code
Introduce a macro, vcpu_arch_put_guest(), for "putting" values to memory
from guest code in "interesting" situations, e.g. when writing memory that
is being dirty logged. Structure the macro so that arch code can provide
a custom implementation, e.g. x86 will use the macro to force emulation of
the access.
Use the helper in dirty_log_test, which is of particular interest (see
above), and in xen_shinfo_test, which isn't all that interesting, but
provides a second usage of the macro with a different size operand
(uint8_t versus uint64_t), i.e. to help verify that the macro works for
more than just 64-bit values.
Use "put" as the verb to align with the kernel's {get,put}_user()
terminology.
KVM: selftests: Add global snapshot of kvm_is_forced_emulation_enabled()
Add a global snapshot of kvm_is_forced_emulation_enabled() and sync it to
all VMs by default so that core library code can force emulation, e.g. to
allow for easier testing of the intersections between emulation and other
features in KVM.
KVM: selftests: Provide an API for getting a random bool from an RNG
Move memstress' random bool logic into common code to avoid reinventing
the wheel for basic yes/no decisions. Provide an outer wrapper to handle
the basic/common case of just wanting a 50/50 chance of something
happening.
KVM: selftests: Provide a global pseudo-RNG instance for all tests
Add a global guest_random_state instance, i.e. a pseudo-RNG, so that an
RNG is available for *all* tests. This will allow randomizing behavior
in core library code, e.g. x86 will utilize the pRNG to conditionally
force emulation of writes from within common guest code.
To allow for deterministic runs, and to be compatible with existing tests,
allow tests to override the seed used to initialize the pRNG.
Note, the seed *must* be overwritten before a VM is created in order for
the seed to take effect, though it's perfectly fine for a test to
initialize multiple VMs with different seeds.
And as evidenced by memstress_guest_code(), it's also a-ok to instantiate
more RNGs using the global seed (or a modified version of it). The goal
of the global RNG is purely to ensure that _a_ source of random numbers is
available, it doesn't have to be the _only_ RNG.
KVM: selftests: Define _GNU_SOURCE for all selftests code
Define _GNU_SOURCE is the base CFLAGS instead of relying on selftests to
manually #define _GNU_SOURCE, which is repetitive and error prone. E.g.
kselftest_harness.h requires _GNU_SOURCE for asprintf(), but if a selftest
includes kvm_test_harness.h after stdio.h, the include guards result in
the effective version of stdio.h consumed by kvm_test_harness.h not
defining asprintf():
In file included from x86_64/fix_hypercall_test.c:12:
In file included from include/kvm_test_harness.h:11:
../kselftest_harness.h:1169:2: error: call to undeclared function
'asprintf'; ISO C99 and later do not support implicit function declarations
[-Wimplicit-function-declaration]
1169 | asprintf(&test_name, "%s%s%s.%s", f->name,
| ^
When including the rseq selftest's "library" code, #undef _GNU_SOURCE so
that rseq.c controls whether or not it wants to build with _GNU_SOURCE.
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
TDX uses different ABI to get information about VM exit. Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.
When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS. The eventual code will be
VMX:
- get exit reason, intr_info, exit_qualification, and etc from VMCS
- call NMI/INTR handlers (common code)
TDX:
- get exit reason, intr_info, exit_qualification, and etc from guest
registers
- call NMI/INTR handlers (common code)
Paolo Bonzini [Mon, 18 Mar 2024 15:39:16 +0000 (11:39 -0400)]
KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM. TDX doesn't allow VMM to operate VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures. This means we must have a TDX
version of kvm_x86_ops.
The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX. Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.
The eventually converted code will look like this:
Paolo Bonzini [Thu, 4 Apr 2024 13:17:09 +0000 (09:17 -0400)]
Merge branch 'kvm-sev-init2' into HEAD
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. The first source of variability
that was encountered is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
This series adds all the APIs that are needed to customize the features,
with room for future enhancements:
- a new /dev/kvm device attribute to retrieve the set of supported
features (right now, only debug swap)
- a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT
It then puts the new op to work by including the VMSA features as a field
of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility; but I am considering
also making them use zero as the feature mask, and will gladly adjust the
patches if so requested.
In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP
to reuse the KVM_SEV_INIT2 ioctl.
And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all,
SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT,
reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
once the VMSA has been encrypted... which is how the API should have
always behaved. Second, they also synchronize the FPU and AVX state.
Paolo Bonzini [Thu, 11 Apr 2024 17:06:24 +0000 (13:06 -0400)]
Merge branch 'mm-delete-change-gpte' into HEAD
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair. It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.
It may seem a miracle that it has never caused any hard to trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence. Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.
Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().
Paolo Bonzini [Fri, 5 Apr 2024 11:58:15 +0000 (07:58 -0400)]
mm: replace set_pte_at_notify() with just set_pte_at()
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify(). It is a synonym of
set_pte_at() and can be replaced with it.
Paolo Bonzini [Fri, 5 Apr 2024 11:58:14 +0000 (07:58 -0400)]
mmu_notifier: remove the .change_pte() callback
The scope of set_pte_at_notify() has reduced more and more through the
years. Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}(). However,
that has not been so for over ten years. During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.
Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed. For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().
Paolo Bonzini [Fri, 5 Apr 2024 11:58:12 +0000 (07:58 -0400)]
KVM: delete .change_pte MMU notifier callback
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:26 +0000 (08:13 -0400)]
selftests: kvm: split "launch" phase of SEV VM creation
Allow the caller to set the initial state of the VM. Doing this
before sev_vm_launch() matters for SEV-ES, since that is the
place where the VMSA is updated and after which the guest state
becomes sealed.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:25 +0000 (08:13 -0400)]
selftests: kvm: switch to using KVM_X86_*_VM
This removes the concept of "subtypes", instead letting the tests use proper
VM types that were recently added. While the sev_init_vm() and sev_es_init_vm()
are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT
ioctls, this is limited to VMs that are created manually with
vm_create_barebones().
Paolo Bonzini [Thu, 4 Apr 2024 12:13:23 +0000 (08:13 -0400)]
KVM: SEV: allow SEV-ES DebugSwap again
The DebugSwap feature of SEV-ES provides a way for confidential guests
to use data breakpoints. Its status is record in VMSA, and therefore
attestation signatures depend on whether it is enabled or not. In order
to avoid invalidating the signatures depending on the host machine, it
was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES
DebugSwap by default", 2024-03-09).
However, we now have a new API to create SEV VMs that allows enabling
DebugSwap based on what the user tells KVM to do, and we also changed the
legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore
possible to re-enable the feature without breaking compatibility with
kernels that pre-date the introduction of DebugSwap, so go ahead.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:22 +0000 (08:13 -0400)]
KVM: SEV: introduce KVM_SEV_INIT2 operation
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired. Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.
The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:21 +0000 (08:13 -0400)]
KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.
Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr(). However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:18 +0000 (08:13 -0400)]
KVM: x86: Add supported_vm_types to kvm_caps
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES),
and also allows the vendor module to specify which VM types are supported.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:17 +0000 (08:13 -0400)]
KVM: x86: add fields to struct kvm_arch for CoCo features
Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.
We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected. For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.
For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that when using VM types that will become an error.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:16 +0000 (08:13 -0400)]
KVM: SEV: store VMSA features in kvm_sev_info
Right now, the set of features that are stored in the VMSA upon
initialization is fixed and depends on the module parameters for
kvm-amd.ko. However, the hypervisor cannot really change it at will
because the feature word has to match between the hypervisor and whatever
computes a measurement of the VMSA for attestation purposes.
Add a field to kvm_sev_info that holds the set of features to be stored
in the VMSA; and query it instead of referring to the module parameters.
Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this
does not yet introduce any functional change, but it paves the way for
an API that allows customization of the features per-VM.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:15 +0000 (08:13 -0400)]
KVM: SEV: publish supported VMSA features
Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.
The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:14 +0000 (08:13 -0400)]
KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR
Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function. You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:13 +0000 (08:13 -0400)]
KVM: x86: use u64_to_user_ptr()
There is no danger to the kernel if 32-bit userspace provides a 64-bit
value that has the high bits set, but for whatever reason happens to
resolve to an address that has something mapped there. KVM uses the
checked version of get_user() and put_user(), so any faults are caught
properly.
Paolo Bonzini [Thu, 4 Apr 2024 12:13:12 +0000 (08:13 -0400)]
KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y
Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs
in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers
is quite confusing.
To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks
and the #VMGEXIT handler. Stubs are also restricted to functions that
check sev_enabled and to the destruction functions sev_free_cpu() and
sev_vm_destroy(), where the style of their callers is to leave checks
to the callers. Most call sites instead rely on dead code elimination
to take care of functions that are guarded with sev_guest() or
sev_es_guest().
KVM: SVM: Invert handling of SEV and SEV_ES feature flags
Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them
in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled. Aside
from the fact that sev_set_cpu_caps() is wildly misleading when it *clears*
capabilities, this will allow compiling out sev.c without falsely
advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID.
x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
srso_alias_untrain_ret() is special code, even if it is a dummy
which is called in the !SRSO case, so annotate it like its real
counterpart, to address the following objtool splat:
vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0
Merge tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire fixes from Takashi Sakamoto:
"The firewire-ohci kernel module has a parameter for verbose kernel
logging. It is well-known that it logs the spurious IRQ for bus-reset
event due to the unmasked register for IRQ event. This update fixes
the issue"
* tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: ohci: mask bus reset interrupts between ISR and bottom half
Adam Goldman [Sun, 24 Mar 2024 22:38:41 +0000 (07:38 +0900)]
firewire: ohci: mask bus reset interrupts between ISR and bottom half
In the FireWire OHCI interrupt handler, if a bus reset interrupt has
occurred, mask bus reset interrupts until bus_reset_work has serviced and
cleared the interrupt.
Normally, we always leave bus reset interrupts masked. We infer the bus
reset from the self-ID interrupt that happens shortly thereafter. A
scenario where we unmask bus reset interrupts was introduced in 2008 in a007bb857e0b26f5d8b73c2ff90782d9c0972620: If
OHCI_PARAM_DEBUG_BUSRESETS (8) is set in the debug parameter bitmask, we
will unmask bus reset interrupts so we can log them.
irq_handler logs the bus reset interrupt. However, we can't clear the bus
reset event flag in irq_handler, because we won't service the event until
later. irq_handler exits with the event flag still set. If the
corresponding interrupt is still unmasked, the first bus reset will
usually freeze the system due to irq_handler being called again each
time it exits. This freeze can be reproduced by loading firewire_ohci
with "modprobe firewire_ohci debug=-1" (to enable all debugging output).
Apparently there are also some cases where bus_reset_work will get called
soon enough to clear the event, and operation will continue normally.
This freeze was first reported a few months after a007bb85 was committed,
but until now it was never fixed. The debug level could safely be set
to -1 through sysfs after the module was loaded, but this would be
ineffectual in logging bus reset interrupts since they were only
unmasked during initialization.
irq_handler will now leave the event flag set but mask bus reset
interrupts, so irq_handler won't be called again and there will be no
freeze. If OHCI_PARAM_DEBUG_BUSRESETS is enabled, bus_reset_work will
unmask the interrupt after servicing the event, so future interrupts
will be caught as desired.
As a side effect to this change, OHCI_PARAM_DEBUG_BUSRESETS can now be
enabled through sysfs in addition to during initial module loading.
However, when enabled through sysfs, logging of bus reset interrupts will
be effective only starting with the second bus reset, after
bus_reset_work has executed.
Merge tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few small driver specific fixes, the most important being the
s3c64xx change which is likely to be hit during normal operation"
* tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: mchp-pci1xxx: Fix a possible null pointer dereference in pci1xxx_spi_probe
spi: spi-fsl-lpspi: remove redundant spi_controller_put call
spi: s3c64xx: Use DMA mode from fifo size
Merge tag 'regmap-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"Richard found a nasty corner case in the maple tree code which he
fixed, and also fixed a compiler warning which was showing up with the
toolchain he uses and helpfully identified a possible incorrect error
code which could have runtime impacts"
* tag 'regmap-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: maple: Fix uninitialized symbol 'ret' warnings
regmap: maple: Fix cache corruption in regcache_maple_drop()
* tag 'block-6.9-20240405' of git://git.kernel.dk/linux:
nvme-fc: rename free_ctrl callback to match name pattern
nvmet-fc: move RCU read lock to nvmet_fc_assoc_exists
nvmet: implement unique discovery NQN
nvme: don't create a multipath node for zero capacity devices
nvme: split nvme_update_zone_info
nvme-multipath: don't inherit LBA-related fields for the multipath node
block: fix overflow in blk_ioctl_discard()
nullblk: Fix cleanup order in null_add_dev() error path
Merge tag 'io_uring-6.9-20240405' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Backport of some fixes that came up during development of the 6.10
io_uring patches. This includes some kbuf cleanups and reference
fixes.
- Disable multishot read if we don't have NOWAIT support on the target
- Fix for a dependency issue with workqueue flushing
* tag 'io_uring-6.9-20240405' of git://git.kernel.dk/linux:
io_uring/kbuf: hold io_buffer_list reference over mmap
io_uring/kbuf: protect io_buffer_list teardown with a reference
io_uring/kbuf: get rid of bl->is_ready
io_uring/kbuf: get rid of lower BGID lists
io_uring: use private workqueue for exit work
io_uring: disable io-wq execution of multishot NOWAIT requests
io_uring/rw: don't allow multishot reads without NOWAIT support
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The most important is the libsas fix, which is a problem for DMA to a
kmalloc'd structure too small causing cache line interference. The
other fixes (all in drivers) are mostly for allocation length fixes,
error leg unwinding, suspend races and a missing retry"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix MCQ mode dev command timeout
scsi: libsas: Align SMP request allocation to ARCH_DMA_MINALIGN
scsi: sd: Unregister device if device_add_disk() failed in sd_probe()
scsi: ufs: core: WLUN suspend dev/link state error recovery
scsi: mylex: Fix sysfs buffer lengths
Merge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"8 hotfixes, 3 are cc:stable
There are a couple of fixups for this cycle's vmalloc changes and one
for the stackdepot changes. And a fix for a very old x86 PAT issue
which can cause a warning splat"
* tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
stackdepot: rename pool_index to pool_index_plus_1
x86/mm/pat: fix VM_PAT handling in COW mappings
MAINTAINERS: change vmware.com addresses to broadcom.com
selftests/mm: include strings.h for ffsl
mm: vmalloc: fix lockdep warning
mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
init: open output files from cpio unpacking with O_LARGEFILE
mm/secretmem: fix GUP-fast succeeding on secretmem folios
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"arm64/ptrace fix to use the correct SVE layout based on the saved
floating point state rather than the TIF_SVE flag. The latter may be
left on during syscalls even if the SVE state is discarded"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/ptrace: Use saved floating point state type to determine SVE layout
Merge tag 'riscv-for-linus-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for an __{get,put}_kernel_nofault to avoid an uninitialized
value causing spurious failures
- compat_vdso.so.dbg is now installed to the standard install location
- A fix to avoid initializing PERF_SAMPLE_BRANCH_*-related events, as
they aren't supported and will just later fail
- A fix to make AT_VECTOR_SIZE_ARCH correct now that we're providing
AT_MINSIGSTKSZ
- pgprot_nx() is now implemented, which fixes vmap W^X protection
- A fix for the vector save/restore code, which at least manifests as
corrupted vector state when a signal is taken
- A fix for a race condition in instruction patching
- A fix to avoid leaking the kernel-mode GP to userspace, which is a
kernel pointer leak that can be used to defeat KASLR in various ways
- A handful of smaller fixes to build warnings, an overzealous printk,
and some missing tracing annotations
* tag 'riscv-for-linus-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: process: Fix kernel gp leakage
riscv: Disable preemption when using patch_map()
riscv: Fix warning by declaring arch_cpu_idle() as noinstr
riscv: use KERN_INFO in do_trap
riscv: Fix vector state restore in rt_sigreturn()
riscv: mm: implement pgprot_nx
riscv: compat_vdso: align VDSOAS build log
RISC-V: Update AT_VECTOR_SIZE_ARCH for new AT_MINSIGSTKSZ
riscv: Mark __se_sys_* functions __used
drivers/perf: riscv: Disable PERF_SAMPLE_BRANCH_* while not supported
riscv: compat_vdso: install compat_vdso.so.dbg to /lib/modules/*/vdso/
riscv: hwprobe: do not produce frtace relocation
riscv: Fix spurious errors from __get/put_kernel_nofault
riscv: mm: Fix prototype to avoid discarding const
Merge tag 's390-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Fix missing NULL pointer check when determining guest/host fault
- Mark all functions in asm/atomic_ops.h, asm/atomic.h and
asm/preempt.h as __always_inline to avoid unwanted instrumentation
- Fix removal of a Processor Activity Instrumentation (PAI) sampling
event in PMU device driver
- Align system call table on 8 bytes
* tag 's390-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/entry: align system call table on 8 bytes
s390/pai: fix sampling event removal for PMU device driver
s390/preempt: mark all functions __always_inline
s390/atomic: mark all functions __always_inline
s390/mm: fix NULL pointer dereference
Merge tag 'pm-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a recent Energy Model change that went against a recent scheduler
change made independently (Vincent Guittot)"
* tag 'pm-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: EM: fix wrong utilization estimation in em_cpu_energy()
Merge tag 'thermal-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"These fix two power allocator thermal governor issues and an ACPI
thermal driver regression that all were introduced during the 6.8
development cycle.
Specifics:
- Allow the power allocator thermal governor to bind to a thermal
zone without cooling devices and/or without trip points (Nikita
Travkin)
- Make the ACPI thermal driver register a tripless thermal zone when
it cannot find any usable trip points instead of returning an error
from acpi_thermal_add() (Stephen Horvath)"
* tag 'thermal-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: gov_power_allocator: Allow binding without trip points
thermal: gov_power_allocator: Allow binding without cooling devices
ACPI: thermal: Register thermal zones without valid trip points
Merge tag 'gpio-fixes-for-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- make sure GPIO devices are registered with the subsystem before
trying to return them to a caller of gpio_device_find()
- fix two issues with incorrect sanitization of the interrupt labels
* tag 'gpio-fixes-for-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: cdev: fix missed label sanitizing in debounce_setup()
gpio: cdev: check for NULL labels when sanitizing them for irqs
gpiolib: Fix triggering "kobject: 'gpiochipX' is not initialized, yet" kobject_get() errors
Merge tag 'ata-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Compilation warning fixes from Arnd: one in the sata_sx4 driver due
to an incorrect calculation of the parameters passed to memcpy() and
another one in the sata_mv driver when CONFIG_PCI is not set
- Drop the owner driver field assignment in the pata_macio driver. That
is not needed as the PCI core code does that already (Krzysztof)
- Remove an unusued field in struct st_ahci_drv_data of the ahci_st
driver (Christophe)
- Add a missing clock probe error check in the sata_gemini driver
(Chen)
* tag 'ata-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: sata_gemini: Check clk_enable() result
ata: sata_mv: Fix PCI device ID table declaration compilation warning
ata: ahci_st: Remove an unused field in struct st_ahci_drv_data
ata: pata_macio: drop driver owner assignment
ata: sata_sx4: fix pdc20621_get_from_dimm() on 64-bit
Merge tag 'sound-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became a bit bigger collection of patches, but almost all are
about device-specific fixes, and should be safe for 6.9:
- Lots of ASoC Intel SOF-related fixes/updates
- Locking fixes in SoundWire drivers
- ASoC AMD ACP/SOF updates
- ASoC ES8326 codec fixes
- HD-audio codec fixes and quirks
- A regression fix in emu10k1 synth code"
* tag 'sound-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (49 commits)
ASoC: SOF: Core: Add remove_late() to sof_init_environment failure path
ASoC: SOF: amd: fix for false dsp interrupts
ASoC: SOF: Intel: lnl: Disable DMIC/SSP offload on remove
ASoC: Intel: avs: boards: Add modules description
ASoC: codecs: ES8326: Removing the control of ADC_SCALE
ASoC: codecs: ES8326: Solve a headphone detection issue after suspend and resume
ASoC: codecs: ES8326: modify clock table
ASoC: codecs: ES8326: Solve error interruption issue
ALSA: line6: Zero-initialize message buffers
ALSA: hda/realtek: cs35l41: Support ASUS ROG G634JYR
ALSA: hda/realtek: Update Panasonic CF-SZ6 quirk to support headset with microphone
ALSA: hda/realtek: Add sound quirks for Lenovo Legion slim 7 16ARHA7 models
Revert "ALSA: emu10k1: fix synthesizer sample playback position and caching"
OSS: dmasound/paula: Mark driver struct with __refdata to prevent section mismatch
ALSA: hda/realtek: Add quirks for ASUS Laptops using CS35L56
ASoC: amd: acp: fix for acp_init function error handling
ASoC: tas2781: mark dvc_tlv with __maybe_unused
ASoC: ops: Fix wraparound for mask in snd_soc_get_volsw
ASoC: rt-sdw*: add __func__ to all error logs
ASoC: rt722-sdca-sdw: fix locking sequence
...
Merge tag 'drm-fixes-2024-04-05' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly fixes, mostly xe and i915, amdgpu on a week off, otherwise a
nouveau fix for a crash with new vulkan cts tests, and a couple of
cleanups and misc fixes.
display:
- fix typos in kerneldoc
prime:
- unbreak dma-buf export for virt-gpu
nouveau:
- uvmm: fix remap address calculation
- minor cleanups
panfrost:
- fix power-transition timeouts
xe:
- Stop using system_unbound_wq for preempt fences
- Fix saving unordered rebinding fences by attaching them as kernel
feces to the vm's resv
- Fix TLB invalidation fences completing out of order
- Move rebind TLB invalidation to the ring ops to reduce the latency
i915:
- A few DisplayPort related fixes
- eDP PSR fixes
- Remove some VM space restrictions on older platforms
- Disable automatic load CCS load balancing"
* tag 'drm-fixes-2024-04-05' of https://gitlab.freedesktop.org/drm/kernel: (22 commits)
drm/xe: Use ordered wq for preempt fence waiting
drm/xe: Move vma rebinding to the drm_exec locking loop
drm/xe: Make TLB invalidation fences unordered
drm/xe: Rework rebinding
drm/xe: Use ring ops TLB invalidation for rebinds
drm/i915/mst: Reject FEC+MST on ICL
drm/i915/mst: Limit MST+DSC to TGL+
drm/i915/dp: Fix the computation for compressed_bpp for DISPLAY < 13
drm/i915/gt: Enable only one CCS for compute workload
drm/i915/gt: Do not generate the command streamer for all the CCS
drm/i915/gt: Disable HW load balancing for CCS
drm/i915/gt: Limit the reserved VM space to only the platforms that need it
drm/i915/psr: Fix intel_psr2_sel_fetch_et_alignment usage
drm/i915/psr: Move writing early transport pipe src
drm/i915/psr: Calculate PIPE_SRCSZ_ERLY_TPT value
drm/i915/dp: Remove support for UHBR13.5
drm/i915/dp: Fix DSC state HW readout for SST connectors
drm/display: fix typo
drm/prime: Unbreak virtgpu dma-buf export
nouveau/uvmm: fix addr/range calcs for remap operations
...
stackdepot: rename pool_index to pool_index_plus_1
Commit 3ee34eabac2a ("lib/stackdepot: fix first entry having a 0-handle")
changed the meaning of the pool_index field to mean "the pool index plus
1". This made the code accessing this field less self-documenting, as
well as causing debuggers such as drgn to not be able to easily remain
compatible with both old and new kernels, because they typically do that
by testing for presence of the new field. Because stackdepot is a
debugging tool, we should make sure that it is debugger friendly.
Therefore, give the field a different name to improve readability as well
as enabling debugger backwards compatibility.
This is needed in 6.9, which would otherwise become an odd release with
the new semantics and old name so debuggers wouldn't recognize the new
semantics there.
PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.
Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().
In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.
To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.
We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.
For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.
Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():
indeed it can happen if the find_vmap_area_exceed_addr_lock() gets called
concurrently because it tries to acquire two nodes locks. It was done to
prevent removing a lowest VA found on a previous step.
To address this a lowest VA is found first without holding a node lock
where it resides. As a last step we check if a VA still there because it
can go away, if removed, proceed with next lowest.
Some background. It worked before because the lock that is in question
was statically defined and initialized. As of now, the locks and data
structures are initialized in the vmalloc_init() function.
To address that issue add the check whether the "vmap_initialized"
variable is set, if not find_vmap_area() bails out on entry returning NULL.
John Sperbeck [Sat, 23 Mar 2024 15:29:34 +0000 (08:29 -0700)]
init: open output files from cpio unpacking with O_LARGEFILE
If a member of a cpio archive for an initrd or initrams is larger than
2Gb, we'll eventually fail to write to that file when we get to that
limit, unless O_LARGEFILE is set.
The problem can be seen with this recipe, assuming that BLK_DEV_RAM
is not configured:
mm/secretmem: fix GUP-fast succeeding on secretmem folios
folio_is_secretmem() currently relies on secretmem folios being LRU
folios, to save some cycles.
However, folios might reside in a folio batch without the LRU flag set, or
temporarily have their LRU flag cleared. Consequently, the LRU flag is
unreliable for this purpose.
In particular, this is the case when secretmem_fault() allocates a fresh
page and calls filemap_add_folio()->folio_add_lru(). The folio might be
added to the per-cpu folio batch and won't get the LRU flag set until the
batch was drained using e.g., lru_add_drain().
Consequently, folio_is_secretmem() might not detect secretmem folios and
GUP-fast can succeed in grabbing a secretmem folio, crashing the kernel
when we would later try reading/writing to the folio, because the folio
has been unmapped from the directmap.
Merge tag '9p-for-6.9-rc3' of https://github.com/martinetd/linux
Pull minor 9p cleanups from Dominique Martinet:
- kernel doc fix & removal of unused flag
- fix some bogus debug statement for read/write
* tag '9p-for-6.9-rc3' of https://github.com/martinetd/linux:
9p: remove SLAB_MEM_SPREAD flag usage
9p: Fix read/write debug statements to report server reply
9p/trans_fd: remove Excess kernel-doc comment
Merge tag '6.9-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Three fixes, all also for stable:
- encryption fix
- memory overrun fix
- oplock break fix"
* tag '6.9-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: do not set SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
ksmbd: validate payload size in ipc response
ksmbd: don't send oplock break if rename fails
Merge tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains a few small fixes. This comes with some delay because I
wanted to wait on people running their reproducers and the Easter
Holidays meant that those replies came in a little later than usual:
- Fix handling of preventing writes to mounted block devices.
Since last kernel we allow to prevent writing to mounted block
devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the
block device is opened with restricted writes. When we switched to
opening block devices as files we altered the mechanism by which we
recognize when a block device has been opened with write
restrictions.
The detection logic assumed that only read-write mounted
filesystems would apply write restrictions to their block devices
from other openers. That of course is not true since it also makes
sense to apply write restrictions for filesystems that are
read-only.
Fix the detection logic using an FMODE_* bit. We still have a few
left since we freed up a couple a while ago. I also picked up a
patch to free up four additional FMODE_* bits scheduled for the
next merge window.
- Fix counting the number of writers to a block device. This just
changes the logic to be consistent.
- Fix a bug in aio causing a NULL pointer derefernce after we
implemented batched processing in aio.
- Finally, add the changes we discussed that allows to yield block
devices early even though file closing itself is deferred.
This also allows us to remove two holder operations to get and
release the holder to align lifetime of file and holder of the
block device"
* tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
aio: Fix null ptr deref in aio_complete() wakeup
fs,block: yield devices early
block: count BLK_OPEN_RESTRICT_WRITES openers
block: handle BLK_OPEN_RESTRICT_WRITES correctly
Kent Overstreet [Sun, 31 Mar 2024 21:52:12 +0000 (17:52 -0400)]
aio: Fix null ptr deref in aio_complete() wakeup
list_del_init_careful() needs to be the last access to the wait queue
entry - it effectively unlocks access.
Previously, finish_wait() would see the empty list head and skip taking
the lock, and then we'd return - but the completion path would still
attempt to do the wakeup after the task_struct pointer had been
overwritten.
Merge tag 'asoc-fix-v6.9-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.9
A relatively large set of fixes here, the biggest piece of it is a
series correcting some problems with the delay reporting for Intel SOF
cards but there's a bunch of other things. Everything here is driver
specific except for a fix in the core for an issue with sign extension
handling volume controls.
Dave Airlie [Fri, 5 Apr 2024 02:25:28 +0000 (12:25 +1000)]
Merge tag 'drm-xe-fixes-2024-04-04' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
- Stop using system_unbound_wq for preempt fences,
as this can cause starvation when reaching more
than max_active defined by workqueue
- Fix saving unordered rebinding fences by attaching
them as kernel feces to the vm's resv
- Fix TLB invalidation fences completing out of order
- Move rebind TLB invalidation to the ring ops to reduce
the latency
x86/cpufeatures: Add CPUID_LNX_5 to track recently added Linux-defined word
Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate
compile-time assert in KVM to prevent direct lookups on the features in
CPUID_LNX_5. KVM uses X86_FEATURE_* flags to manage guest CPUID, and so
must translate features that are scattered by Linux from the Linux-defined
bit to the hardware-defined bit, i.e. should never try to directly access
scattered features in guest CPUID.
Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a
compile-time assert in KVM's CPUID infrastructure to ensure that future
additions update cpuid_leafs along with NCAPINTS.
Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
accompanying one of the fixes is also becoming a common occurrence.
Current release - regressions:
- ipv6: fix infinite recursion in fib6_dump_done()
- net/rds: fix possible null-deref in newly added error path
Current release - new code bugs:
- net: do not consume a full cacheline for system_page_pool
- bpf: fix bpf_arena-related file descriptor leaks in the verifier
- drv: ice: fix freeing uninitialized pointers, fixing misuse of the
newfangled __free() auto-cleanup
Previous releases - regressions:
- x86/bpf: fixes the BPF JIT with retbleed=stuff
- xen-netfront: add missing skb_mark_for_recycle, fix page pool
accounting leaks, revealed by recently added explicit warning
- tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
non-wildcard addresses
- Bluetooth:
- replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
better workarounds to un-break some buggy Qualcomm devices
- set conn encrypted before conn establishes, fix re-connecting to
some headsets which use slightly unusual sequence of msgs
- mptcp:
- prevent BPF accessing lowat from a subflow socket
- don't account accept() of non-MPC client as fallback to TCP
- drv: mana: fix Rx DMA datasize and skb_over_panic
- drv: i40e: fix VF MAC filter removal
Previous releases - always broken:
- gro: various fixes related to UDP tunnels - netns crossing
problems, incorrect checksum conversions, and incorrect packet
transformations which may lead to panics
- bpf: support deferring bpf_link dealloc to after RCU grace period
- nf_tables:
- release batch on table validation from abort path
- release mutex after nft_gc_seq_end from abort path
- flush pending destroy work before exit_net release
- drv: r8169: skip DASH fw status checks when DASH is disabled"
* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
netfilter: validate user input for expected length
net/sched: act_skbmod: prevent kernel-infoleak
net: usb: ax88179_178a: avoid the interface always configured as random address
net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
net: ravb: Always update error counters
net: ravb: Always process TX descriptor ring
netfilter: nf_tables: discard table flag update with pending basechain deletion
netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
netfilter: nf_tables: reject new basechain after table flag update
netfilter: nf_tables: flush pending destroy work before exit_net release
netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
netfilter: nf_tables: release batch on table validation from abort path
Revert "tg3: Remove residual error handling in tg3_suspend"
tg3: Remove residual error handling in tg3_suspend
net: mana: Fix Rx DMA datasize and skb_over_panic
net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
net: stmmac: fix rx queue priority assignment
net: txgbe: fix i2c dev name cannot match clkdev
net: fec: Set mac_managed_pm during probe
...
Merge tag 'bcachefs-2024-04-03' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs repair code from Kent Overstreet:
"A couple more small fixes, and new repair code.
We can now automatically recover from arbitrary corrupted interior
btree nodes by scanning, and we can reconstruct metadata as needed to
bring a filesystem back into a working, consistent, read-write state
and preserve access to whatevver wasn't corrupted.
Meaning - you can blow away all metadata except for extents and
dirents leaf nodes, and repair will reconstruct everything else and
give you your data, and under the correct paths. If inodes are missing
i_size will be slightly off and permissions/ownership/timestamps will
be gone, and we do still need the snapshots btree if snapshots were in
use - in the future we'll be able to guess the snapshot tree structure
in some situations.
IOW - aside from shaking out remaining bugs (fuzz testing is still
coming), repair code should be complete and if repair ever doesn't
work that's the highest priority bug that I want to know about
immediately.
This patchset was kindly tested by a user from India who accidentally
wiped one drive out of a three drive filesystem with no replication on
the family computer - it took a couple weeks but we got everything
important back"
* tag 'bcachefs-2024-04-03' of https://evilpiepirate.org/git/bcachefs:
bcachefs: reconstruct_inode()
bcachefs: Subvolume reconstruction
bcachefs: Check for extents that point to same space
bcachefs: Reconstruct missing snapshot nodes
bcachefs: Flag btrees with missing data
bcachefs: Topology repair now uses nodes found by scanning to fill holes
bcachefs: Repair pass for scanning for btree nodes
bcachefs: Don't skip fake btree roots in fsck
bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()
bcachefs: Etyzinger cleanups
bcachefs: bch2_shoot_down_journal_keys()
bcachefs: Clear recovery_passes_required as they complete without errors
bcachefs: ratelimit informational fsck errors
bcachefs: Check for bad needs_discard before doing discard
bcachefs: Improve bch2_btree_update_to_text()
mean_and_variance: Drop always failing tests
bcachefs: fix nocow lock deadlock
bcachefs: BCH_WATERMARK_interior_updates
bcachefs: Fix btree node reserve
Stefan O'Rear [Wed, 27 Mar 2024 06:12:58 +0000 (02:12 -0400)]
riscv: process: Fix kernel gp leakage
childregs represents the registers which are active for the new thread
in user context. For a kernel thread, childregs->gp is never used since
the kernel gp is not touched by switch_to. For a user mode helper, the
gp value can be observed in user space after execve or possibly by other
means.
[From the email thread]
The /* Kernel thread */ comment is somewhat inaccurate in that it is also used
for user_mode_helper threads, which exec a user process, e.g. /sbin/init or
when /proc/sys/kernel/core_pattern is a pipe. Such threads do not have
PF_KTHREAD set and are valid targets for ptrace etc. even before they exec.
childregs is the *user* context during syscall execution and it is observable
from userspace in at least five ways:
1. kernel_execve does not currently clear integer registers, so the starting
register state for PID 1 and other user processes started by the kernel has
sp = user stack, gp = kernel __global_pointer$, all other integer registers
zeroed by the memset in the patch comment.
This is a bug in its own right, but I'm unwilling to bet that it is the only
way to exploit the issue addressed by this patch.
2. ptrace(PTRACE_GETREGSET): you can PTRACE_ATTACH to a user_mode_helper thread
before it execs, but ptrace requires SIGSTOP to be delivered which can only
happen at user/kernel boundaries.
3. /proc/*/task/*/syscall: this is perfectly happy to read pt_regs for
user_mode_helpers before the exec completes, but gp is not one of the
registers it returns.
4. PERF_SAMPLE_REGS_USER: LOCKDOWN_PERF normally prevents access to kernel
addresses via PERF_SAMPLE_REGS_INTR, but due to this bug kernel addresses
are also exposed via PERF_SAMPLE_REGS_USER which is permitted under
LOCKDOWN_PERF. I have not attempted to write exploit code.
5. Much of the tracing infrastructure allows access to user registers. I have
not attempted to determine which forms of tracing allow access to user
registers without already allowing access to kernel registers.
Alexandre Ghiti [Tue, 26 Mar 2024 20:30:17 +0000 (21:30 +0100)]
riscv: Disable preemption when using patch_map()
patch_map() uses fixmap mappings to circumvent the non-writability of
the kernel text mapping.
The __set_fixmap() function only flushes the current cpu tlb, it does
not emit an IPI so we must make sure that while we use a fixmap mapping,
the current task is not migrated on another cpu which could miss the
newly introduced fixmap mapping.
So in order to avoid any task migration, disable the preemption.
Reported-by: Andrea Parri <[email protected]> Closes: https://lore.kernel.org/all/ZcS+GAaM25LXsBOl@andrea/ Reported-by: Andy Chiu <[email protected]> Closes: https://lore.kernel.org/linux-riscv/CABgGipUMz3Sffu-CkmeUB1dKVwVQ73+7=sgC45-m0AE9RCjOZg@mail.gmail.com/ Fixes: cad539baa48f ("riscv: implement a memset like function for text") Fixes: 0ff7c3b33127 ("riscv: Use text_mutex instead of patch_lock") Co-developed-by: Andy Chiu <[email protected]> Signed-off-by: Andy Chiu <[email protected]> Signed-off-by: Alexandre Ghiti <[email protected]> Acked-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
* tag 'nvme-6.9-2024-04-04' of git://git.infradead.org/nvme:
nvme-fc: rename free_ctrl callback to match name pattern
nvmet-fc: move RCU read lock to nvmet_fc_assoc_exists
nvmet: implement unique discovery NQN
nvme: don't create a multipath node for zero capacity devices
nvme: split nvme_update_zone_info
nvme-multipath: don't inherit LBA-related fields for the multipath node
ASoC: SOF: Core: Add remove_late() to sof_init_environment failure path
In cases where the sof driver is unable to find the firmware and/or
topology file [1], it exits without releasing the i915 runtime
pm wakeref [2]. This results in dmesg warnings[3] during
suspend/resume or driver unbind. Add remove_late() to the failure path
of sof_init_environment so that i915 wakeref is released appropriately
[1]
[ 8.990366] sof-audio-pci-intel-mtl 0000:00:1f.3: SOF firmware and/or topology file not found.
[ 8.990396] sof-audio-pci-intel-mtl 0000:00:1f.3: Supported default profiles
[ 8.990398] sof-audio-pci-intel-mtl 0000:00:1f.3: - ipc type 1 (Requested):
[ 8.990399] sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware file: intel/sof-ipc4/mtl/sof-mtl.ri
[ 8.990401] sof-audio-pci-intel-mtl 0000:00:1f.3: Topology file: intel/sof-ace-tplg/sof-mtl-rt711-2ch.tplg
[ 8.990402] sof-audio-pci-intel-mtl 0000:00:1f.3: Check if you have 'sof-firmware' package installed.
[ 8.990403] sof-audio-pci-intel-mtl 0000:00:1f.3: Optionally it can be manually downloaded from:
[ 8.990404] sof-audio-pci-intel-mtl 0000:00:1f.3: https://github.com/thesofproject/sof-bin/
[ 8.999088] sof-audio-pci-intel-mtl 0000:00:1f.3: error: sof_probe_work failed err: -2
Jakub Kicinski [Thu, 4 Apr 2024 18:37:39 +0000 (11:37 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-04-04
We've added 7 non-merge commits during the last 5 day(s) which contain
a total of 9 files changed, 75 insertions(+), 24 deletions(-).
The main changes are:
1) Fix x86 BPF JIT under retbleed=stuff which causes kernel panics due to
incorrect destination IP calculation and incorrect IP for relocations,
from Uros Bizjak and Joan Bruguera Micó.
2) Fix BPF arena file descriptor leaks in the verifier,
from Anton Protopopov.
3) Defer bpf_link deallocation to after RCU grace period as currently
running multi-{kprobes,uprobes} programs might still access cookie
information from the link, from Andrii Nakryiko.
4) Fix a BPF sockmap lock inversion deadlock in map_delete_elem reported
by syzkaller, from Jakub Sitnicki.
5) Fix resolve_btfids build with musl libc due to missing linux/types.h
include, from Natanael Copa.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, sockmap: Prevent lock inversion deadlock in map delete elem
x86/bpf: Fix IP for relocating call depth accounting
x86/bpf: Fix IP after emitting call depth accounting
bpf: fix possible file descriptor leaks in verifier
tools/resolve_btfids: fix build with musl libc
bpf: support deferring bpf_link dealloc to after RCU grace period
bpf: put uprobe link's path and task in release callback
====================
Vincent Guittot [Thu, 4 Apr 2024 10:42:00 +0000 (12:42 +0200)]
PM: EM: fix wrong utilization estimation in em_cpu_energy()
Commit 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division")
has added back map_util_perf() in em_cpu_energy() computation which has
been removed with the rework of scheduler/cpufreq interface.
This is wrong because sugov_effective_cpu_perf() already takes care of
mapping the utilization to a performance level.
Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division") Signed-off-by: Vincent Guittot <[email protected]> Reviewed-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Kent Gibson [Thu, 4 Apr 2024 09:33:28 +0000 (11:33 +0200)]
gpio: cdev: fix missed label sanitizing in debounce_setup()
When adding sanitization of the label, the path through
edge_detector_setup() that leads to debounce_setup() was overlooked.
A request taking this path does not allocate a new label and the
request label is freed twice when the request is released, resulting
in memory corruption.
Add label sanitization to debounce_setup().
Cc: [email protected] Fixes: b34490879baa ("gpio: cdev: sanitize the label before requesting the interrupt") Signed-off-by: Kent Gibson <[email protected]>
[Bartosz: rebased on top of the fix for empty GPIO labels] Co-developed-by: Bartosz Golaszewski <[email protected]> Signed-off-by: Bartosz Golaszewski <[email protected]>
The buggy address belongs to the object at ffff88802cd73da0
which belongs to the cache kmalloc-8 of size 8
The buggy address is located 0 bytes inside of
allocated 1-byte region [ffff88802cd73da0, ffff88802cd73da1)
Memory state around the buggy address: ffff88802cd73c80: 07 fc fc fc 05 fc fc fc 05 fc fc fc fa fc fc fc ffff88802cd73d00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
>ffff88802cd73d80: fa fc fc fc 01 fc fc fc fa fc fc fc fa fc fc fc
^ ffff88802cd73e00: fa fc fc fc fa fc fc fc 05 fc fc fc 07 fc fc fc ffff88802cd73e80: 07 fc fc fc 07 fc fc fc 07 fc fc fc 07 fc fc fc
Jakub Kicinski [Thu, 4 Apr 2024 16:38:52 +0000 (09:38 -0700)]
Merge tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patch #1 unlike early commit path stage which triggers a call to abort,
an explicit release of the batch is required on abort, otherwise
mutex is released and commit_list remains in place.
Patch #2 release mutex after nft_gc_seq_end() in commit path, otherwise
async GC worker could collect expired objects.
Patch #3 flush pending destroy work in module removal path, otherwise UaF
is possible.
Patch #4 and #6 restrict the table dormant flag with basechain updates
to fix state inconsistency in the hook registration.
Patch #5 adds missing RCU read side lock to flowtable type to avoid races
with module removal.
* tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: discard table flag update with pending basechain deletion
netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
netfilter: nf_tables: reject new basechain after table flag update
netfilter: nf_tables: flush pending destroy work before exit_net release
netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
netfilter: nf_tables: release batch on table validation from abort path
====================
net: usb: ax88179_178a: avoid the interface always configured as random address
After the commit d2689b6a86b9 ("net: usb: ax88179_178a: avoid two
consecutive device resets"), reset is not executed from bind operation and
mac address is not read from the device registers or the devicetree at that
moment. Since the check to configure if the assigned mac address is random
or not for the interface, happens after the bind operation from
usbnet_probe, the interface keeps configured as random address, although the
address is correctly read and set during open operation (the only reset
now).
In order to keep only one reset for the device and to avoid the interface
always configured as random address, after reset, configure correctly the
suitable field from the driver, if the mac address is read successfully from
the device registers or the devicetree. Take into account if a locally
administered address (random) was previously stored.
Hannes Reinecke [Wed, 3 Apr 2024 11:31:14 +0000 (13:31 +0200)]
nvmet: implement unique discovery NQN
Unique discovery NQNs allow to differentiate between discovery
services from (typically physically separate) NVMe-oF subsystems.
This is required for establishing secured connections as otherwise
the credentials won't be unique and the integrity of the connection
cannot be guaranteed.
This patch adds a configfs attribute 'discovery_nqn' in the 'nvmet'
configfs directory to specify the unique discovery NQN.
nvme: don't create a multipath node for zero capacity devices
Apparently there are nvme controllers around that report namespaces
in the namespace list which have zero capacity. Return -ENXIO instead
of -ENODEV from nvme_update_ns_info_block so we don't create a hidden
multipath node for these namespaces but entirely ignore them.
gpio: cdev: check for NULL labels when sanitizing them for irqs
We need to take into account that a line's consumer label may be NULL
and not try to kstrdup() it in that case but rather pass the NULL
pointer up the stack to the interrupt request function.
To that end: let make_irq_label() return NULL as a valid return value
and use ERR_PTR() instead to signal an allocation failure to callers.
Matthew Brost [Mon, 1 Apr 2024 22:19:11 +0000 (15:19 -0700)]
drm/xe: Use ordered wq for preempt fence waiting
Preempt fences can sleep waiting for an exec queue suspend operation to
complete. If the system_unbound_wq is used for waiting and the number of
waiters exceeds max_active this will result in other users of the
system_unbound_wq getting starved. Use a device private work queue for
preempt fences to avoid starvation of the system_unbound_wq.
Even though suspend operations can complete out-of-order, all suspend
operations within a VM need to complete before the preempt rebind worker
can start. With that, use a device private ordered wq for preempt fence
waiting.
v2:
- Add comment about cleanup on failure (Matt R)
- Update commit message (Lucas)
Thomas Hellström [Wed, 27 Mar 2024 09:11:36 +0000 (10:11 +0100)]
drm/xe: Move vma rebinding to the drm_exec locking loop
Rebinding might allocate page-table bos, causing evictions.
To support blocking locking during these evictions,
perform the rebinding in the drm_exec locking loop.
Also Reserve fence slots where actually needed rather than trying to
predict how many fence slots will be needed over a complete
wound-wait transaction.
v2:
- Remove a leftover call to xe_vm_rebind() (Matt Brost)
- Add a helper function xe_vm_validate_rebind() (Matt Brost)
v3:
- Add comments and squash with previous patch (Matt Brost)
Thomas Hellström [Wed, 27 Mar 2024 09:11:34 +0000 (10:11 +0100)]
drm/xe: Rework rebinding
Instead of handling the vm's rebind fence separately,
which is error prone if they are not strictly ordered,
attach rebind fences as kernel fences to the vm's resv.