Paolo Bonzini [Thu, 11 Jul 2024 22:27:54 +0000 (18:27 -0400)]
KVM: extend kvm_range_has_memory_attributes() to check subset of attributes
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE,
KVM code such as kvm_mem_is_private() is written to expect their existence.
Allow using kvm_range_has_memory_attributes() as a multi-page version of
kvm_mem_is_private(), without it breaking later when more attributes are
introduced.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:53 +0000 (18:27 -0400)]
KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()
Use a guard to simplify early returns, and add two more easy
shortcuts. If the requested attributes are invalid, the attributes
xarray will never show them as set. And if testing a single page,
kvm_get_memory_attributes() is more efficient.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:52 +0000 (18:27 -0400)]
KVM: guest_memfd: move check for already-populated page to common code
Do not allow populating the same page twice with startup data. In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.
Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.
Races between calls to the ioctl are not possible because
kvm_gmem_populate() holds slots_lock and the VM should not be running.
But again, even if this worked on other confidential computing technology,
it doesn't matter to guest_memfd.c whether this is something fishy
such as missing synchronization in userspace, or rather something
intentional. One of the racers wins, and the page is initialized by
either kvm_gmem_prepare_folio() or kvm_gmem_populate().
Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
the same errno that kvm_gmem_populate() is using.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:51 +0000 (18:27 -0400)]
KVM: remove kvm_arch_gmem_prepare_needed()
It is enough to return 0 if a guest need not do any preparation.
This is in fact how sev_gmem_prepare() works for non-SNP guests,
and it extends naturally to Intel hosts: the x86 callback for
gmem_prepare is optional and returns 0 if not defined.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:50 +0000 (18:27 -0400)]
KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvm
This is now possible because preparation is done by kvm_gmem_get_pfn()
instead of fallocate(). In practice this is not a limitation, because
even though guest_memfd can be bound to multiple struct kvm, for
hardware implementations of confidential computing only one guest
(identified by an ASID on SEV-SNP, or an HKID on TDX) will be able
to access it.
In the case of intra-host migration (not implemented yet for SEV-SNP,
but we can use SEV-ES as an idea of how it will work), the new struct
kvm inherits the same ASID and preparation need not be repeated.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:49 +0000 (18:27 -0400)]
KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guest
Initializing the contents of the folio on fallocate() is unnecessarily
restrictive. It means that the page is registered with the firmware and
then it cannot be touched anymore. In particular, this loses the
possibility of using fallocate() to pre-allocate the page for SEV-SNP
guests, because kvm_arch_gmem_prepare() then fails.
It's only when the guest actually accesses the page (and therefore
kvm_gmem_get_pfn() is called) that the page must be cleared from any
stale host data and registered with the firmware. The up-to-date flag
is clear if this has to be done (i.e. it is the first access and
kvm_gmem_populate() has not been called).
All in all, there are enough differences between kvm_gmem_get_pfn() and
kvm_gmem_populate(), that it's better to separate the two flows completely.
Extract the bulk of kvm_gmem_get_folio(), which take a folio and end up
setting its up-to-date flag, to a new function kvm_gmem_prepare_folio();
these are now done only by the non-__-prefixed kvm_gmem_get_pfn().
As a bonus, __kvm_gmem_get_pfn() loses its ugly "bool prepare" argument.
One difference is that fallocate(PUNCH_HOLE) can now race with a
page fault. Potentially this causes a page to be prepared and into the
filemap even after fallocate(PUNCH_HOLE). This is harmless, as it can be
fixed by another hole punching operation, and can be avoided by clearing
the private-page attribute prior to invoking fallocate(PUNCH_HOLE).
This way, the page fault will cause an exit to user space.
The previous semantics, where fallocate() could be used to prepare
the pages in advance of running the guest, can be accessed with
KVM_PRE_FAULT_MEMORY.
For now, accessing a page in one VM will attempt to call
kvm_arch_gmem_prepare() in all of those that have bound the guest_memfd.
Cleaning this up is left to a separate patch.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:47 +0000 (18:27 -0400)]
KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*
Add "ARCH" to the symbols; shortly, the "prepare" phase will include both
the arch-independent step to clear out contents left in the page by the
host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE.
For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:45 +0000 (18:27 -0400)]
KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparation
The up-to-date flag as is now is not too useful; it tells guest_memfd not
to overwrite the contents of a folio, but it doesn't say that the page
is ready to be mapped into the guest. For encrypted guests, mapping
a private page requires that the "preparation" phase has succeeded,
and at the same time the same page cannot be prepared twice.
So, ensure that folio_mark_uptodate() is only called on a prepared page. If
kvm_gmem_prepare_folio() or the post_populate callback fail, the folio
will not be marked up-to-date; it's not a problem to call clear_highpage()
again on such a page prior to the next preparation attempt.
Paolo Bonzini [Thu, 11 Jul 2024 22:27:44 +0000 (18:27 -0400)]
KVM: guest_memfd: return folio from __kvm_gmem_get_pfn()
Right now this is simply more consistent and avoids use of pfn_to_page()
and put_page(). It will be put to more use in upcoming patches, to
ensure that the up-to-date flag is set at the very end of both the
kvm_gmem_get_pfn() and kvm_gmem_populate() flows.
Paolo Bonzini [Wed, 17 Jul 2024 17:04:48 +0000 (13:04 -0400)]
KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:
thread A, sev_gmem_post_populate: called
thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
RMP #PF
Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.
Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:
"KVM maps memory as if the vCPU generated a stage-2 read page fault"
which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.
Michael Roth [Wed, 1 May 2024 08:52:10 +0000 (03:52 -0500)]
crypto: ccp: Add the SNP_VLEK_LOAD command
When requesting an attestation report a guest is able to specify whether
it wants SNP firmware to sign the report using either a Versioned Chip
Endorsement Key (VCEK), which is derived from chip-unique secrets, or a
Versioned Loaded Endorsement Key (VLEK) which is obtained from an AMD
Key Derivation Service (KDS) and derived from seeds allocated to
enrolled cloud service providers (CSPs).
For VLEK keys, an SNP_VLEK_LOAD SNP firmware command is used to load
them into the system after obtaining them from the KDS. Add a
corresponding userspace interface so to allow the loading of VLEK keys
into the system.
See SEV-SNP Firmware ABI 1.54, SNP_VLEK_LOAD for more details.
Wei Wang [Tue, 7 May 2024 13:31:02 +0000 (21:31 +0800)]
KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
Introduces kvm_x86_call(), to streamline the usage of static calls of
kvm_x86_ops. The current implementation of these calls is verbose and
could lead to alignment challenges. This makes the code susceptible to
exceeding the "80 columns per single line of code" limit as defined in
the coding-style document. Another issue with the existing implementation
is that the addition of kvm_x86_ prefix to hooks at the static_call sites
hinders code readability and navigation. kvm_x86_call() is added to
improve code readability and maintainability, while adhering to the coding
style guidelines.
Wei Wang [Tue, 7 May 2024 13:31:01 +0000 (21:31 +0800)]
KVM: x86: Replace static_call_cond() with static_call()
The use of static_call_cond() is essentially the same as static_call() on
x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace
it with static_call() to simplify the code.
Paolo Bonzini [Tue, 16 Jul 2024 15:44:23 +0000 (11:44 -0400)]
Merge branch 'kvm-6.11-sev-attestation' into HEAD
The GHCB 2.0 specification defines 2 GHCB request types to allow SNP guests
to send encrypted messages/requests to firmware: SNP Guest Requests and SNP
Extended Guest Requests. These encrypted messages are used for things like
servicing attestation requests issued by the guest. Implementing support for
these is required to be fully GHCB-compliant.
For the most part, KVM only needs to handle forwarding these requests to
firmware (to be issued via the SNP_GUEST_REQUEST firmware command defined
in the SEV-SNP Firmware ABI), and then forwarding the encrypted response to
the guest.
However, in the case of SNP Extended Guest Requests, the host is also
able to provide the certificate data corresponding to the endorsement key
used by firmware to sign attestation report requests. This certificate data
is provided by userspace because:
1) It allows for different keys/key types to be used for each particular
guest with requiring any sort of KVM API to configure the certificate
table in advance on a per-guest basis.
2) It provides additional flexibility with how attestation requests might
be handled during live migration where the certificate data for
source/dest might be different.
3) It allows all synchronization between certificates and firmware/signing
key updates to be handled purely by userspace rather than requiring
some in-kernel mechanism to facilitate it. [1]
To support fetching certificate data from userspace, a new KVM exit type will
be needed to handle fetching the certificate from userspace. An attempt to
define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle this
was introduced in v1 of this patchset, but is still being discussed by
community, so for now this patchset only implements a stub version of SNP
Extended Guest Requests that does not provide certificate data, but is still
enough to provide compliance with the GHCB 2.0 spec.
Michael Roth [Mon, 1 Jul 2024 22:31:48 +0000 (17:31 -0500)]
KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
Version 2 of GHCB specification added support for the SNP Extended Guest
Request Message NAE event. This event serves a nearly identical purpose
to the previously-added SNP_GUEST_REQUEST event, but for certain message
types it allows the guest to supply a buffer to be used for additional
information in some cases.
Currently the GHCB spec only defines extended handling of this sort in
the case of attestation requests, where the additional buffer is used to
supply a table of certificate data corresponding to the attestion
report's signing key. Support for this extended handling will require
additional KVM APIs to handle coordinating with userspace.
Whether or not the hypervisor opts to provide this certificate data is
optional. However, support for processing SNP_EXTENDED_GUEST_REQUEST
GHCB requests is required by the GHCB 2.0 specification for SNP guests,
so for now implement a stub implementation that provides an empty
certificate table to the guest if it supplies an additional buffer, but
otherwise behaves identically to SNP_GUEST_REQUEST.
Michael Roth [Mon, 1 Jul 2024 22:31:47 +0000 (17:31 -0500)]
x86/sev: Move sev_guest.h into common SEV header
sev_guest.h currently contains various definitions relating to the
format of SNP_GUEST_REQUEST commands to SNP firmware. Currently only the
sev-guest driver makes use of them, but when the KVM side of this is
implemented there's a need to parse the SNP_GUEST_REQUEST header to
determine whether additional information needs to be provided to the
guest. Prepare for this by moving those definitions to a common header
that's shared by host/guest code so that KVM can also make use of them.
KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
Version 2 of GHCB specification added support for the SNP Guest Request
Message NAE event. The event allows for an SEV-SNP guest to make
requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
This is used by guests primarily to request attestation reports from
firmware. There are other request types are available as well, but the
specifics of what guest requests are being made generally does not
affect how they are handled by the hypervisor, which only serves as a
proxy for the guest requests and firmware responses.
Implement handling for these events.
When an SNP Guest Request is issued, the guest will provide its own
request/response pages, which could in theory be passed along directly
to firmware. However, these pages would need special care:
- Both pages are from shared guest memory, so they need to be
protected from migration/etc. occurring while firmware reads/writes
to them. At a minimum, this requires elevating the ref counts and
potentially needing an explicit pinning of the memory. This places
additional restrictions on what type of memory backends userspace
can use for shared guest memory since there would be some reliance
on using refcounted pages.
- The response page needs to be switched to Firmware-owned state
before the firmware can write to it, which can lead to potential
host RMP #PFs if the guest is misbehaved and hands the host a
guest page that KVM is writing to for other reasons (e.g. virtio
buffers).
Both of these issues can be avoided completely by using
separately-allocated bounce pages for both the request/response pages
and passing those to firmware instead. So that's the approach taken
here.
KVM: x86: Suppress MMIO that is triggered during task switch emulation
Explicitly suppress userspace emulated MMIO exits that are triggered when
emulating a task switch as KVM doesn't support userspace MMIO during
complex (multi-step) emulation. Silently ignoring the exit request can
result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to
userspace for some other reason prior to purging mmio_needed.
See commit 0dc902267cb3 ("KVM: x86: Suppress pending MMIO write exits if
emulator detects exception") for more details on KVM's limitations with
respect to emulated MMIO during complex emulator flows.
KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
Tweak the definition of make_huge_page_split_spte() to eliminate an
unnecessarily long line, and opportunistically initialize child_spte to
make it more obvious that the child is directly derived from the huge
parent.
KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
Bug the VM instead of simply warning if KVM tries to split a SPTE that is
non-present or not-huge. KVM is guaranteed to end up in a broken state as
the callers fully expect a valid SPTE, e.g. the shadow MMU will add an
rmap entry, and all MMUs will account the expected small page. Returning
'0' is also technically wrong now that SHADOW_NONPRESENT_VALUE exists,
i.e. would cause KVM to create a potential #VE SPTE.
While it would be possible to have the callers gracefully handle failure,
doing so would provide no practical value as the scenario really should be
impossible, while the error handling would add a non-trivial amount of
noise.
Paolo Bonzini [Tue, 16 Jul 2024 13:56:41 +0000 (09:56 -0400)]
Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.11
- Remove an unnecessary EPT TLB flush when enabling hardware.
- Fix a series of bugs that cause KVM to fail to detect nested pending posted
interrupts as valid wake eents for a vCPU executing HLT in L2 (with
HLT-exiting disable by L1).
Paolo Bonzini [Tue, 16 Jul 2024 13:54:57 +0000 (09:54 -0400)]
Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MTRR virtualization removal
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
Paolo Bonzini [Tue, 16 Jul 2024 13:51:36 +0000 (09:51 -0400)]
Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
- Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
- Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
Paolo Bonzini [Tue, 16 Jul 2024 13:51:14 +0000 (09:51 -0400)]
Merge tag 'kvm-x86-fixes-6.10-11' of https://github.com/kvm-x86/linux into HEAD
KVM Xen:
Fix a bug where KVM fails to check the validity of an incoming userspace
virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
Oliver Upton [Sun, 14 Jul 2024 00:28:57 +0000 (00:28 +0000)]
Merge branch kvm-arm64/docs into kvmarm/next
* kvm-arm64/docs:
: KVM Documentation fixes, courtesy of Changyuan Lyu
:
: Small set of typo fixes / corrections to the KVM API documentation
: relating to MSIs and arm64 VGIC UAPI.
MAINTAINERS: Include documentation in KVM/arm64 entry
KVM: Documentation: Correct the VGIC V2 CPU interface addr space size
KVM: Documentation: Enumerate allowed value macros of `irq_type`
KVM: Documentation: Fix typo `BFD`
Oliver Upton [Sun, 14 Jul 2024 00:28:30 +0000 (00:28 +0000)]
Merge branch kvm-arm64/nv-tcr2 into kvmarm/next
* kvm-arm64/nv-tcr2:
: Fixes to the handling of TCR_EL1, courtesy of Marc Zyngier
:
: Series addresses a couple gaps that are present in KVM (from cover
: letter):
:
: - VM configuration: HCRX_EL2.TCR2En is forced to 1, and we blindly
: save/restore stuff.
:
: - trap bit description and routing: none, obviously, since we make a
: point in not trapping.
KVM: arm64: Honor trap routing for TCR2_EL1
KVM: arm64: Make PIR{,E0}_EL1 save/restore conditional on FEAT_TCRX
KVM: arm64: Make TCR2_EL1 save/restore dependent on the VM features
KVM: arm64: Get rid of HCRX_GUEST_FLAGS
KVM: arm64: Correctly honor the presence of FEAT_TCRX
Oliver Upton [Sun, 14 Jul 2024 00:27:01 +0000 (00:27 +0000)]
Merge branch kvm-arm64/nv-sve into kvmarm/next
* kvm-arm64/nv-sve:
: CPTR_EL2, FPSIMD/SVE support for nested
:
: This series brings support for honoring the guest hypervisor's CPTR_EL2
: trap configuration when running a nested guest, along with support for
: FPSIMD/SVE usage at L1 and L2.
KVM: arm64: Allow the use of SVE+NV
KVM: arm64: nv: Add additional trap setup for CPTR_EL2
KVM: arm64: nv: Add trap description for CPTR_EL2
KVM: arm64: nv: Add TCPAC/TTA to CPTR->CPACR conversion helper
KVM: arm64: nv: Honor guest hypervisor's FP/SVE traps in CPTR_EL2
KVM: arm64: nv: Load guest FP state for ZCR_EL2 trap
KVM: arm64: nv: Handle CPACR_EL1 traps
KVM: arm64: Spin off helper for programming CPTR traps
KVM: arm64: nv: Ensure correct VL is loaded before saving SVE state
KVM: arm64: nv: Use guest hypervisor's max VL when running nested guest
KVM: arm64: nv: Save guest's ZCR_EL2 when in hyp context
KVM: arm64: nv: Load guest hyp's ZCR into EL1 state
KVM: arm64: nv: Handle ZCR_EL2 traps
KVM: arm64: nv: Forward SVE traps to guest hypervisor
KVM: arm64: nv: Forward FP/ASIMD traps to guest hypervisor
Oliver Upton [Sun, 14 Jul 2024 00:15:00 +0000 (00:15 +0000)]
Merge branch kvm-arm64/ctr-el0 into kvmarm/next
* kvm-arm64/ctr-el0:
: Support for user changes to CTR_EL0, courtesy of Sebastian Ott
:
: Allow userspace to change the guest-visible value of CTR_EL0 for a VM,
: so long as the requested value represents a subset of features supported
: by hardware. In other words, prevent the VMM from over-promising the
: capabilities of hardware.
:
: Make this happen by fitting CTR_EL0 into the existing infrastructure for
: feature ID registers.
KVM: selftests: Assert that MPIDR_EL1 is unchanged across vCPU reset
KVM: arm64: nv: Unfudge ID_AA64PFR0_EL1 masking
KVM: selftests: arm64: Test writes to CTR_EL0
KVM: arm64: rename functions for invariant sys regs
KVM: arm64: show writable masks for feature registers
KVM: arm64: Treat CTR_EL0 as a VM feature ID register
KVM: arm64: unify code to prepare traps
KVM: arm64: nv: Use accessors for modifying ID registers
KVM: arm64: Add helper for writing ID regs
KVM: arm64: Use read-only helper for reading VM ID registers
KVM: arm64: Make idregs debugfs iterator search sysreg table directly
KVM: arm64: Get sys_reg encoding from descriptor in idregs_debug_show()
Oliver Upton [Sun, 14 Jul 2024 00:11:45 +0000 (00:11 +0000)]
Merge branch kvm-arm64/shadow-mmu into kvmarm/next
* kvm-arm64/shadow-mmu:
: Shadow stage-2 MMU support for NV, courtesy of Marc Zyngier
:
: Initial implementation of shadow stage-2 page tables to support a guest
: hypervisor. In the author's words:
:
: So here's the 10000m (approximately 30000ft for those of you stuck
: with the wrong units) view of what this is doing:
:
: - for each {VMID,VTTBR,VTCR} tuple the guest uses, we use a
: separate shadow s2_mmu context. This context has its own "real"
: VMID and a set of page tables that are the combination of the
: guest's S2 and the host S2, built dynamically one fault at a time.
:
: - these shadow S2 contexts are ephemeral, and behave exactly as
: TLBs. For all intent and purposes, they *are* TLBs, and we discard
: them pretty often.
:
: - TLB invalidation takes three possible paths:
:
: * either this is an EL2 S1 invalidation, and we directly emulate
: it as early as possible
:
: * or this is an EL1 S1 invalidation, and we need to apply it to
: the shadow S2s (plural!) that match the VMID set by the L1 guest
:
: * or finally, this is affecting S2, and we need to teardown the
: corresponding part of the shadow S2s, which invalidates the TLBs
KVM: arm64: nv: Truely enable nXS TLBI operations
KVM: arm64: nv: Add handling of NXS-flavoured TLBI operations
KVM: arm64: nv: Add handling of range-based TLBI operations
KVM: arm64: nv: Add handling of outer-shareable TLBI operations
KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like information
KVM: arm64: nv: Tag shadow S2 entries with guest's leaf S2 level
KVM: arm64: nv: Handle FEAT_TTL hinted TLB operations
KVM: arm64: nv: Handle TLBI IPAS2E1{,IS} operations
KVM: arm64: nv: Handle TLBI ALLE1{,IS} operations
KVM: arm64: nv: Handle TLBI VMALLS12E1{,IS} operations
KVM: arm64: nv: Handle TLB invalidation targeting L2 stage-1
KVM: arm64: nv: Handle EL2 Stage-1 TLB invalidation
KVM: arm64: nv: Add Stage-1 EL2 invalidation primitives
KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
KVM: arm64: nv: Handle shadow stage 2 page faults
KVM: arm64: nv: Implement nested Stage-2 page table walk logic
KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
Oliver Upton [Sun, 14 Jul 2024 00:11:34 +0000 (00:11 +0000)]
Merge branch kvm-arm64/ffa-1p1 into kvmarm/next
* kvm-arm64/ffa-1p1:
: Improvements to the pKVM FF-A Proxy, courtesy of Sebastian Ene
:
: Various minor improvements to how host FF-A calls are proxied with the
: TEE, along with support for v1.1 of the protocol.
KVM: arm64: Use FF-A 1.1 with pKVM
KVM: arm64: Update the identification range for the FF-A smcs
KVM: arm64: Add support for FFA_PARTITION_INFO_GET
KVM: arm64: Trap FFA_VERSION host call in pKVM
Oliver Upton [Sun, 14 Jul 2024 00:11:26 +0000 (00:11 +0000)]
Merge branch kvm-arm64/misc into kvmarm/next
* kvm-arm64/misc:
: Miscellaneous updates
:
: - Provide a command-line parameter to statically control the WFx trap
: selection in KVM
:
: - Make sysreg masks allocation accounted
Revert "KVM: arm64: nv: Fix RESx behaviour of disabled FGTs with negative polarity"
KVM: arm64: nv: Use GFP_KERNEL_ACCOUNT for sysreg_masks allocation
KVM: arm64: nv: Fix RESx behaviour of disabled FGTs with negative polarity
KVM: arm64: Add early_param to control WFx trapping
Paolo Bonzini [Fri, 12 Jul 2024 15:22:39 +0000 (11:22 -0400)]
Merge tag 'kvm-s390-next-6.11-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
Assortment of tiny fixes which are not time critical:
- Rejecting memory region operations for ucontrol mode VMs
- Rewind the PSW on host intercepts for VSIE
- Remove unneeded include
Paolo Bonzini [Fri, 12 Jul 2024 15:18:45 +0000 (11:18 -0400)]
Merge branch 'kvm-prefault' into HEAD
Pre-population has been requested several times to mitigate KVM page faults
during guest boot or after live migration. It is also required by TDX
before filling in the initial guest memory with measured contents.
Introduce it as a generic API.
KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
Add a test case to exercise KVM_PRE_FAULT_MEMORY and run the guest to access the
pre-populated area. It tests KVM_PRE_FAULT_MEMORY ioctl for KVM_X86_DEFAULT_VM
and KVM_X86_SW_PROTECTED_VM.
Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest
memory. It can be called right after KVM_CREATE_VCPU creates a vCPU,
since at that point kvm_mmu_create() and kvm_init_mmu() are called and
the vCPU is ready to invoke the KVM page fault handler.
The helper function kvm_tdp_map_page() takes care of the logic to
process RET_PF_* return values and convert them to success or errno.
KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
Move the accounting of the result of kvm_mmu_do_page_fault() to its
callers, as only pf_fixed is common to guest page faults and async #PFs,
and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats.
KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault
handler, instead of conditionally bumping it in kvm_mmu_do_page_fault().
The "real" page fault handler is the only path that should ever increment
the number of taken page faults, as all other paths that "do page fault"
are by definition not handling faults that occurred in the guest.
KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the
memory range and calls the arch-specific function. The implementation is
optional and enabled by a Kconfig symbol.
Adds documentation of KVM_PRE_FAULT_MEMORY ioctl. [1]
It populates guest memory. It doesn't do extra operations on the
underlying technology-specific initialization [2]. For example,
CoCo-related operations won't be performed. Concretely for TDX, this API
won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND(). Vendor-specific APIs
are required for such operations.
The key point is to adapt of vcpu ioctl instead of VM ioctl. First,
populating guest memory requires vcpu. If it is VM ioctl, we need to pick
one vcpu somehow. Secondly, vcpu ioctl allows each vcpu to invoke this
ioctl in parallel. It helps to scale regarding guest memory size, e.g.,
hundreds of GB.
Paolo Bonzini [Thu, 11 Jul 2024 17:56:54 +0000 (13:56 -0400)]
mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
The flags AS_UNMOVABLE and AS_INACCESSIBLE were both added just for guest_memfd;
AS_UNMOVABLE is already in existing versions of Linux, while AS_INACCESSIBLE was
acked for inclusion in 6.11.
But really, they are the same thing: only guest_memfd uses them, at least for
now, and guest_memfd pages are unmovable because they should not be
accessed by the CPU.
So merge them into one; use the AS_INACCESSIBLE name which is more comprehensive.
At the same time, this fixes an embarrassing bug where AS_INACCESSIBLE was used
as a bit mask, despite it being just a bit index.
The bug was mostly benign, because AS_INACCESSIBLE's bit representation (1010)
corresponded to setting AS_UNEVICTABLE (which is already set) and AS_ENOSPC
(except no async writes can happen on the guest_memfd). So the AS_INACCESSIBLE
flag simply had no effect.
Fixes: 1d23040caa8b ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode") Fixes: c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory") Cc: [email protected] Acked-by: Vlastimil Babka <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: Michael Roth <[email protected]> Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Bibo Mao [Tue, 9 Jul 2024 08:25:51 +0000 (16:25 +0800)]
LoongArch: KVM: Add PV steal time support in guest side
Per-cpu struct kvm_steal_time is added here, its size is 64 bytes and
also defined as 64 bytes, so that the whole structure is in one physical
page.
When a VCPU is online, function pv_enable_steal_time() is called. This
function will pass guest physical address of struct kvm_steal_time and
tells hypervisor to enable steal time. When a vcpu is offline, physical
address is set as 0 and tells hypervisor to disable steal time.
Here is an output of vmstat on guest when there is workload on both host
and guest. It shows steal time stat information.
procs -----------memory---------- -----io---- -system-- ------cpu-----
r b swpd free inact active bi bo in cs us sy id wa st
15 1 0 7583616 184112 72208 20 0 162 52 31 6 43 0 20
17 0 0 7583616 184704 72192 0 0 6318 6885 5 60 8 5 22
16 0 0 7583616 185392 72144 0 0 1766 1081 0 49 0 1 50
16 0 0 7583616 184816 72304 0 0 6300 6166 4 62 12 2 20
18 0 0 7583632 184480 72240 0 0 2814 1754 2 58 4 1 35
Bibo Mao [Tue, 9 Jul 2024 08:25:51 +0000 (16:25 +0800)]
LoongArch: KVM: Add PV steal time support in host side
Add ParaVirt steal time feature in host side, VM can search supported
features provided by KVM hypervisor, a feature KVM_FEATURE_STEAL_TIME
is added here. Like x86, steal time structure is saved in guest memory,
one hypercall function KVM_HCALL_FUNC_NOTIFY is added to notify KVM to
enable this feature.
One CPU attr ioctl command KVM_LOONGARCH_VCPU_PVTIME_CTRL is added to
save and restore the base address of steal time structure when a VM is
migrated.
Bibo Mao [Tue, 9 Jul 2024 08:25:51 +0000 (16:25 +0800)]
LoongArch: KVM: Mark page accessed and dirty with page ref added
Function kvm_map_page_fast() is fast path of secondary mmu page fault
flow, pfn is parsed from secondary mmu page table walker. However the
corresponding page reference is not added, it is dangerious to access
page out of mmu_lock.
Here page ref is added inside mmu_lock, function kvm_set_pfn_accessed()
and kvm_set_pfn_dirty() is called with page ref added, so that the page
will not be freed by others.
Also kvm_set_pfn_accessed() is removed here since it is called in the
following function kvm_release_pfn_clean().
Bibo Mao [Tue, 9 Jul 2024 08:25:51 +0000 (16:25 +0800)]
LoongArch: KVM: Add memory barrier before update pmd entry
When updating pmd entry such as allocating new pmd page or splitting
huge page into normal page, it is necessary to firstly update all pte
entries, and then update pmd entry.
It is weak order with LoongArch system, there will be problem if other
VCPUs see pmd update firstly while ptes are not updated. Here smp_wmb()
is added to assure this.
Bibo Mao [Tue, 9 Jul 2024 08:25:51 +0000 (16:25 +0800)]
LoongArch: KVM: Discard dirty page tracking on readonly memslot
For readonly memslot such as UEFI BIOS or UEFI var space, guest cannot
write this memory space directly. So it is not necessary to track dirty
pages for readonly memslot. Here we make such optimization in function
kvm_arch_commit_memory_region().
Bibo Mao [Tue, 9 Jul 2024 08:25:51 +0000 (16:25 +0800)]
LoongArch: KVM: Select huge page only if secondary mmu supports it
Currently page level selection about secondary mmu depends on memory
slot and page level about host mmu. There will be problems if page level
of secondary mmu is zero already. Huge page cannot be selected if there is
normal page mapped in secondary mmu already, since it is not supported to
merge normal pages into huge pages now.
So page level selection should depend on the following three conditions.
1. Memslot is aligned for huge page and vm is not migrating.
2. Page level of host mmu is also huge page.
3. Page level of secondary mmu is suituable for huge page.
Bibo Mao [Tue, 9 Jul 2024 08:25:50 +0000 (16:25 +0800)]
LoongArch: KVM: Delay secondary mmu tlb flush until guest entry
With hardware assisted virtualization, there are two level HW mmu, one
is GVA to GPA mapping, the other is GPA to HPA mapping which is called
secondary mmu in generic. If there is page fault for secondary mmu,
there needs tlb flush operation indexed with fault GPA address and VMID.
VMID is stored at register CSR.GSTAT and will be reload or recalculated
before guest entry.
Currently CSR.GSTAT is not saved and restored during VCPU context
switch, instead it is recalculated during guest entry. So CSR.GSTAT is
effective only when a VCPU runs in guest mode, however it may not be
effective if the VCPU exits to host mode. Since register CSR.GSTAT may
be stale, it may records the VMID of the last schedule-out VCPU, rather
than the current VCPU.
Function kvm_flush_tlb_gpa() should be called with its real VMID, so
here move it to the guest entrance. Also an arch-specific request id
KVM_REQ_TLB_FLUSH_GPA is added to flush tlb for secondary mmu, and it
can be optimized if VMID is updated, since all guest tlb entries will
be invalid if VMID is updated.
Bibo Mao [Tue, 9 Jul 2024 08:25:50 +0000 (16:25 +0800)]
LoongArch: KVM: Sync pending interrupt when getting ESTAT from user mode
Currently interrupts are posted and cleared with the asynchronous mode,
meanwhile they are saved in SW state vcpu::arch::irq_pending and vcpu::
arch::irq_clear. When vcpu is ready to run, pending interrupt is written
back to CSR.ESTAT register from SW state vcpu::arch::irq_pending at the
guest entrance.
During VM migration stage, vcpu is put into stopped state, however
pending interrupts are not synced to CSR.ESTAT register. So there will
be interrupt lost when VCPU is migrated to another host machines.
Here in this patch when ESTAT CSR register is read from VMM user mode,
pending interrupts are synchronized to ESTAT also. So that VMM can get
correct pending interrupts.
As Marc pointed out on the list [*], this patch is wrong, and those who
find themselves in the SOB chain should have their heads checked.
Annoyingly, the architecture has some FGT trap bits that are negative
(i.e. 0 implies trap), and there was some confusion how KVM handles
this for nested guests. However, it is clear now that KVM honors the
RES0-ness of FGT traps already, meaning traps for features never exposed
to the guest hypervisor get handled at L0. As they should.
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A set of clk fixes for the Qualcomm, Mediatek, and Allwinner drivers:
- Fix the Qualcomm Stromer Plus PLL set_rate() clk_op to explicitly
set the alpha enable bit and not set bits that don't exist
- Mark Qualcomm IPQ9574 crypto clks as voted to avoid stuck clk
warnings
- Fix the parent of some PLLs on Qualcomm sm6530 so their rate is
correct
- Fix the min/max rate clamping logic in the Allwinner driver that
got broken in v6.9
- Limit runtime PM enabling in the Mediatek driver to only
mt8183-mfgcfg so that system wide resume doesn't break on other
Mediatek SoCs"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: mediatek: mt8183: Only enable runtime PM on mt8183-mfgcfg
clk: sunxi-ng: common: Don't call hw_to_ccu_common on hw without common
clk: qcom: gcc-ipq9574: Add BRANCH_HALT_VOTED flag
clk: qcom: apss-ipq-pll: remove 'config_ctl_hi_val' from Stromer pll configs
clk: qcom: clk-alpha-pll: set ALPHA_EN bit for Stromer Plus PLLs
clk: qcom: gcc-sm6350: Fix gpll6* & gpll7 parents
Merge tag 'powerpc-6.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix unnecessary copy to 0 when kernel is booted at address 0
- Fix usercopy crash when dumping dtl via debugfs
- Avoid possible crash when PCI hotplug races with error handling
- Fix kexec crash caused by scv being disabled before other CPUs
call-in
- Fix powerpc selftests build with USERCFLAGS set
Thanks to Anjali K, Ganesh Goudar, Gautam Menghani, Jinglin Wen,
Nicholas Piggin, Sourabh Jain, Srikar Dronamraju, and Vishal Chourasia.
* tag 'powerpc-6.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix build with USERCFLAGS set
powerpc/pseries: Fix scv instruction crash with kexec
powerpc/eeh: avoid possible crash when edev->pdev changes
powerpc/pseries: Whitelist dtl slub object for copying to userspace
powerpc/64s: Fix unnecessary copy to 0 when kernel is booted at address 0
Merge tag 'i2c-for-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
"An i2c driver fix"
* tag 'i2c-for-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pnx: Fix potential deadlock warning from del_timer_sync() call in isr
Currently building the powerpc selftests with USERCFLAGS set to anything
causes the build to break:
$ make -C tools/testing/selftests/powerpc V=1 USERCFLAGS=-Wno-error
...
gcc -Wno-error cache_shape.c ...
cache_shape.c:18:10: fatal error: utils.h: No such file or directory
18 | #include "utils.h"
| ^~~~~~~~~
compilation terminated.
This happens because the USERCFLAGS are added to CFLAGS in lib.mk, which
causes the check of CFLAGS in powerpc/flags.mk to skip setting CFLAGS at
all, resulting in none of the usual CFLAGS being passed. That can
be seen in the output above, the only flag passed to the compiler is
-Wno-error.
Fix it by dropping the conditional setting of CFLAGS in flags.mk.
Instead always set CFLAGS, but also append USERCFLAGS if they are set.
Note that appending to CFLAGS (with +=) wouldn't work, because flags.mk
is included by multiple Makefiles (to support partial builds), causing
CFLAGS to be appended to multiple times. Additionally that would place
the USERCFLAGS prior to the standard CFLAGS, meaning the USERCFLAGS
couldn't override the standard flags. Being able to override the
standard flags is desirable, for example for adding -Wno-error.
With the fix in place, the CFLAGS are set correctly, including the
USERCFLAGS:
Merge tag 'integrity-v6.10-fix' of ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity fix from Mimi Zohar:
"A single bug fix to properly remove all of the securityfs IMA
measurement lists"
* tag 'integrity-v6.10-fix' of ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: fix wrong zero-assignment during securityfs dentry remove
Merge tag 'pci-v6.10-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci update from Bjorn Helgaas:
- Update MAINTAINERS and CREDITS to credit Gustavo Pimentel with the
Synopsys DesignWare eDMA driver and reflect that he is no longer at
Synopsys and isn't in a position to maintain the DesignWare xData
traffic generator (Bjorn Helgaas)
* tag 'pci-v6.10-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
CREDITS: Add Synopsys DesignWare eDMA driver for Gustavo Pimentel
MAINTAINERS: Orphan Synopsys DesignWare xData traffic generator
Merge tag 'riscv-for-linus-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for the CMODX example in the recently added icache flushing
prctl()
- A fix to the perf driver to avoid corrupting event data on counter
overflows when external overflow handlers are in use
- A fix to clear all hardware performance monitor events on boot, to
avoid dangling events firmware or previously booted kernels from
triggering spuriously
- A fix to the perf event probing logic to avoid erroneously reporting
the presence of unimplemented counters. This also prevents some
implemented counters from being reported
- A build fix for the vector sigreturn selftest on clang
- A fix to ftrace, which now requires the previously optional index
argument to ftrace_graph_ret_addr()
- A fix to avoid deadlocking if kexec crash handling triggers in an
interrupt context
* tag 'riscv-for-linus-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: kexec: Avoid deadlock in kexec crash path
riscv: stacktrace: fix usage of ftrace_graph_ret_addr()
riscv: selftests: Fix vsetivli args for clang
perf: RISC-V: Check standard event availability
drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus
drivers/perf: riscv: Do not update the event data if uptodate
documentation: Fix riscv cmodx example
- xe: migration error handling + typoed register name in gt setup
- i915: usb-c fix to shut up warnings on MTL+
- panthor: fix sync-only jobs + ioctl validation fix to not EINVAL
wrongly
- panel quirks
- nouveau: NULL deref in get_modes
drm core:
- fbdev big endian fix for the dma memory backed variant
drivers/firmware:
- fix sysfb refcounting"
* tag 'drm-fixes-2024-07-05' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/mcr: Avoid clobbering DSS steering
drm/xe: fix error handling in xe_migrate_update_pgtables
drm/ttm: Always take the bo delayed cleanup path for imported bos
drm/fbdev-generic: Fix framebuffer on big endian devices
drm/panthor: Fix sync-only jobs
drm/panthor: Don't check the array stride on empty uobj arrays
drm/amdgpu/atomfirmware: silence UBSAN warning
drm/radeon: check bo_va->bo is non-NULL before using it
drm/amd/display: Fix array-index-out-of-bounds in dml2/FCLKChangeSupport
drm/amd/display: Update efficiency bandwidth for dcn351
drm/amd/display: Fix refresh rate range for some panel
drm/amd/display: Account for cursor prefetch BW in DML1 mode support
drm/amd/display: Add refresh rate range check
drm/amd/display: Reset freesync config before update new state
drm: panel-orientation-quirks: Add labels for both Valve Steam Deck revisions
drm: panel-orientation-quirks: Add quirk for Valve Galileo
drm/i915/display: For MTL+ platforms skip mg dp programming
drm/nouveau: fix null pointer dereference in nouveau_connector_get_modes
firmware: sysfb: Fix reference count of sysfb parent device
Merge tag 'gpio-fixes-for-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"Two OF lookup quirks and one fix for an issue in the generic gpio-mmio
driver:
- add two OF lookup quirks for TSC2005 and MIPS Lantiq
- don't try to figure out bgpio_bits from the 'ngpios' property in
gpio-mmio"
* tag 'gpio-fixes-for-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: of: add polarity quirk for TSC2005
gpio: mmio: do not calculate bgpio_bits via "ngpios"
gpiolib: of: fix lookup quirk for MIPS Lantiq
Merge tag 'tpmdd-next-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull TPM fixes from Jarkko Sakkinen:
"This contains the fixes for !chip->auth condition, preventing the
breakage of:
- tpm_ftpm_tee.c
- tpm_i2c_nuvoton.c
- tpm_ibmvtpm.c
- tpm_tis_i2c_cr50.c
- tpm_vtpm_proxy.c
All drivers will continue to work as they did in 6.9, except a single
warning (dev_warn() not WARN()) is printed to klog only to inform that
authenticated sessions are not enabled"
* tag 'tpmdd-next-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: Address !chip->auth in tpm_buf_append_hmac_session*()
tpm: Address !chip->auth in tpm_buf_append_name()
tpm: Address !chip->auth in tpm2_*_auth_session()
Wolfram Sang [Fri, 5 Jul 2024 14:08:55 +0000 (16:08 +0200)]
Merge tag 'i2c-host-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
This tag includes a nice fix in the PNX driver that has been
pending for a long time. Piotr has replaced a potential lock in
the interrupt context with a more efficient and straightforward
handling of the timeout signaling.
DTS for Nokia N900 incorrectly specifies "active high" polarity for
the reset line, while the chip documentation actually specifies it as
"active low". In the past the driver fudged gpiod API and inverted
the logic internally, but it was changed in d0d89493bff8.
Jarkko Sakkinen [Wed, 3 Jul 2024 15:47:46 +0000 (18:47 +0300)]
tpm: Address !chip->auth in tpm_buf_append_hmac_session*()
Unless tpm_chip_bootstrap() was called by the driver, !chip->auth can
cause a null derefence in tpm_buf_hmac_session*(). Thus, address
!chip->auth in tpm_buf_hmac_session*() and remove the fallback
implementation for !TCG_TPM2_HMAC.
Jarkko Sakkinen [Wed, 3 Jul 2024 15:33:14 +0000 (18:33 +0300)]
tpm: Address !chip->auth in tpm_buf_append_name()
Unless tpm_chip_bootstrap() was called by the driver, !chip->auth can
cause a null derefence in tpm_buf_append_name(). Thus, address
!chip->auth in tpm_buf_append_name() and remove the fallback
implementation for !TCG_TPM2_HMAC.
Jarkko Sakkinen [Wed, 3 Jul 2024 16:39:27 +0000 (19:39 +0300)]
tpm: Address !chip->auth in tpm2_*_auth_session()
Unless tpm_chip_bootstrap() was called by the driver, !chip->auth can cause
a null derefence in tpm2_*_auth_session(). Thus, address !chip->auth in
tpm2_*_auth_session().
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix folio refcounting when releasing them (encoded write, dummy
extent buffer)
- fix out of bounds read when checking qgroup inherit data
- fix how configurable chunk size is handled in zoned mode
- in the ref-verify tool, fix uninitialized return value when checking
extent owner ref and simple quota are not enabled
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
btrfs: fix folio refcount in btrfs_do_encoded_write()
btrfs: fix uninitialized return value in the ref-verify tool
btrfs: always do the basic checks for btrfs_qgroup_inherit structure
btrfs: zoned: fix calc_available_free_space() for zoned mode
Merge tag 'net-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, wireless and netfilter.
There's one fix for power management with Intel's e1000e here,
Thorsten tells us there's another problem that started in v6.9. We're
trying to wrap that up but I don't think it's blocking.
Current release - new code bugs:
- wifi: mac80211: disable softirqs for queued frame handling
- af_unix: fix uninit-value in __unix_walk_scc(), with the new
garbage collection algo
Previous releases - regressions:
- Bluetooth:
- qca: fix BT enable failure for QCA6390 after warm reboot
- add quirk to ignore reserved PHY bits in LE Extended Adv Report,
abused by some Broadcom controllers found on Apple machines
- wifi: wilc1000: fix ies_len type in connect path
Previous releases - always broken:
- tcp: fix DSACK undo in fast recovery to call tcp_try_to_open(),
avoid premature timeouts
- net: make sure skb_datagram_iter maps fragments page by page, in
case we somehow get compound highmem mixed in
- eth: bnx2x: fix multiple UBSAN array-index-out-of-bounds when more
queues are used
Misc:
- MAINTAINERS: Remembering Larry Finger"
* tag 'net-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
bnxt_en: Fix the resource check condition for RSS contexts
mlxsw: core_linecards: Fix double memory deallocation in case of invalid INI file
inet_diag: Initialize pad field in struct inet_diag_req_v2
tcp: Don't flag tcp_sk(sk)->rx_opt.saw_unknown for TCP AO.
selftests: make order checking verbose in msg_zerocopy selftest
selftests: fix OOM in msg_zerocopy selftest
ice: use proper macro for testing bit
ice: Reject pin requests with unsupported flags
ice: Don't process extts if PTP is disabled
ice: Fix improper extts handling
selftest: af_unix: Add test case for backtrack after finalising SCC.
af_unix: Fix uninit-value in __unix_walk_scc()
bonding: Fix out-of-bounds read in bond_option_arp_ip_targets_set()
net: rswitch: Avoid use-after-free in rswitch_poll()
netfilter: nf_tables: unconditionally flush pending work before notifier
wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference
wifi: iwlwifi: mvm: avoid link lookup in statistics
wifi: iwlwifi: mvm: don't wake up rx_sync_waitq upon RFKILL
wifi: iwlwifi: properly set WIPHY_FLAG_SUPPORTS_EXT_KEK_KCK
wifi: wilc1000: fix ies_len type in connect path
...
Merge tag 's390-6.10-8' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Fix and add physical to virtual address translations in dasd and
virtio_ccw drivers. For virtio_ccw this is just a minimal fix.
More code cleanup will follow.
- Small defconfig updates
* tag 's390-6.10-8' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/dasd: Fix invalid dereferencing of indirect CCW data pointer
s390/vfio_ccw: Fix target addresses of TIC CCWs
s390: Update defconfigs
Merge tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from, Andrew Morton:
"6 hotfies, all cc:stable. Some fixes for longstanding nilfs2 issues
and three unrelated MM fixes"
* tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: fix incorrect inode allocation from reserved inodes
nilfs2: add missing check for inode numbers on directory entries
nilfs2: fix inode number range checks
mm: avoid overflows in dirty throttling logic
Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
mm: optimize the redundant loop of mm_update_owner_next()
bnxt_en: Fix the resource check condition for RSS contexts
While creating a new RSS context, bnxt_rfs_capable() currently
makes a strict check to see if the required VNICs are already
available. If the current VNICs are not what is required,
either too many or not enough, it will call the firmware to
reserve the exact number required.
There is a bug in the firmware when the driver tries to
relinquish some reserved VNICs and RSS contexts. It will
cause the default VNIC to lose its RSS configuration and
cause receive packets to be placed incorrectly.
Workaround this problem by skipping the resource reduction.
The driver will not reduce the VNIC and RSS context reservations
when a context is deleted. The resources will be available for
use when new contexts are created later.
Potentially, this workaround can cause us to run out of VNIC
and RSS contexts if there are a lot of VF functions creating
and deleting RSS contexts. In the future, we will conditionally
disable this workaround when the firmware fix is available.
mlxsw: core_linecards: Fix double memory deallocation in case of invalid INI file
In case of invalid INI file mlxsw_linecard_types_init() deallocates memory
but doesn't reset pointer to NULL and returns 0. In case of any error
occurred after mlxsw_linecard_types_init() call, mlxsw_linecards_init()
calls mlxsw_linecard_types_fini() which performs memory deallocation again.
Add pointer reset to NULL.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Jakub Kicinski [Thu, 4 Jul 2024 14:31:54 +0000 (07:31 -0700)]
Merge tag 'wireless-2024-07-04' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.10
Hopefully the last fixes for v6.10. Fix a regression in wilc1000
where bitrate Information Elements longer than 255 bytes were broken.
Few fixes also to mac80211 and iwlwifi.
* tag 'wireless-2024-07-04' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference
wifi: iwlwifi: mvm: avoid link lookup in statistics
wifi: iwlwifi: mvm: don't wake up rx_sync_waitq upon RFKILL
wifi: iwlwifi: properly set WIPHY_FLAG_SUPPORTS_EXT_KEK_KCK
wifi: wilc1000: fix ies_len type in connect path
wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP
====================