Git Repo - linux.git/log

Merge tag 'kvm-x86-selftests-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM selftests changes for 6.13

- Enable XFAM-based features by default for all selftests VMs, which will
allow removing the "no AVX" restriction.

Merge tag 'kvm-x86-mmu-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.13

- Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes, and to simplify
   A/D-disabled MMUs by using the hardware-defined A/D bits to track if a
   PFN is Accessed and/or Dirty.

- Elide TLB flushes when aging SPTEs, as has been done in x86's primary
   MMU for over 10 years.

- Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when
   dirty logging is toggled off, which reduces the time it takes to disable
   dirty logging by ~3x.

- Recover huge pages in-place in the TDP MMU instead of zapping the SP
   and waiting until the page is re-accessed to create a huge mapping.
   Proactively installing huge pages can reduce vCPU jitter in extreme
   scenarios.

- Remove support for (poorly) reclaiming page tables in shadow MMUs via
   the primary MMU's shrinker interface.

Merge tag 'kvm-x86-generic-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 6.13

- Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two
   partial poasses over "all" vCPUs.  Opportunistically expand the comment
   to better explain the motivation and logic.

- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter
   long stalls due to having to wait for all CPUs become quiescent.

Merge tag 'kvm-s390-next-6.13-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

- second part of the ucontrol selftest
- cpumodel sanity check selftest
- gen17 cpumodel changes

KVM: s390: selftests: Add regression tests for PFCR subfunctions

Check if the PFCR query reported in userspace coincides with the
kernel reported function list. Right now we don't mask the functions
in the kernel so they have to be the same.

Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-by: Hariharan Mari <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: Added commit description]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107152319 [email protected]>

KVM: s390: add gen17 facilities to CPU model

Add gen17 facilities and let KVM_CAP_S390_VECTOR_REGISTERS handle
the enablement of the vector extension facilities.

Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107152319 [email protected]>

KVM: s390: add msa11 to cpu model

Message-security-assist 11 introduces pckmo subfunctions to encrypt
hmac keys.

Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107152319 [email protected]>

KVM: s390: add concurrent-function facility to cpu model

Adding support for concurrent-functions facility which provides
additional subfunctions.

Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107152319 [email protected]>

KVM: s390: selftests: correct IP.b length in uc_handle_sieic debug output

The length of the interrupt parameters (IP) are:
a: 2 bytes
b: 4 bytes

Signed-off-by: Christoph Schlameuss <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: Fixed patch prefix]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107141024 [email protected]>

KVM: s390: selftests: Fix whitespace confusion in ucontrol test

Checkpatch thinks that we're doing a multiplication but we're obviously
not. Fix 4 instances where we adhered to wrong checkpatch advice.

Signed-off-by: Christoph Schlameuss <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: Fixed patch prefix]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107141024 [email protected]>

KVM: s390: selftests: Verify reject memory region operations for ucontrol VMs

Add a test case verifying KVM_SET_USER_MEMORY_REGION and
KVM_SET_USER_MEMORY_REGION2 cannot be executed on ucontrol VMs.

Executing this test case on not patched kernels will cause a null
pointer dereference in the host kernel.
This is fixed with commit:
commit 7816e58967d0 ("kvm: s390: Reject memory region operations for ucontrol VMs")

Signed-off-by: Christoph Schlameuss <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: Fixed patch prefix]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107141024 [email protected]>

KVM: s390: selftests: Add uc_skey VM test case

Add a test case manipulating s390 storage keys from within the ucontrol
VM.

Storage key instruction (ISKE, SSKE and RRBE) intercepts and
Keyless-subset facility are disabled on first use, where the skeys are
setup by KVM in non ucontrol VMs.

Signed-off-by: Christoph Schlameuss <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Janosch Frank <[email protected]>
[[email protected]: Fixed patch prefix]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241108091620 [email protected]>

KVM: s390: selftests: Add uc_map_unmap VM test case

Add a test case verifying basic running and interaction of ucontrol VMs.
Fill the segment and page tables for allocated memory and map memory on
first access.

* uc_map_unmap
Store and load data to mapped and unmapped memory and use pic segment
translation handling to map memory on access.

Signed-off-by: Christoph Schlameuss <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Link:
https://lore.kernel.org/r/20241107141024 [email protected]
[[email protected]: Fixed patch prefix]
Signed-off-by: Janosch Frank <[email protected]>
Message-ID: <20241107141024 [email protected]>

Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.13

- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side

riscv: kvm: Fix out-of-bounds array access

In kvm_riscv_vcpu_sbi_init() the entry->ext_idx can contain an
out-of-bound index. This is used as a special marker for the base
extensions, that cannot be disabled. However, when traversing the
extensions, that special marker is not checked prior indexing the
array.

Add an out-of-bounds check to the function.

Fixes: 56d8a385b605 ("RISC-V: KVM: Allow some SBI extensions to be disabled by default")
Signed-off-by: Björn Töpel <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Fix APLIC in_clrip and clripnum write emulation

In the section "4.7 Precise effects on interrupt-pending bits"
of the RISC-V AIA specification defines that:

"If the source mode is Level1 or Level0 and the interrupt domain
is configured in MSI delivery mode (domaincfg.DM = 1):
The pending bit is cleared whenever the rectified input value is
low, when the interrupt is forwarded by MSI, or by a relevant
write to an in_clrip register or to clripnum."

Update the aplic_write_pending() to match the spec.

Fixes: d8dd9f113e16 ("RISC-V: KVM: Fix APLIC setipnum_le/be write emulation")
Signed-off-by: Yong-Xuan Wang <[email protected]>
Reviewed-by: Vincent Chen <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

KVM: x86/mmu: Drop per-VM zapped_obsolete_pages list

Drop the per-VM zapped_obsolete_pages list now that the usage from the
defunct mmu_shrinker is gone, and instead use a local list to track pages
in kvm_zap_obsolete_pages(), the sole remaining user of
zapped_obsolete_pages.

Opportunistically add an assertion to verify and document that slots_lock
must be held, i.e. that there can only be one active instance of
kvm_zap_obsolete_pages() at any given time, and by doing so also prove
that using a local list instead of a per-VM list doesn't change any
functionality (beyond trivialities like list initialization).

Signed-off-by: Vipin Sharma <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Remove KVM's MMU shrinker

Remove KVM's MMU shrinker and (almost) all of its related code, as the
current implementation is very disruptive to VMs (if it ever runs),
without providing any meaningful benefit[1].

Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages
from the per-vCPU caches[2], but given that no one has complained about
lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU
was enabled by default, it's safe to say that there is likely no real use
case for initiating reclaim of KVM's page tables from the shrinker.

And while clever/cute, reclaiming the per-vCPU caches doesn't scale the
same way that reclaiming in-use page table pages does. E.g. the amount of
memory being used by a VM doesn't always directly correlate with the
number vCPUs, and even when it does, reclaiming a few pages from per-vCPU
caches likely won't make much of a dent in the VM's total memory usage,
especially for VMs with huge amounts of memory.

Lastly, if it turns out that there is a strong use case for dropping the
per-vCPU caches, re-introducing the shrinker registration is trivial
compared to the complexity of actually reclaiming pages from the caches.

[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/kvm/20241004195540 [email protected]

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: David Matlack <[email protected]>
Signed-off-by: Vipin Sharma <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: keep zapped_obsolete_pages for now, massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging

WARN and bail out of recover_huge_pages_range() if dirty logging is
enabled. KVM shouldn't be recovering huge pages during dirty logging
anyway, since KVM needs to track writes at 4KiB. However it's not out of
the possibility that that changes in the future.

If KVM wants to recover huge pages during dirty logging, make_huge_spte()
must be updated to write-protect the new huge page mapping. Otherwise,
writes through the newly recovered huge page mapping will not be tracked.

Note that this potential risk did not exist back when KVM zapped to
recover huge page mappings, since subsequent accesses would just be
faulted in at PG_LEVEL_4K if dirty logging was enabled.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte()

Rename make_huge_page_split_spte() to make_small_spte(). This ensures
that the usage of "small_spte" and "huge_spte" are consistent between
make_huge_spte() and make_small_spte().

This should also reduce some confusion as make_huge_page_split_spte()
almost reads like it will create a huge SPTE, when in fact it is
creating a small SPTE to split the huge SPTE.

No functional change intended.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping

Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.

Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).

Recovering huge pages reduces the number of page faults when dirty
logging is disabled:

$ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

Before: 393,599 kvm:kvm_page_fault
After: 262,575 kvm:kvm_page_fault

vCPU throughput and the latency of disabling dirty-logging are about
equal compared to zapping, but avoiding faults can be beneficial to
remove vCPU jitter in extreme scenarios.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Refactor TDP MMU iter need resched check

Refactor the TDP MMU iterator "need resched" checks into a helper
function so they can be called from a different code path in a
subsequent commit.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rebase on a swapped order of checks]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Demote the WARN on yielded in xxx_cond_resched() to KVM_MMU_WARN_ON

Convert the WARN in tdp_mmu_iter_cond_resched() that the iterator hasn't
already yielded to a KVM_MMU_WARN_ON() so the code is compiled out for
production kernels (assuming production kernels disable KVM_PROVE_MMU).

Checking for a needed reschedule is a hot path, and KVM sanity checks
iter->yielded in several other less-hot paths, i.e. the odds of KVM not
flagging that something went sideways are quite low. Furthermore, the
odds of KVM not noticing *and* the WARN detecting something worth
investigating are even lower.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Check yielded_gfn for forward progress iff resched is needed

Swap the order of the checks in tdp_mmu_iter_cond_resched() so that KVM
checks to see if a resched is needed _before_ checking to see if yielding
must be disallowed to guarantee forward progress.  Iterating over TDP MMU
SPTEs is a hot path, e.g. tearing down a root can touch millions of SPTEs,
and not needing to reschedule is by far the common case.  On the other
hand, disallowing yielding because forward progress has not been made is a
very rare case.

Returning early for the common case (no resched), effectively reduces the
number of checks from 2 to 1 for the common case, and should make the code
slightly more predictable for the CPU.

To resolve a weird conundrum where the forward progress check currently
returns false, but the need resched check subtly returns iter->yielded,
which _should_ be false (enforced by a WARN), return false unconditionally
(which might also help make the sequence more predictable).  If KVM has a
bug where iter->yielded is left danging, continuing to yield is neither
right nor wrong, it was simply an artifact of how the original code was
written.

Unconditionally returning false when yielding is unnecessary or unwanted
will also allow extracting the "should resched" logic to a separate helper
in a future patch.

Cc: David Matlack <[email protected]>
Reviewed-by: James Houghton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Ensure KVM supports AVX for SEV-ES VMSA FPU test

Verify that KVM's supported XCR0 includes AVX (and earlier features) when
running the SEV-ES VMSA XSAVE test. In practice, the issue will likely
never pop up, since KVM support for AVX predates KVM support for SEV-ES,
but checking for KVM support makes the requirement more obvious.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Drop manual XCR0 configuration from SEV smoke test

Now that CR4.OSXSAVE and XCR0 are setup by default, drop the manual
enabling from the SEV smoke test that validates FPU state can be
transferred into the VMSA.

In guest_code_xsave(), explicitly set the Requested-Feature Bitmask (RFBM)
to exactly XFEATURE_MASK_X87_AVX instead of relying on the host side of
things to enable only X87_AVX features in guest XCR0. I.e. match the RFBM
for the host XSAVE.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Drop manual XCR0 configuration from state test

Now that CR4.OSXSAVE and XCR0 are setup by default, drop the manual
enabling from the state test, which is fully redundant with the default
behavior.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Drop manual XCR0 configuration from AMX test

Now that CR4.OSXSAVE and XCR0 are setup by default, drop the manual
enabling of OXSAVE and XTILE from the AMX test.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Drop manual CR4.OSXSAVE enabling from CR4/CPUID sync test

Now that CR4.OSXSAVE is enabled by default, drop the manual enabling from
CR4/CPUID sync test and instead assert that CR4.OSXSAVE is enabled.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Verify XCR0 can be "downgraded" and "upgraded"

Now that KVM selftests enable all supported XCR0 features by default, add
a testcase to the XCR0 vs. CPUID test to verify that the guest can disable
everything except the legacy FPU in XCR0, and then re-enable the full
feature set, which is kinda sorta what the test did before XCR0 was setup
by default.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Configure XCR0 to max supported value by default

To play nice with compilers generating AVX instructions, set CR4.OSXSAVE
and configure XCR0 by default when creating selftests vCPUs.  Some distros
have switched gcc to '-march=x86-64-v3' by default, and while it's hard to
find a CPU which doesn't support AVX today, many KVM selftests fail with

  ==== Test Assertion Failure ====
    lib/x86_64/processor.c:570: Unhandled exception in guest
    pid=72747 tid=72747 errno=4 - Interrupted system call
    Unhandled exception '0x6' at guest RIP '0x4104f7'

due to selftests not enabling AVX by default for the guest.  The failure
is easy to reproduce elsewhere with:

   $ make clean && CFLAGS='-march=x86-64-v3' make -j && ./x86_64/kvm_pv_test

E.g. gcc-13 with -march=x86-64-v3 compiles this chunk from selftests'
kvm_fixup_exception():

        regs->rip = regs->r11;
        regs->r9 = regs->vector;
        regs->r10 = regs->error_code;

into this monstronsity (which is clever, but oof):

  405313:       c4 e1 f9 6e c8          vmovq  %rax,%xmm1
  405318:       48 89 68 08             mov    %rbp,0x8(%rax)
  40531c:       48 89 e8                mov    %rbp,%rax
  40531f:       c4 c3 f1 22 c4 01       vpinsrq $0x1,%r12,%xmm1,%xmm0
  405325:       49 89 6d 38             mov    %rbp,0x38(%r13)
  405329:       c5 fa 7f 45 00          vmovdqu %xmm0,0x0(%rbp)

Alternatively, KVM selftests could explicitly restrict the compiler to
-march=x86-64-v2, but odds are very good that punting on AVX enabling will
simply result in tests that "need" AVX doing their own thing, e.g. there
are already three or so additional cleanups that can be done on top.

Reported-by: Vitaly Kuznetsov <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reviewed-and-tested-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Rework OSXSAVE CR4=>CPUID test to play nice with AVX insns

Rework the CR4/CPUID sync test to clear CR4.OSXSAVE, do CPUID, and restore
CR4.OSXSAVE in assembly, so that there is zero chance of AVX instructions
being executed while CR4.OSXSAVE is disabled. This will allow enabling
CR4.OSXSAVE by default for selftests vCPUs as a general means of playing
nice with AVX instructions.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Mask off OSPKE and OSXSAVE when comparing CPUID entries

Mask off OSPKE and OSXSAVE, which are toggled based on corresponding CR4
enabling bits, when comparing vCPU CPUID against KVM's supported CPUID.
This will allow setting OSXSAVE by default when creating vCPUs, without
causing test failures (KVM doesn't enumerate OSXSAVE=1).

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Precisely mask off dynamic fields in CPUID test

When comparing vCPU CPUID entries against KVM's supported CPUID, mask off
only the dynamic fields/bits instead of skipping the entire entry.
Precisely masking bits isn't meaningfully more difficult than skipping
entire entries, and will be necessary to maintain test coverage when a
future commit enables OSXSAVE by default, i.e. makes one bit in all of
CPUID.0x1 dynamic.

Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs

Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes
when zapping collapsible SPTEs, rather than freezing them first.

Freezing the SPTE first is not required. It is fine for another thread
holding mmu_lock for read to immediately install a present entry before
TLBs are flushed because the underlying mapping is not changing. vCPUs
that translate through the stale 4K mappings or a new huge page mapping
will still observe the same GPA->HPA translations.

KVM must only flush TLBs before dropping RCU (to avoid use-after-free of
the zapped page tables) and before dropping mmu_lock (to synchronize
with mmu_notifiers invalidating mappings).

In VMs backed with 2MiB pages, batching TLB flushes improves the time it
takes to zap collapsible SPTEs to disable dirty logging:

$ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

Before: Disabling dirty logging time: 14.334453428s (131072 flushes)
After: Disabling dirty logging time: 4.794969689s (76 flushes)

Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen
SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting
on the zapped mapping can now immediately install a new huge mapping and
proceed with guest execution.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level()

Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All
callers pass in PG_LEVEL_NUM, so @max_level can be replaced with
PG_LEVEL_NUM in the function body.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86: Don't emit TLB flushes when aging SPTEs for mmu_notifiers

Follow x86's primary MMU, which hasn't flushed TLBs when clearing Accessed
bits for 10+ years, and skip all TLB flushes when aging SPTEs in response
to a clear_flush_young() mmu_notifier event. As documented in x86's
ptep_clear_flush_young(), the probability and impact of "bad" reclaim due
to stale A-bit information is relatively low, whereas the performance cost
of TLB flushes is relatively high. I.e. the cost of flushing TLBs
outweighs the benefits.

On KVM x86, the cost of TLB flushes is even higher, as KVM doesn't batch
TLB flushes for mmu_notifier events (KVM's mmu_notifier contract with MM
makes it all but impossible), and sending IPIs forces all running vCPUs to
go through a VM-Exit => VM-Enter roundtrip.

Furthermore, MGLRU aging of secondary MMUs is expected to use flush-less
mmu_notifiers, i.e. flushing for the !MGLRU will make even less sense, and
will be actively confusing as it wouldn't be clear why KVM "needs" to
flush TLBs for legacy LRU aging, but not for MGLRU aging.

Cc: James Houghton <[email protected]>
Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: Allow arch code to elide TLB flushes when aging a young page

Add a Kconfig to allow architectures to opt-out of a TLB flush when a
young page is aged, as invalidating TLB entries is not functionally
required on most KVM-supported architectures. Stale TLB entries can
result in false negatives and theoretically lead to suboptimal reclaim,
but in practice all observations have been that the performance gained by
skipping TLB flushes outweighs any performance lost by reclaiming hot
pages.

E.g. the primary MMUs for x86 RISC-V, s390, and PPC Book3S elide the TLB
flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB
that's required for ordering (presumably because there are optimizations
related to eliding other TLB flushes when doing make-before-break).

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Set Dirty bit for new SPTEs, even if _hardware_ A/D bits are disabled

When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if
hardware A/D bits are disabled. Only EPT allows A/D bits to be disabled,
and for EPT, the bits are software-available (ignored by hardware) when
A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty
to track dirty pages in software.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Dedup logic for detecting TLB flushes on leaf SPTE changes

Now that the shadow MMU and TDP MMU have identical logic for detecting
required TLB flushes when updating SPTEs, move said logic to a helper so
that the TDP MMU code can benefit from the comments that are currently
exclusive to the shadow MMU.

No functional change intended.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found

Return immediately if a young SPTE is found when testing, but not updating,
SPTEs. The return value is a boolean, i.e. whether there is one young SPTE
or fifty is irrelevant (ignoring the fact that it's impossible for there to
be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU
roots).

Link: https://lore.kernel.org/r/[email protected]
[sean: use guard(rcu)(), as suggested by Paolo]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range

Skip invalid TDP MMU roots when aging a gfn range. There is zero reason
to process invalid roots, as they by definition hold stale information.
E.g. if a root is invalid because its from a previous memslot generation,
in the unlikely event the root has a SPTE for the gfn, then odds are good
that the gfn=>hva mapping is different, i.e. doesn't map to the hva that
is being aged by the primary MMU.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled

Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware,
i.e. propagate accessed information to SPTE.Accessed even when KVM is
doing manual tracking by making SPTEs not-present. In addition to
eliminating a small amount of code in is_accessed_spte(), this also paves
the way for preserving Accessed information when a SPTE is zapped in
response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped
because NUMA balancing kicks in.

Note, EPT is the only flavor of paging in which A/D bits are conditionally
enabled, and the Accessed (and Dirty) bit is software-available when A/D
bits are disabled.

Note #2, there are currently no concrete plans to preserve Accessed
information. Explorations on that front were the initial catalyst, but
the cleanup is the motivation for the actual commit.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled

Set shadow_dirty_mask to the architectural EPT Dirty bit value even if
A/D bits are disabled at the module level, i.e. even if KVM will never
enable A/D bits in hardware. Doing so provides consistent behavior for
Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets
shadow_accessed_mask but not shadow_dirty_mask.

Functionally, this should be one big nop, as consumption of
shadow_dirty_mask is always guarded by a check that hardware A/D bits are
enabled.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled

Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D
bits are enabled, set shadow_accessed_mask for EPT even when A/D bits
are disabled in hardware. This will allow using shadow_accessed_mask for
software purposes, e.g. to preserve accessed status in a non-present SPTE
acros NUMA balancing, if something like that is ever desirable.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled

Add a dedicated flag to track if KVM has enabled A/D bits at the module
level, instead of inferring the state based on whether or not the MMU's
shadow_accessed_mask is non-zero. This will allow defining and using
shadow_accessed_mask even when A/D bits aren't used by hardware.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable

Do a remote TLB flush if installing a leaf SPTE overwrites an existing
leaf SPTE (with the same target pfn, which is enforced by a BUG() in
handle_changed_spte()) and clears the MMU-Writable bit. Since the TDP MMU
passes ACC_ALL to make_spte(), i.e. always requests a Writable SPTE, the
only scenario in which make_spte() should create a !MMU-Writable SPTE is
if the gfn is write-tracked or if KVM is prefetching a SPTE.

When write-protecting for write-tracking, KVM must hold mmu_lock for write,
i.e. can't race with a vCPU faulting in the SPTE. And when prefetching a
SPTE, the TDP MMU takes care to avoid clobbering a shadow-present SPTE,
i.e. it should be impossible to replace a MMU-writable SPTE with a
!MMU-writable SPTE when handling a TDP MMU fault.

Cc: David Matlack <[email protected]>
Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Fold mmu_spte_update_no_track() into mmu_spte_update()

Fold the guts of mmu_spte_update_no_track() into mmu_spte_update() now
that the latter doesn't flush when clearing A/D bits, i.e. now that there
is no need to explicitly avoid TLB flushes when aging SPTEs.

Opportunistically WARN if mmu_spte_update() requests a TLB flush when
aging SPTEs, as aging should never modify a SPTE in such a way that KVM
thinks a TLB flush is needed.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Drop ignored return value from kvm_tdp_mmu_clear_dirty_slot()

Drop the return value from kvm_tdp_mmu_clear_dirty_slot() as its sole
caller ignores the result (KVM flushes after clearing dirty logs based on
the logs themselves, not based on SPTEs).

Cc: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Don't flush TLBs when clearing Dirty bit in shadow MMU

Don't force a TLB flush when an SPTE update in the shadow MMU happens to
clear the Dirty bit, as KVM unconditionally flushes TLBs when enabling
dirty logging, and when clearing dirty logs, KVM flushes based on its
software structures, not the SPTEs. I.e. the flows that care about
accurate Dirty bit information already ensure there are no stale TLB
entries.

Opportunistically drop is_dirty_spte() as mmu_spte_update() was the sole
caller.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit

Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as
access tracking tolerates false negatives, as evidenced by the
mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB
flush.

In practice, this is very nearly a nop.  spte_write_protect() and
spte_clear_dirty() never clear the Accessed bit.  make_spte() always
sets the Accessed bit for !prefetch scenarios.  FNAME(sync_spte) only sets
SPTE if the protection bits are changing, i.e. if a flush will be needed
regardless of the Accessed bits.  And FNAME(pte_prefetch) sets SPTE if and
only if the old SPTE is !PRESENT.

That leaves kvm_arch_async_page_ready() as the one path that will generate
a !ACCESSED SPTE *and* overwrite a PRESENT SPTE.  And that's very arguably
a bug, as clobbering a valid SPTE in that case is nonsensical.

Tested-by: Alex Bennée <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Fold all of make_spte()'s writable handling into one if-else

Now that make_spte() no longer uses a funky goto to bail out for a special
case of its unsync handling, combine all of the unsync vs. writable logic
into a single if-else statement.

No functional change intended.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Always set SPTE's dirty bit if it's created as writable

When creating a SPTE, always set the Dirty bit if the Writable bit is set,
i.e. if KVM is creating a writable mapping. If two (or more) vCPUs are
racing to install a writable SPTE on a !PRESENT fault, only the "winning"
vCPU will create a SPTE with W=1 and D=1, all "losers" will generate a
SPTE with W=1 && D=0.

As a result, tdp_mmu_map_handle_target_level() will fail to detect that
the losing faults are effectively spurious, and will overwrite the D=1
SPTE with a D=0 SPTE. For normal VMs, overwriting a present SPTE is a
small performance blip; KVM blasts a remote TLB flush, but otherwise life
goes on.

For upcoming TDX VMs, overwriting a present SPTE is much more costly, and
can even lead to the VM being terminated if KVM isn't careful, e.g. if KVM
attempts TDH.MEM.PAGE.AUG because the TDX code doesn't detect that the
new SPTE is actually the same as the old SPTE (which would be a bug in its
own right).

Suggested-by: Sagi Shahar <[email protected]>
Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Flush remote TLBs iff MMU-writable flag is cleared from RO SPTE

Don't force a remote TLB flush if KVM happens to effectively "refresh" a
read-only SPTE that is still MMU-Writable, as KVM allows MMU-Writable SPTEs
to have Writable TLB entries, even if the SPTE is !Writable. Remote TLBs
need to be flushed only when creating a read-only SPTE for write-tracking,
i.e. when installing a !MMU-Writable SPTE.

In practice, especially now that KVM doesn't overwrite existing SPTEs when
prefetching, KVM will rarely "refresh" a read-only, MMU-Writable SPTE,
i.e. this is unlikely to eliminate many, if any, TLB flushes. But, more
precisely flushing makes it easier to understand exactly when KVM does and
doesn't need to flush.

Note, x86 architecturally requires relevant TLB entries to be invalidated
on a page fault, i.e. there is no risk of putting a vCPU into an infinite
loop of read-only page faults.

Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: Protect vCPU's "last run PID" with rwlock, not RCU

To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead
of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to a
vCPU. When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to
run SEV migration helper vCPUs during post-copy, the synchronize_rcu()
needed to change the PID associated with the vCPU can stall for hundreds
of milliseconds, which is problematic for latency sensitive post-copy
operations.

In the directed yield path, do not acquire the lock if it's contended,
i.e. if the associated PID is changing, as that means the vCPU's task is
already running.

Reported-by: Steve Rutherford <[email protected]>
Reviewed-by: Steve Rutherford <[email protected]>
Acked-by: Oliver Upton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: Return '0' directly when there's no task to yield to

Do "return 0" instead of initializing and returning a local variable in
kvm_vcpu_yield_to(), e.g. so that it's more obvious what the function
returns if there is no task.

No functional change intended.

Acked-by: Oliver Upton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop

Rework kvm_vcpu_on_spin() to use a single for-loop instead of making "two"
passes over all vCPUs. Given N=kvm->last_boosted_vcpu, the logic is to
iterate from vCPU[N+1]..vcpu[N-1], i.e. using two loops is just a kludgy
way of handling the wrap from the last vCPU to vCPU0 when a boostable vCPU
isn't found in vcpu[N+1]..vcpu[MAX].

Open code the xa_load() instead of using kvm_get_vcpu() to avoid reading
online_vcpus in every loop, as well as the accompanying smp_rmb(), i.e.
make it a custom kvm_for_each_vcpu(), for all intents and purposes.

Oppurtunistically clean up the comment explaining the logic.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Use ARRAY_SIZE for array length

Use of macro ARRAY_SIZE to calculate array size minimizes
the redundant code and improves code reusability.

./tools/testing/selftests/kvm/x86_64/debug_regs.c:169:32-33: WARNING: Use ARRAY_SIZE.

Reported-by: Abaci Robot <[email protected]>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10847
Signed-off-by: Jiapeng Chong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Remove unused macro in the hardware disable test

The macro GUEST_CODE_PIO_PORT is never referenced in the code,
just remove it.

Signed-off-by: Ba Jing <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

RISC-V: KVM: Use NACL HFENCEs for KVM request based HFENCEs

When running under some other hypervisor, use SBI NACL based HFENCEs
for TLB shoot-down via KVM requests. This makes HFENCEs faster whenever
SBI nested acceleration is available.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Save trap CSRs in kvm_riscv_vcpu_enter_exit()

Save trap CSRs in the kvm_riscv_vcpu_enter_exit() function instead of
the kvm_arch_vcpu_ioctl_run() function so that HTVAL and HTINST CSRs
are accessed in more optimized manner while running under some other
hypervisor.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Use SBI sync SRET call when available

Implement an optimized KVM world-switch using SBI sync SRET call
when SBI nested acceleration extension is available. This improves
KVM world-switch when KVM RISC-V is running as a Guest under some
other hypervisor.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Use nacl_csr_xyz() for accessing AIA CSRs

When running under some other hypervisor, prefer nacl_csr_xyz()
for accessing AIA CSRs in the run-loop. This makes CSR access
faster whenever SBI nested acceleration is available.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Use nacl_csr_xyz() for accessing H-extension CSRs

When running under some other hypervisor, prefer nacl_csr_xyz()
for accessing H-extension CSRs in the run-loop. This makes CSR
access faster whenever SBI nested acceleration is available.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Add common nested acceleration support

Add a common nested acceleration support which will be shared by
all parts of KVM RISC-V. This nested acceleration support detects
and enables SBI NACL extension usage based on static keys which
ensures minimum impact on the non-nested scenario.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: Add defines for the SBI nested acceleration extension

Add defines for the new SBI nested acceleration extension which was
ratified as part of the SBI v2.0 specification.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Don't setup SGEI for zero guest external interrupts

No need to setup SGEI local interrupt when there are zero guest
external interrupts (i.e. zero HW IMSIC guest files).

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Replace aia_set_hvictl() with aia_hvictl_value()

The aia_set_hvictl() internally writes the HVICTL CSR which makes
it difficult to optimize the CSR write using SBI NACL extension for
kvm_riscv_vcpu_aia_update_hvip() function so replace aia_set_hvictl()
with new aia_hvictl_value() which only computes the HVICTL value.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Break down the __kvm_riscv_switch_to() into macros

Break down the __kvm_riscv_switch_to() function into macros so that
these macros can be later re-used by SBI NACL extension based low-level
switch function.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Save/restore SCOUNTEREN in C source

The SCOUNTEREN CSR need not be saved/restored in the low-level
__kvm_riscv_switch_to() function hence move the SCOUNTEREN CSR
save/restore to the kvm_riscv_vcpu_swap_in_guest_state() and
kvm_riscv_vcpu_swap_in_host_state() functions in C sources.

Also, re-arrange the CSR save/restore and related GPR usage in
the low-level __kvm_riscv_switch_to() low-level function.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Save/restore HSTATUS in C source

We will be optimizing HSTATUS CSR access via shared memory setup
using the SBI nested acceleration extension. To facilitate this,
we first move HSTATUS save/restore in kvm_riscv_vcpu_enter_exit().

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

RISC-V: KVM: Order the object files alphabetically

Order the object files alphabetically in the Makefile so that
it is very predictable inserting new object files in the future.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anup Patel <[email protected]>

riscv: KVM: add basic support for host vs guest profiling

For the information collected on the host side, we need to
identify which data originates from the guest and record
these events separately, this can be achieved by having
KVM register perf callbacks.

Signed-off-by: Quan Zhou <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Link: https://lore.kernel.org/r/00342d535311eb0629b9ba4f1e457a48e2abee33.1728957131.git.zhouquan@iscas.ac.cn
Signed-off-by: Anup Patel <[email protected]>

riscv: perf: add guest vs host distinction

Introduce basic guest support in perf, enabling it to distinguish
between PMU interrupts in the host or guest, and collect
fundamental information.

Signed-off-by: Quan Zhou <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Link: https://lore.kernel.org/r/a67d527dc1b11493fe11f7f53584772fdd983744.1728957131.git.zhouquan@iscas.ac.cn
Signed-off-by: Anup Patel <[email protected]>

Linux 6.12-rc5

Merge tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

- Prevent a certain range of pages which get marked as hypervisor-only,
   to get allocated to a CoCo (SNP) guest which cannot use them and thus
   fail booting

- Fix the microcode loader on AMD to pay attention to the stepping of a
   patch and to handle the case where a BIOS config option splits the
   machine into logical NUMA nodes per L3 cache slice

- Disable LAM from being built by default due to security concerns

* tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Ensure that RMP table fixups are reserved
  x86/microcode/AMD: Split load_microcode_amd()
  x86/microcode/AMD: Pay attention to the stepping dynamically
  x86/lam: Disable ADDRESS_MASKING in most cases

Merge tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fixes from Steven Rostedt:

- Fix missing mutex unlock in error path of register_ftrace_graph()

   A previous fix added a return on an error path and forgot to unlock
   the mutex. Instead of dealing with error paths, use guard(mutex) as
   the mutex is just released at the exit of the function anyway. Other
   functions in this file should be updated with this, but that's a
   cleanup and not a fix.

- Change cpuhp setup name to be consistent with other cpuhp states

   The same fix that the above patch fixes added a cpuhp_setup_state()
   call with the name of "fgraph_idle_init". I was informed that it
   should instead be something like: "fgraph:online". Update that too.

* tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Change the name of cpuhp state to "fgraph:online"
  fgraph: Fix missing unlock in register_ftrace_graph()

Merge tag 'platform-drivers-x86-v6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:

- Asus thermal profile fix, fixing performance issues on Lunar Lake

- Intel PMC: one revert for a lockdep issue and one bugfix

- Dell WMI: Ignore some WMI events on suspend/resume to silence warnings

* tag 'platform-drivers-x86-v6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: asus-wmi: Fix thermal profile initialization
  platform/x86: dell-wmi: Ignore suspend notifications
  platform/x86/intel/pmc: Fix pmc_core_iounmap to call iounmap for valid addresses
  platform/x86:intel/pmc: Revert "Enable the ACPI PM Timer to be turned off when suspended"

Merge tag 'firewire-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fix from Takashi Sakamoto:
"A single commit to resolve a regression existing in v6.11 or later.

  The change in 1394 OHCI driver in v6.11 kernel could cause general
  protection faults when rediscovering nodes in IEEE 1394 bus while
  holding a spin lock. Consequently, watchdog checks can report a hard
  lockup.

  Currently, this issue is observed primarily during the system resume
  phase when using an extra node with three ports or more is used.
  However, it could potentially occur in the other cases as well"

* tag 'firewire-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: fix invalid port index for parent device

Merge tag 'block-6.12-20241026' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- Pull request for MD via Song fixing a few issues

- Fix a wrong check in blk_rq_map_user_bvec(), causing IO errors on
   passthrough IO (Xinyu)

* tag 'block-6.12-20241026' of git://git.kernel.dk/linux:
  block: fix sanity checks in blk_rq_map_user_bvec
  md/raid10: fix null ptr dereference in raid10_size()
  md: ensure child flush IO does not affect origin bio->bi_status

Merge tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

- Fix recovery of allocator ops after a growfs

- Do not fail repairs on metadata files with no attr fork

* tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: update the pag for the last AG at recovery time
  xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
  xfs: error out when a superblock buffer update reduces the agcount
  xfs: update the file system geometry after recoverying superblock buffers
  xfs: merge the perag freeing helpers
  xfs: pass the exact range to initialize to xfs_initialize_perag
  xfs: don't fail repairs on metadata files with no attr fork

firewire: core: fix invalid port index for parent device

In a commit 24b7f8e5cd65 ("firewire: core: use helper functions for self
ID sequence"), the enumeration over self ID sequence was refactored with
some helper functions with KUnit tests. These helper functions are
guaranteed to work expectedly by the KUnit tests, however their application
includes a mistake to assign invalid value to the index of port connected
to parent device.

This bug affects the case that any extra node devices which has three or
more ports are connected to 1394 OHCI controller. In the case, the path
to update the tree cache could hits WARN_ON(), and gets general protection
fault due to the access to invalid address computed by the invalid value.

This commit fixes the bug to assign correct port index.

Cc: [email protected]
Reported-by: Edmund Raile <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Fixes: 24b7f8e5cd65 ("firewire: core: use helper functions for self ID sequence")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Sakamoto <[email protected]>

platform/x86: asus-wmi: Fix thermal profile initialization

When support for vivobook fan profiles was added, the initial
call to throttle_thermal_policy_set_default() was removed, which
however is necessary for full initialization.

Fix this by calling throttle_thermal_policy_set_default() again
when setting up the platform profile.

Fixes: bcbfcebda2cb ("platform/x86: asus-wmi: add support for vivobook fan profiles")
Reported-by: Michael Larabel <[email protected]>
Closes: https://www.phoronix.com/review/lunar-lake-xe2/5
Signed-off-by: Armin Wolf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>

Merge tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux

Pull more 9p reverts from Dominique Martinet:
"Revert patches causing inode collision problems.

  The code simplification introduced significant regressions on servers
  that do not remap inode numbers when exporting multiple underlying
  filesystems with colliding inodes. See the top-most revert (commit
  be2ca3825372) for details.

  This problem had been ignored for too long and the reverts will also
  head to stable (6.9+).

  I'm confident this set of patches gets us back to previous behaviour
  (another related patch had already been reverted back in April and
  we're almost back to square 1, and the rest didn't touch inode
  lifecycle)"

* tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux:
  Revert "fs/9p: simplify iget to remove unnecessary paths"
  Revert "fs/9p: fix uaf in in v9fs_stat2inode_dotl"
  Revert "fs/9p: remove redundant pointer v9ses"
  Revert " fs/9p: mitigate inode collisions"

Merge tag 'v6.12-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Fix init module error caseb

- Fix memory allocation error path (for passwords) in mount

* tag 'v6.12-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix warning when destroy 'cifs_io_request_pool'
smb: client: Handle kstrdup failures for passwords

Merge tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:

- Fix cached size after passthrough writes

   This fix needed a trivial change in the backing-file API, which
   resulted in some non-fuse files being touched.

- Revert a commit meant as a cleanup but which triggered a WARNING

- Remove a stray debug line left-over

* tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: remove stray debug line
  Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback"
  fuse: update inode size after extending passthrough write
  fs: pass offset and result to backing_file end_write() callback

Merge tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

- Fix a couple of use-after-free bugs

* tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
nfsd: fix race between laundromat and free_stateid

Merge tag 'acpi-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These fix an ACPI PRM (Platform Runtime Mechanism) issue and add two
  new DMI quirks, one for an ACPI IRQ override and one for lid switch
  detection:

   - Make acpi_parse_prmt() look for EFI_MEMORY_RUNTIME memory regions
     only to comply with the UEFI specification and make PRM use
     efi_guid_t instead of guid_t to avoid a compiler warning triggered
     by that change (Koba Ko, Dan Carpenter)

   - Add an ACPI IRQ override quirk for LG 16T90SP (Christian Heusel)

   - Add a lid switch detection quirk for Samsung Galaxy Book2 (Shubham
     Panwar)"

* tag 'acpi-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PRM: Clean up guid type in struct prm_handler_info
  ACPI: button: Add DMI quirk for Samsung Galaxy Book2 to fix initial lid detection issue
  ACPI: resource: Add LG 16T90SP to irq1_level_low_skip_override[]
  ACPI: PRM: Find EFI_MEMORY_RUNTIME block for PRM handler and context

Merge tag 'pm-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"Update cpufreq documentation to match the code after recent changes
  (Christian Loehle), fix a units conversion issue in the CPPC cpufreq
  driver (liwei), and fix an error check in the dtpm_devfreq power
  capping driver (Yuan Can)"

* tag 'pm-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: CPPC: fix perf_to_khz/khz_to_perf conversion exception
  powercap: dtpm_devfreq: Fix error check against dev_pm_qos_add_request()
  cpufreq: docs: Reflect latency changes in docs

Merge tag 'pci-v6.12-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci fixes from Bjorn Helgaas:

- Hold the rescan lock while adding devices to avoid race with
   concurrent pwrctl rescan that can lead to a crash (Bartosz
   Golaszewski)

- Avoid binding pwrctl driver to QCom WCN wifi if the DT lacks the
   necessary PMU regulator descriptions (Bartosz Golaszewski)

* tag 'pci-v6.12-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI/pwrctl: Abandon QCom WCN probe on pre-pwrseq device-trees
  PCI: Hold rescan lock while adding devices during host probe

Merge tag 'fbdev-for-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes from Helge Deller:

- Fix some build warnings and failures with CONFIG_FB_IOMEM_FOPS and
   CONFIG_FB_DEVICE

- Remove the da8xx fbdev driver

- Constify struct sbus_mmap_map and fix indentation warning

* tag 'fbdev-for-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbdev: wm8505fb: select CONFIG_FB_IOMEM_FOPS
  fbdev: da8xx: remove the driver
  fbdev: Constify struct sbus_mmap_map
  fbdev: nvidiafb: fix inconsistent indentation warning
  fbdev: sstfb: Make CONFIG_FB_DEVICE optional

Merge tag 'gpio-fixes-for-v6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fix from Bartosz Golaszewski:
"Update MAINTAINERS with a keyword pattern for legacy GPIO API

  The goal is to alert us to anyone trying to use the deprecated, legacy
  API (this happens almost every release)"

* tag 'gpio-fixes-for-v6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  MAINTAINERS: add a keyword entry for the GPIO subsystem

Merge tag 'ata-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fix from Niklas Cassel:

- Fix the handling of ATA commands that timeout (command that did not
   receive a completion interrupt within the configured timeout time).

   Commands that timeout, while also having either the FAILFAST flag
   set, or the command being a passthrough command, should never be
   retried. Restore this behavior (as it was before v6.12-rc1).

* tag 'ata-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata: Set DID_TIME_OUT for commands that actually timed out

Merge branch 'kvm-no-struct-page' into HEAD

TL;DR: Eliminate KVM's long-standing (and heinous) behavior of essentially
guessing which pfns are refcounted pages (see kvm_pfn_to_refcounted_page()).

Getting there requires "fixing" arch code that isn't obviously broken.
Specifically, to get rid of kvm_pfn_to_refcounted_page(), KVM needs to
stop marking pages/folios dirty/accessed based solely on the pfn that's
stored in KVM's stage-2 page tables.

Instead of tracking which SPTEs correspond to refcounted pages, simply
remove all of the code that operates on "struct page" based ona the pfn
in stage-2 PTEs. This is the back ~40-50% of the series.

For x86 in particular, which sets accessed/dirty status when that info
would be "lost", e.g. when SPTEs are zapped or KVM clears the dirty flag
in a SPTE, foregoing the updates provides very measurable performance
improvements for related operations. E.g. when clearing dirty bits as
part of dirty logging, and zapping SPTEs to reconstitue huge pages when
disabling dirty logging.

The front ~40% of the series is cleanups and prep work, and most of it is
x86 focused (purely because x86 added the most special cases, *sigh*).
E.g. several of the inputs to hva_to_pfn() (and it's myriad wrappers),
can be removed by cleaning up and deduplicating x86 code.

Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'sound-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"The majority of changes here are about ASoC.

  There are two core changes in ASoC (the bump of minimal topology ABI
  version and the fix for references of components in DAPM code), and
  others are mostly various device-specific fixes for SoundWire, AMD,
  Intel, SOF, Qualcomm and FSL, in addition to a few usual HD-audio
  quirks and fixes"

* tag 'sound-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (33 commits)
  ALSA: hda/realtek: Update default depop procedure
  ASoC: qcom: sc7280: Fix missing Soundwire runtime stream alloc
  ASoC: fsl_micfil: Add sample rate constraint
  ASoC: rt722-sdca: increase clk_stop_timeout to fix clock stop issue
  ALSA: hda/tas2781: select CRC32 instead of CRC32_SARWATE
  ALSA: hda/realtek: Add subwoofer quirk for Acer Predator G9-593
  ALSA: firewire-lib: Avoid division by zero in apply_constraint_to_size()
  ASoC: fsl_micfil: Add a flag to distinguish with different volume control types
  ASoC: codecs: lpass-rx-macro: fix RXn(rx,n) macro for DSM_CTL and SEC7 regs
  ASoC: Change my e-mail to gmail
  ASoC: Intel: soc-acpi: lnl: Add match entry for TM2 laptops
  ASoC: amd: yc: Fix non-functional mic on ASUS E1404FA
  ASoC: SOF: Intel: hda: Always clean up link DMA during stop
  soundwire: intel_ace2x: Send PDI stream number during prepare
  ASoC: SOF: Intel: hda: Handle prepare without close for non-HDA DAI's
  ASoC: SOF: ipc4-topology: Do not set ALH node_id for aggregated DAIs
  MAINTAINERS: Update maintainer list for MICROCHIP ASOC, SSC and MCP16502 drivers
  ASoC: qcom: Select missing common Soundwire module code on SDM845
  ASoC: fsl_esai: change dev_warn to dev_dbg in irq handler
  ASoC: rsnd: Fix probe failure on HiHope boards due to endpoint parsing
  ...

Merge tag 'drm-fixes-2024-10-25' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly drm fixes, mostly amdgpu and xe, with minor bridge and an i915
  Kconfig fix. Nothing too scary and it seems to be pretty quiet.

  amdgpu:
   - ACPI method handling fixes
   - SMU 14.x fixes
   - Display idle optimization fix
   - DP link layer compliance fix
   - SDMA 7.x fix
   - PSR-SU fix
   - SWSMU fix

  i915:
   - Fix DRM_I915_GVT_KVMGT dependencies in Kconfig

  xe:
   - Increase invalidation timeout to avoid errors in some hosts
   - Flush worker on timeout
   - Better handling for force wake failure
   - Improve argument check on user fence creation
   - Don't restart parallel queues multiple times on GT reset

  bridge:
   - aux: Fix assignment of OF node
   - tc358767: Add missing of_node_put() in error path"

* tag 'drm-fixes-2024-10-25' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe: Don't restart parallel queues multiple times on GT reset
  drm/xe/ufence: Prefetch ufence addr to catch bogus address
  drm/xe: Handle unreliable MMIO reads during forcewake
  drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout
  drm/xe: Enlarge the invalidation timeout from 150 to 500
  drm/amdgpu: handle default profile on on devices without fullscreen 3D
  drm/amd/display: Disable PSR-SU on Parade 08-01 TCON too
  drm/amdgpu: fix random data corruption for sdma 7
  drm/amd/display: temp w/a for DP Link Layer compliance
  drm/amd/display: temp w/a for dGPU to enter idle optimizations
  drm/amd/pm: update deep sleep status on smu v14.0.2/3
  drm/amd/pm: update overdrive function on smu v14.0.2/3
  drm/amd/pm: update the driver-fw interface file for smu v14.0.2/3
  drm/amd: Guard against bad data for ATIF ACPI method
  drm/bridge: tc358767: fix missing of_node_put() in for_each_endpoint_of_node()
  drm/bridge: Fix assignment of the of_node of the parent to aux bridge
  i915: fix DRM_I915_GVT_KVMGT dependencies

KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page"

Now that KVM no longer relies on an ugly heuristic to find its struct page
references, i.e. now that KVM can't get false positives on VM_MIXEDMAP
pfns, remove KVM's hack to elevate the refcount for pfns that happen to
have a valid struct page. In addition to removing a long-standing wart
in KVM, this allows KVM to map non-refcounted struct page memory into the
guest, e.g. for exposing GPU TTM buffers to KVM guests.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Drop APIs that manipulate "struct page" via pfns

Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone.

No functional change intended.

Reviewed-by: Alex Bennée <[email protected]>
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: arm64: Don't mark "struct page" accessed when making SPTE young

Don't mark pages/folios as accessed in the primary MMU when making a SPTE
young in KVM's secondary MMU, as doing so relies on
kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and
wasteful. KVM participates in page aging via mmu_notifiers, so there's no
need to push "accessed" updates to the primary MMU.

Dropping use of kvm_set_pfn_accessed() also paves the way for removing
kvm_pfn_to_refcounted_page() and all its users.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs

Don't mark pages/folios as accessed in the primary MMU when zapping SPTEs,
as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking
is unnecessary and wasteful. KVM participates in page aging via
mmu_notifiers, so there's no need to push "accessed" updates to the
primary MMU.

And if KVM zaps a SPTe in response to an mmu_notifier, marking it accessed
_after_ the primary MMU has decided to zap the page is likely to go
unnoticed, i.e. odds are good that, if the page is being zapped for
reclaim, the page will be swapped out regardless of whether or not KVM
marks the page accessed.

Dropping x86's use of kvm_set_pfn_accessed() also paves the way for
removing kvm_pfn_to_refcounted_page() and all its users.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>