Git Repo - linux.git/log

KVM: VMX: Rename "vmx/evmcs.{ch}" to "vmx/hyperv.{ch}"

To conform with SVM, rename VMX specific Hyper-V files from "evmcs.{ch}"
to "hyperv.{ch}". While Enlightened VMCS is a lion's share of these
files, some stuff (e.g. enlightened MSR bitmap, the upcoming Hyper-V
L2 TLB flush, ...) goes beyond that.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20221101145426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Rename 'enable_direct_tlbflush' to 'enable_l2_tlb_flush'

To make terminology between Hyper-V-on-KVM and KVM-on-Hyper-V consistent,
rename 'enable_direct_tlbflush' to 'enable_l2_tlb_flush'. The change
eliminates the use of confusing 'direct' and adds the missing underscore.

No functional change.

Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20221101145426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/hyperv: KVM: Rename "hv_enlightenments" to "hv_vmcb_enlightenments"

Now that KVM isn't littered with "struct hv_enlightenments" casts, rename
the struct to "hv_vmcb_enlightenments" to highlight the fact that the
struct is specifically for SVM's VMCB.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20221101145426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Add a proper field for Hyper-V VMCB enlightenments

Add a union to provide hv_enlightenments side-by-side with the sw_reserved
bytes that Hyper-V's enlightenments overlay. Casting sw_reserved
everywhere is messy, confusing, and unnecessarily unsafe.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20221101145426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Move "struct hv_enlightenments" to x86_64/svm.h

Move Hyper-V's VMCB "struct hv_enlightenments" to the svm.h header so
that the struct can be referenced in "struct vmcb_control_area".
Alternatively, a dedicated header for SVM+Hyper-V could be added, a la
x86_64/evmcs.h, but it doesn't appear that Hyper-V will end up needing
a wholesale replacement for the VMCB.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20221101145426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/hyperv: Move VMCB enlightenment definitions to hyperv-tlfs.h

Move Hyper-V's VMCB enlightenment definitions to the TLFS header; the
definitions come directly from the TLFS[*], not from KVM.

No functional change intended.

[*] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_svm_enlightened_vmcb_fields

[vitaly: rename VMCB_HV_ -> HV_VMCB_ to match the rest of
hyperv-tlfs.h, keep svm/hyperv.h]

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20221101145426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: avoid memslot check in NX hugepage recovery if it cannot succeed

Since gfn_to_memslot() is relatively expensive, it helps to
skip it if it the memslot cannot possibly have dirty logging
enabled.  In order to do this, add to struct kvm a counter
of the number of log-page memslots.  While the correct value
can only be read with slots_lock taken, the NX recovery thread
is content with using an approximate value.  Therefore, the
counter is an atomic_t.

Based on https://lore.kernel.org/kvm/20221027200316.2221027 [email protected]/
by David Matlack.

Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'kvm-svm-harden' into HEAD

This fixes three issues in nested SVM:

1) in the shutdown_interception() vmexit handler we call kvm_vcpu_reset().
However, if running nested and L1 doesn't intercept shutdown, the function
resets vcpu->arch.hflags without properly leaving the nested state.
This leaves the vCPU in inconsistent state and later triggers a kernel
panic in SVM code. The same bug can likely be triggered by sending INIT
via local apic to a vCPU which runs a nested guest.

On VMX we are lucky that the issue can't happen because VMX always
intercepts triple faults, thus triple fault in L2 will always be
redirected to L1. Plus, handle_triple_fault() doesn't reset the vCPU.
INIT IPI can't happen on VMX either because INIT events are masked while
in VMX mode.

Secondarily, KVM doesn't honour SHUTDOWN intercept bit of L1 on SVM.
A normal hypervisor should always intercept SHUTDOWN, a unit test on
the other hand might want to not do so.

Finally, the guest can trigger a kernel non rate limited printk on SVM
from the guest, which is fixed as well.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: remove exit_int_info warning in svm_handle_exit

It is valid to receive external interrupt and have broken IDT entry,
which will lead to #GP with exit_int_into that will contain the index of
the IDT entry (e.g any value).

Other exceptions can happen as well, like #NP or #SS
(if stack switch fails).

Thus this warning can be user triggred and has very little value.

Cc: [email protected]
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: add svm part to triple_fault_test

Add a SVM implementation to triple_fault_test to test that
emulated/injected shutdown works.

Since instead of the VMX, the SVM allows the hypervisor to avoid
intercepting shutdown in guest, don't intercept shutdown to test that
KVM suports this correctly.

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: allow L1 to not intercept triple fault

This is SVM correctness fix - although a sane L1 would intercept
SHUTDOWN event, it doesn't have to, so we have to honour this.

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: selftests: add svm nested shutdown test

Add test that tests that on SVM if L1 doesn't intercept SHUTDOWN,
then L2 crashes L1 and doesn't crash L2

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: move idt_entry to header

struct idt_entry will be used for a test which will break IDT on purpose.

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: forcibly leave nested mode on vCPU reset

While not obivous, kvm_vcpu_reset() leaves the nested mode by clearing
'vcpu->arch.hflags' but it does so without all the required housekeeping.

On SVM, it is possible to have a vCPU reset while in guest mode because
unlike VMX, on SVM, INIT's are not latched in SVM non root mode and in
addition to that L1 doesn't have to intercept triple fault, which should
also trigger L1's reset if happens in L2 while L1 didn't intercept it.

If one of the above conditions happen, KVM will continue to use vmcb02
while not having in the guest mode.

Later the IA32_EFER will be cleared which will lead to freeing of the
nested guest state which will (correctly) free the vmcb02, but since
KVM still uses it (incorrectly) this will lead to a use after free
and kernel crash.

This issue is assigned CVE-2022-3344

Cc: [email protected]
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: add kvm_leave_nested

add kvm_leave_nested which wraps a call to nested_ops->leave_nested
into a function.

Cc: [email protected]
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use

Make sure that KVM uses vmcb01 before freeing nested state, and warn if
that is not the case.

This is a minimal fix for CVE-2022-3344 making the kernel print a warning
instead of a kernel panic.

Cc: [email protected]
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: nSVM: leave nested mode on vCPU free

If the VM was terminated while nested, we free the nested state
while the vCPU still is in nested mode.

Soon a warning will be added for this condition.

Cc: [email protected]
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20221103141351 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages

Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked,
as it will just be faulted back in at the same 4KiB granularity when
accessed by a vCPU. This may need to be changed if KVM ever supports
2MiB (or larger) dirty tracking granularity, or faulting huge pages
during dirty tracking for reads/executes. However for now, these zaps
are entirely wasteful.

In order to check if this commit increases the CPU usage of the NX
recovery worker thread I used a modified version of execute_perf_test
[1] that supports splitting guest memory into multiple slots and reports
/proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just
before tearing down the VM. The goal was to force a large number of NX
Huge Page recoveries and see if the recovery worker used any more CPU.

Test Setup:

  echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
  echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio

Test Command:

  ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o

        | kvm-nx-lpage-re:se.sum_exec_runtime      |
        | ---------------------------------------- |
Run     | Before             | After               |
------- | ------------------ | ------------------- |
1       | 730.084105         | 724.375314          |
2       | 728.751339         | 740.581988          |
3       | 736.264720         | 757.078163          |

Comparing the median results, this commit results in about a 1% increase
CPU usage of the NX recovery worker when testing a VM with 16 slots.
However, the effect is negligible with the default halving time of NX
pages, which is 1 hour rather than 10 seconds given by period_ms = 1000,
ratio = 10.

[1] https://lore.kernel.org/kvm/20221019234050.3919566 [email protected]/

Signed-off-by: David Matlack <[email protected]>
Message-Id: <20221103204421.1146958 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry

A removed SPTE is never present, hence the "if" in kvm_tdp_mmu_map
only fails in the exact same conditions that the earlier loop
tested in order to issue a "break". So, instead of checking twice the
condition (upper level SPTEs could not be created or was frozen), just
exit the loop with a goto---the usual poor-man C replacement for RAII
early returns.

While at it, do not use the "ret" variable for return values of
functions that do not return a RET_PF_* enum. This is clearer
and also makes it possible to initialize ret to RET_PF_RETRY.

Suggested-by: Robert Hoo <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault

Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.

This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.

For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:

$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96

Populating memory             : 2.748378795s
Executing from memory         : 2.899670885s

With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:

$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96

Populating memory             : 2.729544474s
Executing from memory         : 0.111965688s   <---

This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.

             | Config: ept=Y, tdp_mmu=Y, 5% writes           |
             | Iteration 1 dirty memory time                 |
             | --------------------------------------------- |
vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
------------ | -------------- | ------------- | ------------ |
2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
96           | 3.016535213s   | 0.062608739s  | 0.044760838s |

Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.

             | Config: ept=Y, tdp_mmu=Y, 100% writes         |
             | Iteration 1 dirty memory time                 |
             | --------------------------------------------- |
vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
------------ | -------------- | ------------- | ------------ |
2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
96           | 2.681123208s   | 0.673778787s  | 0.165386739s |

Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Message-Id: <20221109185905 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvm-selftests-6.2-1' of https://github.com/kvm-x86/linux into HEAD

KVM selftests updates for 6.2

perf_util:
- Add support for pinning vCPUs in dirty_log_perf_test.
- Add a lightweight psuedo RNG for guest use, and use it to randomize
   the access pattern and write vs. read percentage in the so called
   "perf util" tests.
- Rename the so called "perf_util" framework to "memstress".

ucall:
- Add a common pool-based ucall implementation (code dedup and pre-work
   for running SEV (and beyond) guests in selftests.
- Fix an issue in ARM's single-step test when using the new pool-based
   implementation; LDREX/STREX don't play nice with single-step exceptions.

init:
- Provide a common constructor and arch hook, which will eventually be
   used by x86 to automatically select the right hypercall (AMD vs. Intel).

x86:
- Clean up x86's page tabe management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a related
   test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

KVM: selftests: Check for KVM nEPT support using "feature" MSRs

When checking for nEPT support in KVM, use kvm_get_feature_msr() instead
of vcpu_get_msr() to retrieve KVM's default TRUE_PROCBASED_CTLS and
PROCBASED_CTLS2 MSR values, i.e. don't require a VM+vCPU to query nEPT
support.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rebase on merged code, write changelog]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Assert in prepare_eptp() that nEPT is supported

Now that a VM isn't needed to check for nEPT support, assert that KVM
supports nEPT in prepare_eptp() instead of skipping the test, and push
the TEST_REQUIRE() check out to individual tests. The require+assert are
somewhat redundant and will incur some amount of ongoing maintenance
burden, but placing the "require" logic in the test makes it easier to
find/understand a test's requirements and in this case, provides a very
strong hint that the test cares about nEPT.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rebase on merged code, write changelog]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Drop helpers for getting specific KVM supported CPUID entry

Drop kvm_get_supported_cpuid_entry() and its inner helper now that all
known usage can use X86_FEATURE_*, X86_PROPERTY_*, X86_PMU_FEATURE_*, or
the dedicated Family/Model helpers. Providing "raw" access to CPUID
leafs is undesirable as it encourages open coding CPUID checks, which is
often error prone and not self-documenting.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add and use KVM helpers for x86 Family and Model

Add KVM variants of the x86 Family and Model helpers, and use them in the
PMU event filter test. Open code the retrieval of KVM's supported CPUID
entry 0x1.0 in anticipation of dropping kvm_get_supported_cpuid_entry().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add dedicated helpers for getting x86 Family and Model

Add dedicated helpers for getting x86's Family and Model, which are the
last holdouts that "need" raw access to CPUID information. FMS info is
a mess and requires not only splicing together multiple values, but
requires doing so conditional in the Family case.

Provide wrappers to reduce the odds of copy+paste errors, but mostly to
allow for the eventual removal of kvm_get_supported_cpuid_entry().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add PMU feature framework, use in PMU event filter test

Add an X86_PMU_FEATURE_* framework to simplify probing architectural
events on Intel PMUs, which require checking the length of a bit vector
and the _absence_ of a "feature" bit. Add helpers for both KVM and
"this CPU", and use the newfangled magic (along with X86_PROPERTY_*)
to clean up pmu_event_filter_test.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Convert vmx_pmu_caps_test to use X86_PROPERTY_*

Add X86_PROPERTY_PMU_VERSION and use it in vmx_pmu_caps_test to replace
open coded versions of the same functionality.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Convert AMX test to use X86_PROPRETY_XXX

Add and use x86 "properties" for the myriad AMX CPUID values that are
validated by the AMX test. Drop most of the test's single-usage
helpers so that the asserts more precisely capture what check failed.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add kvm_cpu_*() support for X86_PROPERTY_*

Extent X86_PROPERTY_* support to KVM, i.e. add kvm_cpu_property() and
kvm_cpu_has_p(), and use the new helpers in kvm_get_cpu_address_width().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Refactor kvm_cpuid_has() to prep for X86_PROPERTY_* support

Refactor kvm_cpuid_has() to prepare for extending X86_PROPERTY_* support
to KVM as well as "this CPU".

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Use X86_PROPERTY_MAX_KVM_LEAF in CPUID test

Use X86_PROPERTY_MAX_KVM_LEAF to replace the equivalent open coded check
on KVM's maximum paravirt CPUID leaf.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add X86_PROPERTY_* framework to retrieve CPUID values

Introduce X86_PROPERTY_* to allow retrieving values/properties from CPUID
leafs, e.g. MAXPHYADDR from CPUID.0x80000008. Use the same core code as
X86_FEATURE_*, the primary difference is that properties are multi-bit
values, whereas features enumerate a single bit.

Add this_cpu_has_p() to allow querying whether or not a property exists
based on the maximum leaf associated with the property, e.g. MAXPHYADDR
doesn't exist if the max leaf for 0x8000_xxxx is less than 0x8000_0008.

Use the new property infrastructure in vm_compute_max_gfn() to prove
that the code works as intended. Future patches will convert additional
selftests code.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Refactor X86_FEATURE_* framework to prep for X86_PROPERTY_*

Refactor the X86_FEATURE_* framework to prepare for extending the core
logic to support "properties". The "feature" framework allows querying a
single CPUID bit to detect the presence of a feature; the "property"
framework will extend the idea to allow querying a value, i.e. to get a
value that is a set of contiguous bits in a CPUID leaf.

Opportunistically add static asserts to ensure features are fully defined
at compile time, and to try and catch mistakes in the definition of
features.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add X86_FEATURE_PAE and use it calc "fallback" MAXPHYADDR

Add X86_FEATURE_PAE and use it to guesstimate the MAXPHYADDR when the
MAXPHYADDR CPUID entry isn't supported.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Add a test for KVM_CAP_EXIT_ON_EMULATION_FAILURE

Add a selftest to exercise the KVM_CAP_EXIT_ON_EMULATION_FAILURE
capability.

This capability is also exercised through
smaller_maxphyaddr_emulation_test, but that test requires
allow_smaller_maxphyaddr=Y, which is off by default on Intel when ept=Y
and unconditionally disabled on AMD when npt=Y. This new test ensures
that KVM_CAP_EXIT_ON_EMULATION_FAILURE is exercised independent of
allow_smaller_maxphyaddr.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Expect #PF(RSVD) when TDP is disabled

Change smaller_maxphyaddr_emulation_test to expect a #PF(RSVD), rather
than an emulation failure, when TDP is disabled. KVM only needs to
emulate instructions to emulate a smaller guest.MAXPHYADDR when TDP is
enabled.

Fixes: 39bbcc3a4e39 ("selftests: kvm: Allows userspace to handle emulation errors.")
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage comment to talk about having to emulate due to MAXPHYADDR]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Provide error code as a KVM_ASM_SAFE() output

Provide the error code on a fault in KVM_ASM_SAFE(), e.g. to allow tests
to assert that #PF generates the correct error code without needing to
manually install a #PF handler.  Use r10 as the scratch register for the
error code, as it's already clobbered by the asm blob (loaded with the
RIP of the to-be-executed instruction).  Deliberately load the output
"error_code" even in the non-faulting path so that error_code is always
initialized with deterministic data (the aforementioned RIP), i.e to
ensure a selftest won't end up with uninitialized consumption regardless
of how KVM_ASM_SAFE() is used.

Don't clear r10 in the non-faulting case and instead load error code with
the RIP (see above).  The error code is valid if and only if an exception
occurs, and '0' isn't necessarily a better "invalid" value, e.g. '0'
could result in false passes for a buggy test.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Avoid JMP in non-faulting path of KVM_ASM_SAFE()

Clear R9 in the non-faulting path of KVM_ASM_SAFE() and fall through to
to a common load of "vector" to effectively load "vector" with '0' to
reduce the code footprint of the asm blob, to reduce the runtime overhead
of the non-faulting path (when "vector" is stored in a register), and so
that additional output constraints that are valid if and only if a fault
occur are loaded even in the non-faulting case.

A future patch will add a 64-bit output for the error code, and if its
output is not explicitly loaded with _something_, the user of the asm
blob can end up technically consuming uninitialized data. Using a
common path to load the output constraints will allow using an existing
scratch register, e.g. r10, to hold the error code in the faulting path,
while also guaranteeing the error code is initialized with deterministic
data in the non-faulting patch (r10 is loaded with the RIP of
to-be-executed instruction).

Consuming the error code when a fault doesn't occur would obviously be a
test bug, but there's no guarantee the compiler will detect uninitialized
consumption. And conversely, it's theoretically possible that the
compiler might throw a false positive on uninitialized data, e.g. if the
compiler can't determine that the non-faulting path won't touch the error
code.

Alternatively, the error code could be explicitly loaded in the
non-faulting path, but loading a 64-bit memory|register output operand
with an explicitl value requires a sign-extended "MOV imm32, r/m64",
which isn't exactly straightforward and has a largish code footprint.
And loading the error code with what is effectively garbage (from a
scratch register) avoids having to choose an arbitrary value for the
non-faulting case.

Opportunistically remove a rogue asterisk in the block comment.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Copy KVM PFERR masks into selftests

Copy KVM's macros for page fault error masks into processor.h so they
can be used in selftests.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Use BIT{,_ULL}() for PFERR masks

Use the preferred BIT() and BIT_ULL() to construct the PFERR masks
rather than open-coding the bit shifting.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Move flds instruction emulation failure handling to header

Move the flds instruction emulation failure handling code to a header
so it can be re-used in an upcoming test.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Delete dead ucall code

Delete a bunch of code related to ucall handling from
smaller_maxphyaddr_emulation_test. The only thing
smaller_maxphyaddr_emulation_test needs to check is that the vCPU exits
with UCALL_DONE after the second vcpu_run().

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Explicitly require instructions bytes

Hard-code the flds instruction and assert the exact instruction bytes
are present in run->emulation_failure. The test already requires the
instruction bytes to be present because that's the only way the test
will advance the RIP past the flds and get to GUEST_DONE().

Note that KVM does not necessarily return exactly 2 bytes in
run->emulation_failure since it may not know the exact instruction
length in all cases. So just assert that
run->emulation_failure.insn_size is at least 2.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Rename emulator_error_test to smaller_maxphyaddr_emulation_test

Rename emulator_error_test to smaller_maxphyaddr_emulation_test and
update the comment at the top of the file to document that this is
explicitly a test to validate that KVM emulates instructions in response
to an EPT violation when emulating a smaller MAXPHYADDR.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Don't assume vcpu->id is '0' in xAPIC state test

In xapic_state_test's test_icr(), explicitly skip iterations that would
match vcpu->id instead of assuming vcpu->id is '0', so that IPIs are
are correctly sent to non-existent vCPUs.

Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/kvm/[email protected]
Signed-off-by: Gautam Menghani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage shortlog and changelog]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Add arch specific post vm creation hook

Add arch specific API kvm_arch_vm_post_create to perform any required setup
after VM creation.

Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Peter Gonda <[email protected]>
Signed-off-by: Vishal Annapurve <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: place x86's implementation by vm_arch_vcpu_add()]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Add arch specific initialization

Introduce arch specific API: kvm_selftest_arch_init to allow each arch to
handle initialization before running any selftest logic.

Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Peter Gonda <[email protected]>
Signed-off-by: Vishal Annapurve <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: move common startup logic to kvm_util.c

Consolidate common startup logic in one place by implementing a single
setup function with __attribute((constructor)) for all selftests within
kvm_util.c.

This allows moving logic like:
/* Tell stdout not to buffer its content */
setbuf(stdout, NULL);
to a single file for all selftests.

This will also allow any required setup at entry in future to be done in
common main function.

Link: https://lore.kernel.org/lkml/Ywa9T+jKUpaHLu%[email protected]
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Peter Gonda <[email protected]>
Signed-off-by: Vishal Annapurve <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Play nice with huge pages when getting PTEs/GPAs

Play nice with huge pages when getting PTEs and translating GVAs to GPAs,
there's no reason to disallow using huge pages in selftests. Use
PG_LEVEL_NONE to indicate that the caller doesn't care about the mapping
level and just wants to get the pte+level.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Use vm_get_page_table_entry() in addr_arch_gva2gpa()

Use vm_get_page_table_entry() in addr_arch_gva2gpa() to get the leaf PTE
instead of manually walking page tables.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Use virt_get_pte() when getting PTE pointer

Use virt_get_pte() in vm_get_page_table_entry() instead of open coding
equivalent code.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Verify parent PTE is PRESENT when getting child PTE

Verify the parent PTE is PRESENT when getting a child via virt_get_pte()
so that the helper can be used for getting PTEs/GPAs without losing
sanity checks that the walker isn't wandering into the weeds.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Remove useless shifts when creating guest page tables

Remove the pointless shift from GPA=>GFN and immediately back to
GFN=>GPA when creating guest page tables. Ignore the other walkers
that have a similar pattern for the moment, they will be converted
to use virt_get_pte() in the near future.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Drop reserved bit checks from PTE accessor

Drop the reserved bit checks from the helper to retrieve a PTE, there's
very little value in sanity checking the constructed page tables as any
will quickly be noticed in the form of an unexpected #PF. The checks
also place unnecessary restrictions on the usage of the helpers, e.g. if
a test _wanted_ to set reserved bits for whatever reason.

Removing the NX check in particular allows for the removal of the @vcpu
param, which will in turn allow the helper to be reused nearly verbatim
for addr_gva2gpa().

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Drop helpers to read/write page table entries

Drop vm_{g,s}et_page_table_entry() and instead expose the "inner"
helper (was _vm_get_page_table_entry()) that returns a _pointer_ to the
PTE, i.e. let tests directly modify PTEs instead of bouncing through
helpers that just make life difficult.

Opportunsitically use BIT_ULL() in emulator_error_test, and use the
MAXPHYADDR define to set the "rogue" GPA bit instead of open coding the
same value.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Fix spelling mistake "begining" -> "beginning"

There is a spelling mistake in an assert message. Fix it.

Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: fix an ironic typo in the changelog]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Add ucall pool based implementation

To play nice with guests whose stack memory is encrypted, e.g. AMD SEV,
introduce a new "ucall pool" implementation that passes the ucall struct
via dedicated memory (which can be mapped shared, a.k.a. as plain text).

Because not all architectures have access to the vCPU index in the guest,
use a bitmap with atomic accesses to track which entries in the pool are
free/used.  A list+lock could also work in theory, but synchronizing the
individual pointers to the guest would be a mess.

Note, there's no need to rewalk the bitmap to ensure success.  If all
vCPUs are simply allocating, success is guaranteed because there are
enough entries for all vCPUs.  If one or more vCPUs are freeing and then
reallocating, success is guaranteed because vCPUs _always_ walk the
bitmap from 0=>N; if vCPU frees an entry and then wins a race to
re-allocate, then either it will consume the entry it just freed (bit is
the first free bit), or the losing vCPU is guaranteed to see the freed
bit (winner consumes an earlier bit, which the loser hasn't yet visited).

Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Peter Gonda <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Drop now-unnecessary ucall_uninit()

Drop ucall_uninit() and ucall_arch_uninit() now that ARM doesn't modify
the host's copy of ucall_exit_mmio_addr, i.e. now that there's no need to
reset the pointer before potentially creating a new VM. The few calls to
ucall_uninit() are all immediately followed by kvm_vm_free(), and that is
likely always going to hold true, i.e. it's extremely unlikely a test
will want to effectively disable ucall in the middle of a test.

Reviewed-by: Andrew Jones <[email protected]>
Tested-by: Peter Gonda <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Make arm64's MMIO ucall multi-VM friendly

Fix a mostly-theoretical bug where ARM's ucall MMIO setup could result in
different VMs stomping on each other by cloberring the global pointer.

Fix the most obvious issue by saving the MMIO gpa into the VM.

A more subtle bug is that creating VMs in parallel (on multiple tasks)
could result in a VM using the wrong address. Synchronizing a global to
a guest effectively snapshots the value on a per-VM basis, i.e. the
"global" is already prepped to work with multiple VMs, but setting the
global in the host is not thread-safe. To fix that bug, add
write_guest_global() to allow stuffing a VM's copy of a "global" without
modifying the host value.

Reviewed-by: Andrew Jones <[email protected]>
Tested-by: Peter Gonda <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

tools: Add atomic_test_and_set_bit()

Add x86 and generic implementations of atomic_test_and_set_bit() to allow
KVM selftests to atomically manage bitmaps.

Note, the generic version is taken from arch_test_and_set_bit() as of
commit 415d83249709 ("locking/atomic: Make test_and_*_bit() ordered on
failure").

Signed-off-by: Peter Gonda <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Automatically do init_ucall() for non-barebones VMs

Do init_ucall() automatically during VM creation to kill two (three?)
birds with one stone.

First, initializing ucall immediately after VM creations allows forcing
aarch64's MMIO ucall address to immediately follow memslot0.  This is
still somewhat fragile as tests could clobber the MMIO address with a
new memslot, but it's safe-ish since tests have to be conversative when
accounting for memslot0.  And this can be hardened in the future by
creating a read-only memslot for the MMIO page (KVM ARM exits with MMIO
if the guest writes to a read-only memslot).  Add a TODO to document that
selftests can and should use a memslot for the ucall MMIO (doing so
requires yet more rework because tests assumes thay can use all memslots
except memslot0).

Second, initializing ucall for all VMs prepares for making ucall
initialization meaningful on all architectures.  aarch64 is currently the
only arch that needs to do any setup, but that will change in the future
by switching to a pool-based implementation (instead of the current
stack-based approach).

Lastly, defining the ucall MMIO address from common code will simplify
switching all architectures (except s390) to a common MMIO-based ucall
implementation (if there's ever sufficient motivation to do so).

Cc: Oliver Upton <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Tested-by: Peter Gonda <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Consolidate boilerplate code in get_ucall()

Consolidate the actual copying of a ucall struct from guest=>host into
the common get_ucall(). Return a host virtual address instead of a guest
virtual address even though the addr_gva2hva() part could be moved to
get_ucall() too. Conceptually, get_ucall() is invoked from the host and
should return a host virtual address (and returning NULL for "nothing to
see here" is far superior to returning 0).

Use pointer shenanigans instead of an unnecessary bounce buffer when the
caller of get_ucall() provides a valid pointer.

Reviewed-by: Andrew Jones <[email protected]>
Tested-by: Peter Gonda <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: selftests: Consolidate common code for populating ucall struct

Make ucall() a common helper that populates struct ucall, and only calls
into arch code to make the actually call out to userspace.

Rename all arch-specific helpers to make it clear they're arch-specific,
and to avoid collisions with common helpers (one more on its way...)

Add WRITE_ONCE() to stores in ucall() code (as already done to aarch64
code in commit 9e2f6498efbb ("selftests: KVM: Handle compiler
optimizations in ucall")) to prevent clang optimizations breaking ucalls.

Cc: Colton Lewis <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Tested-by: Peter Gonda <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: selftests: Disable single-step without relying on ucall()

Automatically disable single-step when the guest reaches the end of the
verified section instead of using an explicit ucall() to ask userspace to
disable single-step.  An upcoming change to implement a pool-based scheme
for ucall() will add an atomic operation (bit test and set) in the guest
ucall code, and if the compiler generate "old school" atomics, e.g.

  40e57c:       c85f7c20        ldxr    x0, [x1]
  40e580:       aa100011        orr     x17, x0, x16
  40e584:       c80ffc31        stlxr   w15, x17, [x1]
  40e588:       35ffffaf        cbnz    w15, 40e57c <__aarch64_ldset8_sync+0x1c>

the guest will hang as the local exclusive monitor is reset by eret,
i.e. the stlxr will always fail due to the debug exception taken to EL2.

Link: https://lore.kernel.org/all/[email protected]
Cc: Oliver Upton <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Oliver Upton <[email protected]>

KVM: arm64: selftests: Disable single-step with correct KVM define

Disable single-step by setting debug.control to KVM_GUESTDBG_ENABLE,
not to SINGLE_STEP_DISABLE. The latter is an arbitrary test enum that
just happens to have the same value as KVM_GUESTDBG_ENABLE, and so
effectively disables single-step debug.

No functional change intended.

Cc: Reiji Watanabe <[email protected]>
Fixes: b18e4d4aebdd ("KVM: arm64: selftests: Add a test case for KVM_GUESTDBG_SINGLESTEP")
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Oliver Upton <[email protected]>

KVM: selftests: Rename perf_test_util symbols to memstress

Replace the perf_test_ prefix on symbol names with memstress_ to match
the new file name.

"memstress" better describes the functionality proveded by this library,
which is to provide functionality for creating and running a VM that
stresses VM memory by reading and writing to guest memory on all vCPUs
in parallel.

"memstress" also contains the same number of chracters as "perf_test",
making it a drop-in replacement in symbols, e.g. function names, without
impacting line lengths. Also the lack of underscore between "mem" and
"stress" makes it clear "memstress" is a noun.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Rename pta (short for perf_test_args) to args

Rename the local variables "pta" (which is short for perf_test_args) for
args. "pta" is not an obvious acronym and using "args" mirrors
"vcpu_args".

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Rename perf_test_util.[ch] to memstress.[ch]

Rename the perf_test_util.[ch] files to memstress.[ch]. Symbols are
renamed in the following commit to reduce the amount of churn here in
hopes of playiing nice with git's file rename detection.

The name "memstress" was chosen to better describe the functionality
proveded by this library, which is to create and run a VM that
reads/writes to guest memory on all vCPUs in parallel.

"memstress" also contains the same number of chracters as "perf_test",
making it a drop-in replacement in symbols, e.g. function names, without
impacting line lengths. Also the lack of underscore between "mem" and
"stress" makes it clear "memstress" is a noun.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: randomize page access order

Create the ability to randomize page access order with the -a
argument. This includes the possibility that the same pages may be hit
multiple times during an iteration or not at all.

Population has random access as false to ensure all pages will be
touched by population and avoid page faults in late dirty memory that
would pollute the test results.

Signed-off-by: Colton Lewis <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: randomize which pages are written vs read

Randomize which pages are written vs read using the random number
generator.

Change the variable wr_fract and associated function calls to
write_percent that now operates as a percentage from 0 to 100 where X
means each page has an X% chance of being written. Change the -f
argument to -w to reflect the new variable semantics. Keep the same
default of 100% writes.

Population always uses 100% writes to ensure all memory is actually
populated and not just mapped to the zero page. The prevents expensive
copy-on-write faults from occurring during the dirty memory iterations
below, which would pollute the performance results.

Each vCPU calculates its own random seed by adding its index to the
seed provided.

Signed-off-by: Colton Lewis <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: create -r argument to specify random seed

Create a -r argument to specify a random seed. If no argument is
provided, the seed defaults to 1. The random seed is set with
perf_test_set_random_seed() and must be set before guest_code runs to
apply.

Signed-off-by: Colton Lewis <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: implement random number generator for guest code

Implement random number generator for guest code to randomize parts
of the test, making it less predictable and a more accurate reflection
of reality.

The random number generator chosen is the Park-Miller Linear
Congruential Generator, a fancy name for a basic and well-understood
random number generator entirely sufficient for this purpose.

Signed-off-by: Colton Lewis <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Allowing running dirty_log_perf_test on specific CPUs

Add a command line option, -c, to pin vCPUs to physical CPUs (pCPUs),
i.e. to force vCPUs to run on specific pCPUs.

Requirement to implement this feature came in discussion on the patch
"Make page tables for eager page splitting NUMA aware"
https://lore.kernel.org/lkml/[email protected]/

This feature is useful as it provides a way to analyze performance based
on the vCPUs and dirty log worker locations, like on the different NUMA
nodes or on the same NUMA nodes.

To keep things simple, implementation is intentionally very limited,
either all of the vCPUs will be pinned followed by an optional main
thread or nothing will be pinned.

Signed-off-by: Vipin Sharma <[email protected]>
Suggested-by: David Matlack <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Add atoi_positive() and atoi_non_negative() for input validation

Many KVM selftests take command line arguments which are supposed to be
positive (>0) or non-negative (>=0). Some tests do these validation and
some missed adding the check.

Add atoi_positive() and atoi_non_negative() to validate inputs in
selftests before proceeding to use those values.

Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Shorten the test args in memslot_modification_stress_test.c

Change test args memslot_modification_delay and nr_memslot_modifications
to delay and nr_iterations for simplicity.

Signed-off-by: Vipin Sharma <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Use SZ_* macros from sizes.h in max_guest_memory_test.c

Replace size_1gb defined in max_guest_memory_test.c with the SZ_1G,
SZ_2G and SZ_4G from linux/sizes.h header file.

Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Add atoi_paranoid() to catch errors missed by atoi()

atoi() doesn't detect errors. There is no way to know that a 0 return
is correct conversion or due to an error.

Introduce atoi_paranoid() to detect errors and provide correct
conversion. Replace all atoi() calls with atoi_paranoid().

Signed-off-by: Vipin Sharma <[email protected]>
Suggested-by: David Matlack <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Put command line options in alphabetical order in dirty_log_perf_test

There are 13 command line options and they are not in any order. Put
them in alphabetical order to make it easy to add new options.

No functional change intended.

Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: Wei Wang <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: selftests: Add missing break between -e and -g option in dirty_log_perf_test

Passing -e option (Run VCPUs while dirty logging is being disabled) in
dirty_log_perf_test also unintentionally enables -g (Do not enable
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2). Add break between two switch case
logic.

Fixes: cfe12e64b065 ("KVM: selftests: Add an option to run vCPUs while disabling dirty logging")
Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: replace direct irq.h inclusion

virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
directory (i.e. not from arch/*/include) for the sole purpose of retrieving
irqchip_in_kernel.

Making the function inline in a header that is already included,
such as asm/kvm_host.h, is not possible because it needs to look at
struct kvm which is defined after asm/kvm_host.h is included. So add a
kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
only performance critical on arm64 and x86, and the non-inline function
is enough on all other architectures.

irq.h can then be deleted from all architectures except x86.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter

Defer reprogramming counters and handling overflow via KVM_REQ_PMU
when incrementing counters.  KVM skips emulated WRMSR in the VM-Exit
fastpath, the fastpath runs with IRQs disabled, skipping instructions
can increment and reprogram counters, reprogramming counters can
sleep, and sleeping is disallowed while IRQs are disabled.

[*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
[*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
[*] preempt_count: 1, expected: 0
[*] RCU nest depth: 0, expected: 0
[*] INFO: lockdep is turned off.
[*] irq event stamp: 0
[*] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
[*] softirqs last  enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
[*] softirqs last disabled at (0): [<0000000000000000>] 0x0
[*] Preemption disabled at:
[*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
[*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
[*] Call Trace:
[*]  <TASK>
[*]  dump_stack_lvl+0x6c/0x9b
[*]  __might_resched.cold+0x22e/0x297
[*]  __mutex_lock+0xc0/0x23b0
[*]  perf_event_ctx_lock_nested+0x18f/0x340
[*]  perf_event_pause+0x1a/0x110
[*]  reprogram_counter+0x2af/0x1490 [kvm]
[*]  kvm_pmu_trigger_event+0x429/0x950 [kvm]
[*]  kvm_skip_emulated_instruction+0x48/0x90 [kvm]
[*]  handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
[*]  vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
[*]  vcpu_enter_guest+0x1d22/0x3dc0 [kvm]

Add a field to kvm_pmc to track the previous counter value in order
to defer overflow detection to kvm_pmu_handle_event() (the counter must
be paused before handling overflow, and that may increment the counter).

Opportunistically shrink sizeof(struct kvm_pmc) a bit.

Suggested-by: Wanpeng Li <[email protected]>
Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220923001355.3741194 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()

Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus
deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming
a counter multiple times during a single VM-Exit.

Deferring programming will also allow KVM to fix a bug where immediately
reprogramming a counter can result in sleeping (taking a mutex) while
interrupts are disabled in the VM-Exit fastpath.

Introduce kvm_pmu_request_counter_reprogam() to make it obvious that
KVM is _requesting_ a reprogram and not actually doing the reprogram.

Opportunistically refine related comments to avoid misunderstandings.

Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220923001355.3741194 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowed

When reprogramming a counter, clear the counter's "reprogram pending" bit
if the counter is disabled (by the guest) or is disallowed (by the
userspace filter). In both cases, there's no need to re-attempt
programming on the next coincident KVM_REQ_PMU as enabling the counter by
either method will trigger reprogramming.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220923001355.3741194 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/pmu: Force reprogramming of all counters on PMU filter change

Force vCPUs to reprogram all counters on a PMU filter change to provide
a sane ABI for userspace.  Use the existing KVM_REQ_PMU to do the
programming, and take advantage of the fact that the reprogram_pmi bitmap
fits in a u64 to set all bits in a single atomic update.  Note, setting
the bitmap and making the request needs to be done _after_ the SRCU
synchronization to ensure that vCPUs will reprogram using the new filter.

KVM's current "lazy" approach is confusing and non-deterministic.  It's
confusing because, from a developer perspective, the code is buggy as it
makes zero sense to let userspace modify the filter but then not actually
enforce the new filter.  The lazy approach is non-deterministic because
KVM enforces the filter whenever a counter is reprogrammed, not just on
guest WRMSRs, i.e. a guest might gain/lose access to an event at random
times depending on what is going on in the host.

Note, the resulting behavior is still non-determinstic while the filter
is in flux.  If userspace wants to guarantee deterministic behavior, all
vCPUs should be paused during the filter update.

Jim Mattson <[email protected]>

Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter")
Cc: Aaron Lewis <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20220923001355.3741194 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped

Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the
TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed
NX huge page regardless of the MMU type. Recovery runs while holding
mmu_lock for write and so it should be impossible to get false positives
on the WARN.

Suggested-by: Yan Zhao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

Explicitly check if a NX huge page is disallowed when determining if a
page fault needs to be forced to use a smaller sized page. KVM currently
assumes that the NX huge page mitigation is the only scenario where KVM
will force a shadow page instead of a huge page, and so unnecessarily
keeps an existing shadow page instead of replacing it with a huge page.

Any scenario that causes KVM to zap leaf SPTEs may result in having a SP
that can be made huge without violating the NX huge page mitigation.
E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging"), KVM would keep shadow pages after disabling
dirty logging due to a live migration being canceled, resulting in
degraded performance due to running with 4kb pages instead of huge pages.

Although the dirty logging case is "fixed", that fix is coincidental,
i.e. is an implementation detail, and there are other scenarios where KVM
will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page
migration (mmu_notifier invalidation) to create a huge page would yield a
similar result; KVM would see the shadow-present non-leaf SPTE and assume
a huge page is disallowed.

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Reviewed-by: Ben Gardon <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
[sean: use spte_to_child_sp(), massage changelog, fold into if-statement]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Yan Zhao <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

Add a helper to convert a SPTE to its shadow page to deduplicate a
variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
get the shadow page for a SPTE without dropping high bits.

Opportunistically add a comment in mmu_free_root_page() documenting why
it treats the root HPA as a SPTE.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, elminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.

Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.

Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.

Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.

Cc: Yan Zhao <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Yan Zhao <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE

Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE. This will allow
KVM to query the flag when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.

Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
for another CPU to see a shadow page without an up-to-date
nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
dance.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Yan Zhao <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs

Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page can't be replaced by a NX
huge page doesn't get false positives.  Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.

Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs.  The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled so there's no security or performance concerns.  However, getting
make_spte() to play nice now and in the future is unnecessarily complex.

In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.

Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs.  Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Rename NX huge pages fields/functions for consistency

Rename most of the variables/functions involved in the NX huge page
mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
applies only to the NX huge page mitigation, not to any condition that
prevents creating a huge page.

Add a comment explaining what the newly named "possible_nx_huge_pages"
tracks.

Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
stone.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.

I.e. track pages that are disallowed from being NX huge pages regardless
of whether or not the page could have been huge at the time of fault.
KVM currently tracks pages that were disallowed from being huge due to
the NX workaround if and only if the page could otherwise be huge. But
that fails to handled the scenario where whatever restriction prevented
KVM from installing a huge page goes away, e.g. if dirty logging is
disabled, the host mapping level changes, etc...

Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.

To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.

Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
more involved due to mmu_lock being held for read. This will be
addressed in a future commit.

Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20221019165618 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: kvm/x86: Test the flags in MSR filtering and MSR exiting

When using the flags in KVM_X86_SET_MSR_FILTER and
KVM_CAP_X86_USER_SPACE_MSR it is expected that an attempt to write to
any of the unused bits will fail. Add testing to walk over every bit
in each of the flag fields in MSR filtering and MSR exiting to verify
that unused bits return and error and used bits, i.e. valid bits,
succeed.

Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <20220921151525 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_range

Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the
struct kvm_msr_filter_range. This simplifies checks that validate
these flags, and makes it easier to introduce new flags in the future.

No functional change intended.

Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <20220921151525 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Add a VALID_MASK for the flag in kvm_msr_filter

Add the mask KVM_MSR_FILTER_VALID_MASK for the flag in the struct
kvm_msr_filter. This makes it easier to introduce new flags in the
future.

No functional change intended.

Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <20220921151525 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Add a VALID_MASK for the MSR exit reason flags

Add the mask KVM_MSR_EXIT_REASON_VALID_MASK for the MSR exit reason
flags. This simplifies checks that validate these flags, and makes it
easier to introduce new flags in the future.

No functional change intended.

Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <20220921151525 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Disallow the use of KVM_MSR_FILTER_DEFAULT_ALLOW in the kernel

Protect the kernel from using the flag KVM_MSR_FILTER_DEFAULT_ALLOW.
Its value is 0, and using it incorrectly could have unintended
consequences. E.g. prevent someone in the kernel from writing something
like this.

if (filter.flags & KVM_MSR_FILTER_DEFAULT_ALLOW)
<allow the MSR>

and getting confused when it doesn't work.

It would be more ideal to remove this flag altogether, but userspace
may already be using it, so protecting the kernel is all that can
reasonably be done at this point.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Aaron Lewis <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <20220921151525 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Allow to respond to generic signals during slow PF

Enable x86 slow page faults to be able to respond to non-fatal signals,
returning -EINTR properly when it happens.

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <20221011195947 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: Add interruptible flag to __gfn_to_pfn_memslot()

Add a new "interruptible" flag showing that the caller is willing to be
interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it
up with a FOLL_INTERRUPTIBLE flag that we've just introduced.

This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIGIPI) even during e.g. handling an userfaultfd page fault.

No functional change intended.

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <20221011195809 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>