Michael Mueller [Wed, 9 Feb 2022 15:22:17 +0000 (16:22 +0100)]
KVM: s390: pv: make use of ultravisor AIV support
This patch enables the ultravisor adapter interruption virtualization
support indicated by UV feature BIT_UV_FEAT_AIV. This allows ISC
interruption injection directly into the GISA IPM for PV kvm guests.
Hardware that does not support this feature will continue to use the
UV interruption interception method to deliver ISC interruptions to
PV kvm guests. For this purpose, the ECA_AIV bit for all guest cpus
will be cleared and the GISA will be disabled during PV CPU setup.
In addition a check in __inject_io() has been removed. That reduces the
required instructions for interruption handling for PV and traditional
kvm guests.
Paolo Bonzini [Mon, 14 Feb 2022 14:13:48 +0000 (09:13 -0500)]
KVM: x86/mmu: clear MMIO cache when unloading the MMU
For cleanliness, do not leave a stale GVA in the cache after all the roots are
cleared. In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if
paging is on, and will not use vcpu_match_mmio_gva at all if paging is off.
However, leaving data in the cache might cause bugs in the future.
Paolo Bonzini [Mon, 22 Nov 2021 18:18:23 +0000 (13:18 -0500)]
KVM: x86/mmu: Always use current mmu's role when loading new PGD
Since the guest PGD is now loaded after the MMU has been set up
completely, the desired role for a cache hit is simply the current
mmu_role. There is no need to compute it again, so __kvm_mmu_new_pgd
can be folded in kvm_mmu_new_pgd.
Paolo Bonzini [Fri, 4 Feb 2022 09:12:31 +0000 (04:12 -0500)]
KVM: x86/mmu: load new PGD after the shadow MMU is initialized
Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
shadow_root_level anymore, pull the PGD load after the initialization of
the shadow MMUs.
Besides being more intuitive, this enables future simplifications
and optimizations because it's not necessary anymore to compute the
role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not
attempting to use a cached PGD to avoid having to figure out the new role.
With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
and avoid unloading all the cached roots.
Paolo Bonzini [Wed, 9 Feb 2022 07:49:47 +0000 (02:49 -0500)]
KVM: x86/mmu: look for a cached PGD when going from 32-bit to 64-bit
Right now, PGD caching avoids placing a PAE root in the cache by using the
old value of mmu->root_level and mmu->shadow_root_level; it does not look
for a cached PGD if the old root is a PAE one, and then frees it using
kvm_mmu_free_roots.
Change the logic instead to free the uncacheable root early.
This way, __kvm_new_mmu_pgd is able to look up the cache when going from
32-bit to 64-bit (if there is a hit, the invalid root becomes the least
recently used). An example of this is nested virtualization with shadow
paging, when a 64-bit L1 runs a 32-bit L2.
As a side effect (which is actually the reason why this patch was
written), PGD caching does not use the old value of mmu->root_level
and mmu->shadow_root_level anymore.
Paolo Bonzini [Mon, 21 Feb 2022 14:31:51 +0000 (09:31 -0500)]
KVM: x86/mmu: do not pass vcpu to root freeing functions
These functions only operate on a given MMU, of which there is more
than one in a vCPU (we care about two, because the third does not have
any roots and is only used to walk guest page tables). They do need a
struct kvm in order to lock the mmu_lock, but they do not need anything
else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them.
Paolo Bonzini [Tue, 8 Feb 2022 22:53:55 +0000 (17:53 -0500)]
KVM: x86/mmu: do not consult levels when freeing roots
Right now, PGD caching requires a complicated dance of first computing
the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling
kvm_init_mmu().
Part of this is due to kvm_mmu_free_roots using mmu->root_level and
mmu->shadow_root_level to distinguish whether the page table uses a single
root or 4 PAE roots. Because kvm_init_mmu() can overwrite mmu->root_level,
kvm_mmu_free_roots() must be called before kvm_init_mmu().
However, even after kvm_init_mmu() there is a way to detect whether the
page table may hold PAE roots, as root.hpa isn't backed by a shadow when
it points at PAE roots. Using this method results in simpler code, and
is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the
MMU has been initialized.
Paolo Bonzini [Mon, 21 Feb 2022 14:28:33 +0000 (09:28 -0500)]
KVM: x86: use struct kvm_mmu_root_info for mmu->root
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.
The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
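As a rough illustration of why a single root type is convenient, the sketch below models the current root and the cached previous roots with one struct and looks up a PGD by swapping entries. It is a simplified model, not the kernel code: struct root_info, struct mmu, NUM_PREV_ROOTS and find_cached_root() are hypothetical names, and the real lookup also compares the MMU role.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PREV_ROOTS 3	/* hypothetical cache size */

/* Same idea as struct kvm_mmu_root_info: a guest PGD plus its host root. */
struct root_info {
	uint64_t pgd;
	uint64_t hpa;
};

struct mmu {
	struct root_info root;				/* current root */
	struct root_info prev_roots[NUM_PREV_ROOTS];	/* most recent first */
};

static void swap_roots(struct root_info *a, struct root_info *b)
{
	struct root_info tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Search the cache for new_pgd. Each swap pushes the old current root one
 * slot further down, so on a hit the cache stays ordered by recency. On a
 * miss the least recently used entry ends up in mmu->root, ready to be
 * replaced by the caller.
 */
static bool find_cached_root(struct mmu *mmu, uint64_t new_pgd)
{
	assert(mmu);

	if (mmu->root.pgd == new_pgd)
		return true;

	for (int i = 0; i < NUM_PREV_ROOTS; i++) {
		swap_roots(&mmu->root, &mmu->prev_roots[i]);
		if (mmu->root.pgd == new_pgd)
			return true;
	}
	return false;
}
```

Because mmu->root and mmu->prev_roots[] share a type, the swap needs no temporary conversion between separate hpa/pgd fields.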
Paolo Bonzini [Wed, 9 Feb 2022 00:08:33 +0000 (19:08 -0500)]
KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs
WARN and bail if KVM attempts to free a root that isn't backed by a shadow
page. KVM allocates a bare page for "special" roots, e.g. when using PAE
paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa
will be valid but won't be backed by a shadow page. It's all too easy to
blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of
crashing KVM and possibly the kernel.
Paolo Bonzini [Wed, 9 Feb 2022 10:17:38 +0000 (05:17 -0500)]
KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
Enabling async page faults is nonsensical if paging is disabled, but
it is allowed because CR0.PG=0 does not clear the async page fault
MSR. Just ignore them and only use the artificial halt state,
similar to what happens in guest mode if async #PF vmexits are disabled.
Given the increasingly complex logic, and the nicer code if the new
"if" is placed last, opportunistically change the "||" into a chain
of "if (...) return false" statements.
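The resulting control flow can be pictured roughly as follows. This is a standalone sketch with hypothetical field names (struct vcpu_state and its members are not the kernel's), meant only to show how a chain of early returns keeps the new CR0.PG check last and readable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened view of the state the decision depends on. */
struct vcpu_state {
	bool apf_enabled;	/* guest enabled async #PF through the MSR */
	bool send_user_only;	/* only deliver while the guest is in user mode */
	bool guest_cpl0;	/* guest currently runs at CPL 0 */
	bool in_guest_mode;	/* an L2 guest is running */
	bool apf_as_vmexit;	/* async #PF is delivered to L1 as a vmexit */
	bool paging_enabled;	/* CR0.PG */
};

static bool can_deliver_async_pf(const struct vcpu_state *v)
{
	assert(v);

	if (!v->apf_enabled)
		return false;

	if (v->send_user_only && v->guest_cpl0)
		return false;

	if (v->in_guest_mode)
		return v->apf_as_vmexit;

	/* New check: without paging, fall back to the artificial halt state. */
	return v->paging_enabled;
}
```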
Paolo Bonzini [Wed, 9 Feb 2022 09:56:05 +0000 (04:56 -0500)]
KVM: x86: Reinitialize context if host userspace toggles EFER.LME
While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and
therefore EFER.NX is the only bit that can affect the MMU role. However,
set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1.
In that case, the MMU has to be reset.
Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: [email protected]
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
David Dunn [Wed, 23 Feb 2022 22:57:41 +0000 (22:57 +0000)]
KVM: x86: Provide per VM capability for disabling PMU virtualization
Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of
settings/features to allow userspace to configure PMU virtualization on
a per-VM basis. For now, support a single flag, KVM_PMU_CAP_DISABLE,
to allow disabling PMU virtualization for a VM even when KVM is configured
with enable_pmu=true at the module level.
To keep KVM simple, disallow changing VM's PMU configuration after vCPUs
have been created.
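The validation described above might look like the sketch below. It is not the kernel implementation: struct vm, vm_set_pmu_capability(), the bit position of the disable flag and the choice of -EINVAL for both failures are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PMU_CAP_DISABLE		(1ULL << 0)	/* assumed bit position */
#define PMU_CAP_VALID_MASK	PMU_CAP_DISABLE

/* Hypothetical per-VM state. */
struct vm {
	unsigned int created_vcpus;
	bool enable_pmu;
};

static int vm_set_pmu_capability(struct vm *vm, uint64_t flags)
{
	assert(vm);

	/* Reject bits that no known setting corresponds to. */
	if (flags & ~PMU_CAP_VALID_MASK)
		return -EINVAL;

	/* The PMU configuration is frozen once the first vCPU exists. */
	if (vm->created_vcpus)
		return -EINVAL;

	vm->enable_pmu = !(flags & PMU_CAP_DISABLE);
	return 0;
}
```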
Cast kvm_x86_ops.func to 'void *' when updating KVM static calls that are
conditionally patched to __static_call_return0(). clang complains about
using mismatching pointers in the ternary operator, which breaks the
build when compiling with CONFIG_KVM_WERROR=y.
>> arch/x86/include/asm/kvm-x86-ops.h:82:1: warning: pointer type mismatch
('bool (*)(struct kvm_vcpu *)' and 'void *') [-Wpointer-type-mismatch]
Vipin Sharma [Tue, 22 Feb 2022 05:48:48 +0000 (05:48 +0000)]
KVM: Move VM's worker kthreads back to the original cgroup before exiting.
VM worker kthreads can linger in the VM process's cgroup for some time
after KVM terminates the VM process.
KVM terminates the worker kthreads by calling kthread_stop() which waits
on the 'exited' completion, triggered by exit_mm(), via mm_release(), in
do_exit() during the kthread's exit. However, these kthreads are
removed from the cgroup using the cgroup_exit() which happens after the
exit_mm(). Therefore, a VM process can terminate in between the
exit_mm() and cgroup_exit() calls, leaving only worker kthreads in the
cgroup.
Moving worker kthreads back to the original cgroup (kthreadd_task's
cgroup) makes sure that the cgroup is empty as soon as the main VM
process is terminated.
Remove a redundant 'cpu' declaration from inside an if-statement that
shadows an identical declaration at function scope. Both variables
are used as scratch variables in for_each_*_cpu() loops, thus there's no
harm in sharing a variable.
Peng Hao [Tue, 22 Feb 2022 10:40:29 +0000 (18:40 +0800)]
kvm: vmx: Fix typos comment in __loaded_vmcs_clear()
Fix a comment documenting the memory barrier related to clearing a
loaded_vmcs; loaded_vmcs tracks the host CPU the VMCS is loaded on via
the field 'cpu', it doesn't have a 'vcpu' field.
Peng Hao [Tue, 22 Feb 2022 10:40:54 +0000 (18:40 +0800)]
KVM: nVMX: Make setup/unsetup under the same conditions
Make sure nested_vmx_hardware_setup/unsetup() are called in pairs under
the same conditions. Calling nested_vmx_hardware_unsetup() when nested
is false "works" right now because it only calls free_page() on zero-
initialized pointers, but it's possible that more code will be added to
nested_vmx_hardware_unsetup() in the future.
Mark Brown [Wed, 23 Feb 2022 13:16:24 +0000 (13:16 +0000)]
KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3
The arch_timer and vgic_irq kselftests assume that they can create a
vgic-v3, using the library function vgic_v3_setup() which aborts with a
test failure if it is not possible to do so. Since vgic-v3 can only be
instantiated on systems where the host has GICv3 this leads to false
positives on older systems where that is not the case.
Fix this by changing vgic_v3_setup() to return an error if the vgic can't
be instantiated and have the callers skip if this happens. We could also
exit flagging a skip in vgic_v3_setup() but this would prevent future test
cases conditionally deciding which GIC to use or generally doing more
complex output.
Vitaly Kuznetsov [Tue, 22 Feb 2022 15:46:42 +0000 (16:46 +0100)]
KVM: x86: hyper-v: HVCALL_SEND_IPI_EX is an XMM fast hypercall
It has been shown in practice that at least Windows Server 2019 tries
using HVCALL_SEND_IPI_EX in 'XMM fast' mode when it has more than 64 vCPUs
and it needs to send an IPI to a vCPU > 63. Similarly to other XMM Fast
hypercalls (HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}{,_EX}), this
information is missing in TLFS as of 6.0b. Currently, KVM returns an error
(HV_STATUS_INVALID_HYPERCALL_INPUT) and Windows crashes.
Note, HVCALL_SEND_IPI is a 'standard' fast hypercall (not 'XMM fast') as
all its parameters fit into RDX:R8 and this is handled by KVM correctly.
Vitaly Kuznetsov [Tue, 22 Feb 2022 15:46:41 +0000 (16:46 +0100)]
KVM: x86: hyper-v: Fix the maximum number of sparse banks for XMM fast TLB flush hypercalls
When TLB flush hypercalls (HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX) are
issued in 'XMM fast' mode, the maximum number of allowed sparse_banks is
not 'HV_HYPERCALL_MAX_XMM_REGISTERS - 1' (5) but twice as many (10), as each
XMM register is 128 bits long and can hold two 64-bit banks.
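The arithmetic can be spelled out in a few lines. The constants below are local stand-ins: MAX_XMM_REGISTERS is set to 6 because the text gives 'HV_HYPERCALL_MAX_XMM_REGISTERS - 1' as 5, and the "- 1" is simply carried over from that expression.

```c
#include <assert.h>

#define MAX_XMM_REGISTERS	6	/* stand-in; the text gives "- 1" as 5 */
#define XMM_REG_BITS		128
#define SPARSE_BANK_BITS	64

/* Number of 64-bit sparse banks that fit in the usable XMM registers. */
static unsigned int max_xmm_sparse_banks(void)
{
	/* A register must hold a whole number of banks for this to be exact. */
	assert(XMM_REG_BITS % SPARSE_BANK_BITS == 0);

	return (MAX_XMM_REGISTERS - 1) * (XMM_REG_BITS / SPARSE_BANK_BITS);
}
```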
Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"
Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
out, as the kernel will switch CR3 during __text_poke(), e.g. in response
to a static key toggling. If switch_mm_irqs_off() chooses a new ASID for
the mm associated with KVM, KVM will do VM-Enter => VM-Exit with a stale
vmcs.HOST_CR3.
Add a comment to explain why KVM must wait until VM-Enter is imminent to
refresh vmcs.HOST_CR3.
The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
loaded for the KVM-associated mm while KVM has a "running" vCPU:
Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"
Undo a nested VMX fix as a step toward reverting the commit it fixed, 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
as the underlying premise that "host CR3 in the vcpu thread can only be
changed when scheduling" is wrong.
Maxim Levitsky [Wed, 23 Feb 2022 11:56:49 +0000 (13:56 +0200)]
KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non default value when tsc scaling disabled
If nested tsc scaling is disabled, MSR_AMD64_TSC_RATIO should
never have a non-default value.
Due to the way nested tsc scaling support was implemented in qemu,
it would set this msr to 0 when nested tsc scaling was disabled.
Ignore that value for now, as it causes no harm.
Liang Zhang [Tue, 22 Feb 2022 03:12:39 +0000 (11:12 +0800)]
KVM: x86/mmu: make apf token non-zero to fix bug
In the current async page fault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the guest. This function tests the token value of struct
kvm_vcpu_pv_apf_data, which the guest kernel must reset to zero once it has
finished handling a READY event. A zero value therefore means that the
previous READY event is done and KVM can deliver another one.
However, kvm_arch_setup_async_pf() may produce a valid token whose value is
zero, which is indistinguishable from the "done" state described above and
may lead to the loss of this READY event.
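One way to guarantee a non-zero token is sketched below. The layout (a sequence number in the upper bits, the vCPU id in the low 12 bits) and the names struct apf_state and next_apf_token() are assumptions for illustration; the point is only that the sequence number skips any value whose shifted form would be zero.

```c
#include <assert.h>
#include <stdint.h>

#define TOKEN_VCPU_BITS	12	/* assumed: low bits carry the vCPU id */

/* Hypothetical per-vCPU async page fault state. */
struct apf_state {
	uint32_t id;		/* running sequence number */
	uint32_t vcpu_id;
};

static uint32_t next_apf_token(struct apf_state *s)
{
	assert(s->vcpu_id < (1u << TOKEN_VCPU_BITS));

	/*
	 * If the shifted sequence number would be zero (initial state or
	 * wrap-around), restart at 1 so that vCPU 0 never yields token 0.
	 */
	if ((uint32_t)(s->id << TOKEN_VCPU_BITS) == 0)
		s->id = 1;

	return (uint32_t)(s->id++ << TOKEN_VCPU_BITS) | s->vcpu_id;
}
```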
Paolo Bonzini [Tue, 22 Feb 2022 14:07:16 +0000 (09:07 -0500)]
Merge branch 'kvm-ppc-cap-210' into kvm-master
By request of Nick Piggin:
> Patch 3 requires a KVM_CAP_PPC number allocated. QEMU maintainers are
> happy with it (link in changelog) just waiting on KVM upstreaming. Do
> you have objections to the series going to ppc/kvm tree first, or
> another option is you could take patch 3 alone first (it's relatively
> independent of the other 2) and ppc/kvm gets it from you?
Nicholas Piggin [Tue, 22 Feb 2022 14:06:54 +0000 (09:06 -0500)]
KVM: PPC: reserve capability 210 for KVM_CAP_PPC_AIL_MODE_3
Add KVM_CAP_PPC_AIL_MODE_3 to advertise the capability to set the AIL
resource mode to 3 with the H_SET_MODE hypercall. This capability
differs between processor types and KVM types (PR, HV, Nested HV), and
affects guest-visible behaviour.
QEMU will implement a cap-ail-mode-3 to control this behaviour[1], and
use the KVM CAP if available to determine KVM support[2].
Will Deacon [Mon, 21 Feb 2022 15:35:24 +0000 (15:35 +0000)]
KVM: arm64: Indicate SYSTEM_RESET2 in kvm_run::system_event flags field
When handling reset and power-off PSCI calls from the guest, we
initialise X0 to PSCI_RET_INTERNAL_FAILURE in case the VMM tries to
re-run the vCPU after issuing the call.
Unfortunately, this also means that the VMM cannot see which PSCI call
was issued and therefore cannot distinguish between PSCI SYSTEM_RESET
and SYSTEM_RESET2 calls, which is necessary in order to determine the
validity of the "reset_type" in X1.
Allocate bit 0 of the previously unused 'flags' field of the
system_event structure so that we can indicate the PSCI call used to
initiate the reset.
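From the VMM's point of view, telling the two calls apart could then look like this sketch. The struct, the event type value and the flag name are local stand-ins rather than the uapi definitions; only "bit 0 of flags marks SYSTEM_RESET2" comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define SYSTEM_EVENT_RESET	2u		/* hypothetical event type value */
#define RESET_FLAG_PSCI_RESET2	(1ULL << 0)	/* bit 0, as allocated above */

/* Stand-in for the system_event part of struct kvm_run. */
struct system_event {
	uint32_t type;
	uint64_t flags;
};

enum reset_kind { NOT_A_RESET, PSCI_SYSTEM_RESET, PSCI_SYSTEM_RESET2 };

static enum reset_kind classify_reset(const struct system_event *ev)
{
	assert(ev);

	if (ev->type != SYSTEM_EVENT_RESET)
		return NOT_A_RESET;

	/* Only for SYSTEM_RESET2 is the "reset_type" in X1 meaningful. */
	if (ev->flags & RESET_FLAG_PSCI_RESET2)
		return PSCI_SYSTEM_RESET2;

	return PSCI_SYSTEM_RESET;
}
```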
Will Deacon [Mon, 21 Feb 2022 15:35:23 +0000 (15:35 +0000)]
KVM: arm64: Expose PSCI SYSTEM_RESET2 call to the guest
PSCI v1.1 introduces the optional SYSTEM_RESET2 call, which allows the
caller to provide a vendor-specific "reset type" and "cookie" to request
a particular form of reset or shutdown.
Expose this call to the guest and handle it in the same way as PSCI
SYSTEM_RESET, along with some basic range checking on the type argument.
Remove mmu_audit.c and all its collateral. The auditing code has suffered
severe bitrot, ironically partly due to shadow paging being more stable
and thus not benefiting as much from auditing, but mostly due to TDP
supplanting shadow paging for non-nested guests and shadowing of nested
TDP not heavily stressing the logic that is being audited.
Paolo Bonzini [Tue, 15 Feb 2022 18:07:10 +0000 (13:07 -0500)]
KVM: x86: allow defining return-0 static calls
A few vendor callbacks are only used by VMX, but they return an integer
or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.
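The idea can be modelled in plain C without static calls. In this sketch, struct x86_ops, ops_update() and the callbacks are hypothetical, and the stub has the same signature as the callback it replaces, whereas the kernel patches in one generic __static_call_return0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vcpu {
	int id;
};

/* Hypothetical ops table with one mandatory and one "optional, ret0" hook. */
struct x86_ops {
	bool (*mandatory_op)(struct vcpu *vcpu);
	bool (*optional_ret0_op)(struct vcpu *vcpu);	/* may be NULL */
};

/* Stand-in for the return-0 stub: ignores its argument, returns 0/false. */
static bool return0_stub(struct vcpu *vcpu)
{
	(void)vcpu;
	return false;
}

/* Example vendor callback. */
static bool vendor_always_true(struct vcpu *vcpu)
{
	(void)vcpu;
	return true;
}

/* Copy the vendor ops, replacing a missing optional hook with the stub. */
static void ops_update(struct x86_ops *active, const struct x86_ops *vendor)
{
	assert(vendor->mandatory_op);	/* mandatory hooks must exist */

	active->mandatory_op = vendor->mandatory_op;
	active->optional_ret0_op = vendor->optional_ret0_op ?
				   vendor->optional_ret0_op : return0_stub;
}
```

Callers can then invoke optional_ret0_op unconditionally and treat a 0 result as "not implemented by this vendor".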
Paolo Bonzini [Tue, 8 Feb 2022 18:08:19 +0000 (13:08 -0500)]
KVM: x86: make several APIC virtualization callbacks optional
All their invocations are conditional on vcpu->arch.apicv_active,
meaning that they need not be implemented by vendor code: even
though at the moment both vendors implement APIC virtualization,
all of them can be optional. In fact SVM does not need many of
them, and their implementation can be deleted now.
Paolo Bonzini [Thu, 9 Dec 2021 13:12:28 +0000 (08:12 -0500)]
KVM: x86: remove KVM_X86_OP_NULL and mark optional kvm_x86_ops
The original use of KVM_X86_OP_NULL, which was to mark calls
that do not follow a specific naming convention, is not in use
anymore. Instead, let's mark calls that are optional because
they are always invoked within conditionals or with static_call_cond.
Those that are _not_, i.e. those that are defined with KVM_X86_OP,
must be defined by both vendor modules or some kind of NULL pointer
dereference is bound to happen at runtime.
Paolo Bonzini [Tue, 15 Feb 2022 18:16:36 +0000 (13:16 -0500)]
KVM: x86: return 1 unconditionally for availability of KVM_CAP_VAPIC
The two ioctls used to implement userspace-accelerated TPR,
KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available
even if hardware-accelerated TPR can be used. So there is
no reason not to report KVM_CAP_VAPIC.
Paolo Bonzini [Fri, 18 Feb 2022 10:07:09 +0000 (05:07 -0500)]
selftests: KVM: allow sev_migrate_tests on machines without SEV-ES
I managed to get hold of a machine that has SEV but not SEV-ES, and
sev_migrate_tests fails because sev_vm_create(true) returns ENOTTY.
Fix this, and while at it also return KSFT_SKIP on machines that do
not have SEV at all, instead of returning 0.
Peter Gonda [Fri, 11 Feb 2022 19:36:34 +0000 (11:36 -0800)]
KVM: SEV: Allow SEV intra-host migration of VM with mirrors
For SEV-ES VMs with mirrors to be intra-host migrated they need to be
able to migrate with the mirror. This is due to the fact that all VMSAs
need to be added into the VM with LAUNCH_UPDATE_VMSA before
LAUNCH_FINISH. Allowing migration with mirrors allows users of SEV-ES to
keep the mirror VMs' VMSAs during migration.
Add a list of mirror VMs for the original VM to iterate through during its
migration. During the iteration the owner pointers can be updated from
the source to the destination. This fixes the ASID leaking issue which
caused the blocking of migration of VMs with mirrors.
Wanpeng Li [Fri, 18 Feb 2022 08:10:38 +0000 (00:10 -0800)]
x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU
Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if
only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable
pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.
Leonardo Bras [Fri, 18 Feb 2022 03:41:00 +0000 (00:41 -0300)]
x86/kvm: Fix compilation warning in non-x86_64 builds
On non-x86_64 builds, helpers gtod_is_based_on_tsc() and
kvm_guest_supported_xfd() are defined but never used. Because these are
static inline but are in a .c file, some compilers do warn for them with
-Wunused-function, which becomes an error if -Werror is present.
Add #ifdef so they are only defined in x86_64 builds.
kvm_vcpu_arch currently contains the guest supported features in both
guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field.
Currently both fields are set to the same value in
kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that.
Since it's not good to keep duplicated data, remove guest_supported_xcr0.
To keep the code more readable, introduce kvm_guest_supported_xcr()
and kvm_guest_supported_xfd() to replace the previous usages of
guest_supported_xcr0.
Leonardo Bras [Thu, 17 Feb 2022 05:30:29 +0000 (02:30 -0300)]
x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel
swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().
When xsave feature is available, the fpu swap is done by:
- xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used
to store the current state of the fpu registers to a buffer.
- xrstor(s) instruction, with (fpu_kernel_cfg.max_features &
XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.
For xsave(s) the mask is used to limit what parts of the fpu regs will
be copied to the buffer. Likewise on xrstor(s), the mask is used to
limit what parts of the fpu regs will be changed.
The mask for xsave(s), the guest's fpstate->xfeatures, is defined on
kvm_arch_vcpu_create(), which (in summary) sets it to all features
supported by the cpu which are enabled on kernel config.
This means that xsave(s) will save to guest buffer all the fpu regs
contents the cpu has enabled when the guest is paused, even if they
are not used.
This would not be an issue, if xrstor(s) would also do that.
xrstor(s)'s mask for host/guest swap is basically every valid feature
contained in kernel config, except XFEATURE_MASK_PKRU.
According to the kernel source, it is instead switched in switch_to() and
flush_thread().
Then, the following happens when a host supporting PKRU starts a
guest that does not support it:
1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
4 - guest runs, then switch back to host,
5 - xsave(s) fpu regs to guest fpstate (buffer now has XFEATURE_MASK_PKRU)
6 - xrstor(s) host fpstate to fpu regs.
7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with
XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)
On 5, even though the guest does not support PKRU, it does have the flag
set on guest fpstate, which is transferred to userspace via vcpu ioctl
KVM_GET_XSAVE.
This becomes a problem when the user decides to migrate the above guest
to another machine that does not support PKRU: the new host restores the
guest's fpu regs to what they were before (xrstor(s)), but since the new
host doesn't support PKRU, a general-protection exception occurs in xrstor(s)
and that crashes the guest.
This can be solved by making the guest's fpstate->user_xfeatures hold
a copy of guest_supported_xcr0. This way, on 7 the only flags copied to
userspace will be the ones compatible to guest requirements, and thus
there will be no issue during migration.
As a bonus, it will also fail if userspace tries to set fpu features
(with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
configuration. Such features will never be returned by KVM_GET_XSAVE
or KVM_GET_XSAVE2.
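The effect of limiting user_xfeatures can be modelled with two small mask operations. The helper names below are hypothetical and the real copy routines work on full xsave buffers; the feature bit positions (x87 = 0, SSE = 1, PKRU = 9) are the architectural ones.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define XFEATURE_MASK_FP	(1ULL << 0)
#define XFEATURE_MASK_SSE	(1ULL << 1)
#define XFEATURE_MASK_PKRU	(1ULL << 9)

/* Features reported to userspace by a KVM_GET_XSAVE-like copy. */
static uint64_t xsave_user_features(uint64_t saved, uint64_t user_xfeatures)
{
	uint64_t out = saved & user_xfeatures;

	/* Nothing outside the guest's supported set may leak out. */
	assert((out & ~user_xfeatures) == 0);
	return out;
}

/* A KVM_SET_XSAVE-like check: refuse features the guest cannot have. */
static int xsave_check_set(uint64_t requested, uint64_t user_xfeatures)
{
	return (requested & ~user_xfeatures) ? -EINVAL : 0;
}
```

With user_xfeatures set to FP | SSE, a buffer that picked up PKRU in step 5 is reported without it, and an attempt to set PKRU is refused.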
Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures,
there is no need to set it in kvm_check_cpuid(). So, change
fpstate_realloc() so it does not touch fpstate->user_xfeatures if a
non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid()
calls it.
Anton Romanov [Wed, 16 Feb 2022 18:26:54 +0000 (18:26 +0000)]
kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode
If a vcpu has tsc_always_catchup set, each request updates pvclock data.
KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on a tsc read on
the host's side and do the hypercall inside a pvclock_read_retry loop, leading
to an infinite loop in such a situation.
lockdep_is_held() can return -1 when lockdep is disabled which triggers
this warning. Let's use lockdep_assert_not_held() which can detect
incorrect calls while holding a lock and it also avoids false negatives
when lockdep is disabled.
Aaron Lewis [Mon, 14 Feb 2022 21:29:51 +0000 (21:29 +0000)]
KVM: x86: Add KVM_CAP_ENABLE_CAP to x86
Follow the precedent set by other architectures that support the VCPU
ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP.
This way, userspace can ensure that KVM_ENABLE_CAP is available on a
vcpu before using it.
Oliver Upton [Thu, 17 Feb 2022 10:12:42 +0000 (10:12 +0000)]
KVM: arm64: Don't miss pending interrupts for suspended vCPU
In order to properly emulate the WFI instruction, KVM reads back
ICH_VMCR_EL2 and enables doorbells for GICv4. These preparations are
necessary in order to recognize pending interrupts in
kvm_arch_vcpu_runnable() and return to the guest. Until recently, this
work was done by kvm_arch_vcpu_{blocking,unblocking}(). Since commit 6109c5a6ab7f ("KVM: arm64: Move vGIC v4 handling for WFI out arch
callback hook"), these callbacks were gutted and superseded by
kvm_vcpu_wfi().
It is important to note that KVM implements PSCI CPU_SUSPEND calls as
a WFI within the guest. However, the implementation calls directly into
kvm_vcpu_halt(), which skips the needed work done in kvm_vcpu_wfi()
to detect pending interrupts. Fix the issue by calling the WFI helper.
Thomas Huth [Tue, 15 Feb 2022 07:48:24 +0000 (08:48 +0100)]
selftests: kvm: Check whether SIDA memop fails for normal guests
Commit 2c212e1baedc ("KVM: s390: Return error on SIDA memop on normal
guest") fixed the behavior of the SIDA memops for normal guests. It
would be nice to have a way to test whether the current kernel has
the fix applied or not. Thus add a check to the KVM selftests for
these two memops.
KVM: s390: Update api documentation for memop ioctl
Document all currently existing operations, flags and explain under
which circumstances they are available. Document the recently
introduced absolute operations and the storage key protection flag,
as well as the existing SIDA operations.
KVM: s390: Add capability for storage key extension of MEM_OP IOCTL
Availability of the KVM_CAP_S390_MEM_OP_EXTENSION capability signals that:
* The vcpu MEM_OP IOCTL supports storage key checking.
* The vm MEM_OP IOCTL exists.
KVM: s390: Add vm IOCTL for key checked guest absolute memory access
Channel I/O honors storage keys and is performed on absolute memory.
For I/O emulation user space therefore needs to be able to do key
checked accesses.
The vm IOCTL supports read/write accesses, as well as checking
if an access would succeed.
Unlike relying on KVM_S390_GET_SKEYS for key checking would,
the vm IOCTL performs the check in lockstep with the read or write,
by, ultimately, mapping the access to move instructions that
support key protection checking with a supplied key.
Fetch and storage protection override are not applicable to absolute
accesses and so are not applied as they are when using the vcpu memop.
KVM: s390: Add optional storage key checking to MEMOP IOCTL
User space needs a mechanism to perform key checked accesses when
emulating instructions.
The key can be passed as an additional argument.
Having an additional argument is flexible, as user space can
pass the guest PSW's key, in order to make an access the same way the
CPU would, or pass another key if necessary.
KVM: s390: selftests: Test TEST PROTECTION emulation
Test the emulation of TEST PROTECTION in the presence of storage keys.
Emulation only occurs under certain conditions, one of which is the host
page being protected.
Trigger this by protecting the test pages via mprotect.
Use the access key operand to check for key protection when
translating guest addresses.
Since the translation code checks for accessing exceptions/error hvas,
we can remove the check here and simplify the control flow.
Keep checking if the memory is read-only even if such memslots are
currently not supported.
handle_tprot was the last user of guest_translate_address,
so remove it.
KVM: s390: Honor storage keys when accessing guest memory
Storage key checking had not been implemented for instructions emulated
by KVM. Implement it by enhancing the functions used for guest access,
in particular those making use of access_guest which has been renamed
to access_guest_with_key.
Accesses via access_guest_real should not be key checked.
For actual accesses, key checking is done by
copy_from/to_user_key (which internally uses MVCOS/MVCP/MVCS).
In cases where accessibility is checked without an actual access,
this is performed by getting the storage key and checking if the access
key matches. In both cases, if applicable, storage and fetch protection
override are honored.
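For the "check without an actual access" case, the basic key comparison can be sketched as below. This models only plain key-controlled protection; storage and fetch protection override are deliberately left out, and the function and macro names are hypothetical. The storage key byte is assumed to hold the access-control bits in 0xf0 and the fetch-protection bit in 0x08.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKEY_ACC_MASK	0xf0u	/* access-control bits of the storage key */
#define SKEY_FP_BIT	0x08u	/* fetch-protection bit */

enum access_mode { ACCESS_FETCH, ACCESS_STORE };

/* Simplified check; protection overrides are not modelled here. */
static bool key_allows_access(uint8_t storage_key, uint8_t access_key,
			      enum access_mode mode)
{
	assert(access_key < 16);

	/* Access key 0 matches any storage key. */
	if (access_key == 0)
		return true;

	/* Matching keys permit both fetches and stores. */
	if (access_key == ((storage_key & SKEY_ACC_MASK) >> 4))
		return true;

	/* On a mismatch, fetches still succeed without fetch protection. */
	if (mode == ACCESS_FETCH && !(storage_key & SKEY_FP_BIT))
		return true;

	return false;
}
```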
Add copy_from/to_user_key functions, which perform storage key checking.
These functions can be used by KVM for emulating instructions that need
to be key checked.
These functions differ from their non _key counterparts in
include/linux/uaccess.h only in the additional key argument and must be
kept in sync with those.
Since the existing uaccess implementation on s390 makes use of move
instructions that support having an additional access key supplied,
we can implement raw_copy_from/to_user_key by enhancing the
existing implementation.
KVM: SVM: Rename AVIC helpers to use "avic" prefix instead of "svm"
Use "avic" instead of "svm" for all of SVM's APICv hooks and make a few
additional function name tweaks so that the AVIC functions conform to
their associated kvm_x86_ops hooks.
Jim Mattson [Thu, 3 Feb 2022 01:48:13 +0000 (17:48 -0800)]
KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't mask off the high nybble when configuring a
RAW perf event.
Jim Mattson [Thu, 3 Feb 2022 01:48:12 +0000 (17:48 -0800)]
KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't drop the high nybble when setting up the
config field of a perf_event_attr structure for a call to
perf_event_create_kernel_counter().
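To see what gets lost, the sketch below assembles the 12-bit event select from a 64-bit PerfEvtSeln value and compares it with the result after truncating the value to 32 bits. The macro and function names are local to this example; the low byte in bits 7:0 is standard for the event select, and the high nybble position is the one given above.

```c
#include <assert.h>
#include <stdint.h>

#define EVENTSEL_EVENT_LOW	0x000000FFULL	/* event select bits 7:0 */
#define EVENTSEL_EVENT_HIGH	(0x0FULL << 32)	/* event select bits 35:32 */

/* Assemble the 12-bit event select from a full 64-bit PerfEvtSeln value. */
static uint16_t amd_event_select(uint64_t evtsel)
{
	/* Shifting by 24 moves bits 35:32 down to bits 11:8. */
	assert((EVENTSEL_EVENT_HIGH >> 24) == 0xF00);

	return (uint16_t)((evtsel & EVENTSEL_EVENT_LOW) |
			  ((evtsel & EVENTSEL_EVENT_HIGH) >> 24));
}

/* What is left when the MSR value is first truncated to 32 bits. */
static uint16_t amd_event_select_truncated(uint64_t evtsel)
{
	return amd_event_select((uint32_t)evtsel);
}
```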
Maxim Levitsky [Tue, 8 Feb 2022 11:48:42 +0000 (06:48 -0500)]
KVM: SVM: fix race between interrupt delivery and AVIC inhibition
If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.
To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC. If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.
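The ordering argument can be modelled with C11 atomics. This is a simplified sketch, not the kernel code: the types and function names are hypothetical, and it assumes the target publishes its apicv_active value before a store-release of its mode, with the sender doing a load-acquire of the mode before reading apicv_active.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };
enum delivery { RING_DOORBELL, WAKE_UP, REQUEST_EVENT };

/* Hypothetical slice of the target vCPU's state. */
struct target_vcpu {
	atomic_int mode;
	atomic_bool apicv_active;
};

static void vcpu_init(struct target_vcpu *v, bool apicv_active)
{
	atomic_init(&v->mode, OUTSIDE_GUEST_MODE);
	atomic_init(&v->apicv_active, apicv_active);
}

/* Target side: publish the AVIC state, then enter guest mode (release). */
static void vcpu_enter_guest(struct target_vcpu *v, bool apicv_active)
{
	atomic_store_explicit(&v->apicv_active, apicv_active,
			      memory_order_relaxed);
	atomic_store_explicit(&v->mode, IN_GUEST_MODE, memory_order_release);
}

/* Sender side: the acquire load orders the apicv_active read after it. */
static enum delivery complete_interrupt_delivery(struct target_vcpu *v)
{
	assert(v);

	bool in_guest = atomic_load_explicit(&v->mode, memory_order_acquire) ==
			IN_GUEST_MODE;

	/* AVIC was disabled in the meantime: use request-based delivery. */
	if (!atomic_load_explicit(&v->apicv_active, memory_order_relaxed))
		return REQUEST_EVENT;

	return in_guest ? RING_DOORBELL : WAKE_UP;
}
```

If the sender observes IN_GUEST_MODE, the acquire/release pairing guarantees it also observes the apicv_active value written before the target entered the guest.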
Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
in fact it can be handled in exactly the same way; the only difference
lies in who has set IRR, whether svm_deliver_interrupt or the processor.
Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
IPI vmexits as well.
Maxim Levitsky [Tue, 8 Feb 2022 11:45:16 +0000 (06:45 -0500)]
KVM: SVM: extract avic_ring_doorbell
The check on the current CPU adds an extra level of indentation to
svm_deliver_avic_intr and conflates documentation on what happens
if the vCPU exits (of interest to svm_deliver_avic_intr) and migrates
(only of interest to avic_ring_doorbell, which calls get/put_cpu()).
Extract the wrmsr to a separate function and rewrite the
comment in svm_deliver_avic_intr().
There is no vmx_pi_mmio_test file. Remove it to get rid of an error during
creation of the selftest archive:
rsync: [sender] link_stat "/kselftest/kvm/x86_64/vmx_pi_mmio_test" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Marc Zyngier [Thu, 3 Feb 2022 09:24:45 +0000 (09:24 +0000)]
KVM: arm64: vgic: Read HW interrupt pending state from the HW
It appears that a read access to GIC[DR]_I[CS]PENDRn doesn't always
result in the pending interrupts being accurately reported if they are
mapped to a HW interrupt. This is particularly visible when acking
the timer interrupt and reading the GICR_ISPENDR1 register immediately
after, for example (the interrupt appears as not-pending while it really
is...).
This is because a HW interrupt has its 'active and pending state' kept
in the *physical* distributor, and not in the virtual one, as mandated
by the spec (this is what allows the direct deactivation). The virtual
distributor only carries the pending and active *states* (note the
plural, as these are two independent and non-overlapping states).
Fix it by reading the HW state back, either from the timer itself or
from the distributor if necessary.
Oliver Upton [Fri, 4 Feb 2022 20:47:05 +0000 (20:47 +0000)]
KVM: VMX: Use local pointer to vcpu_vmx in vmx_vcpu_after_set_cpuid()
There is a local that contains a pointer to vcpu_vmx already. Use that
to get at the structure directly instead of doing pointer arithmetic.
Introduce a new test for Hyper-V nSVM extensions (Hyper-V on KVM) and add
a test for enlightened MSR-Bitmap feature:
- Intercept access to MSR_FS_BASE in L1 and check that this works
with enlightened MSR-Bitmap disabled.
- Enable enlightened MSR-Bitmap and check that the intercept still works
as expected.
- Intercept access to MSR_GS_BASE but don't clear the corresponding bit
from clean fields mask; KVM is supposed to skip updating MSR-Bitmap02 and
thus the subsequent access to the MSR from L2 will not get intercepted.
- Finally, clear the corresponding bit from clean fields mask and check
that access to MSR_GS_BASE is now intercepted.
The test relies on the assumption that access to MSR_FS_BASE/MSR_GS_BASE
is not intercepted for L1. If this ever stops being true, the test will
fail as nested_svm_exit_handled_msr() always checks L1's MSR-Bitmap for
L2 irrespective of clean fields. The behavior is correct as the enlightened
MSR-Bitmap feature is just an optimization; KVM is not obliged to ignore
updates when the corresponding bit in clean fields stays clear.
KVM: selftests: nSVM: Set up MSR-Bitmap for SVM guests
Similar to VMX, allocate memory for MSR-Bitmap and fill in 'msrpm_base_pa'
in VMCB. To use it, tests will need to set INTERCEPT_MSR_PROT interception
along with the required bits in the MSR-Bitmap.
Introduce a test for enlightened MSR-Bitmap feature (Hyper-V on KVM):
- Intercept access to MSR_FS_BASE in L1 and check that this works
with enlightened MSR-Bitmap disabled.
- Enable enlightened MSR-Bitmap and check that the intercept still works
as expected.
- Intercept access to MSR_GS_BASE but don't clear the corresponding bit
from 'hv_clean_fields'; KVM is supposed to skip updating MSR-Bitmap02 and
thus the subsequent access to the MSR from L2 will not get intercepted.
- Finally, clear the corresponding bit from 'hv_clean_fields' and check
that access to MSR_GS_BASE is now intercepted.
The test relies on the assumption that access to MSR_FS_BASE/MSR_GS_BASE
is not intercepted for L1. If this ever stops being true, the test will
fail as nested_vmx_exit_handled_msr() always checks L1's MSR-Bitmap for
L2 irrespective of 'hv_clean_fields'. The behavior is correct as the
enlightened MSR-Bitmap feature is just an optimization; KVM is not obliged
to ignore updates when the corresponding bit in 'hv_clean_fields' stays
clear.
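
The four test steps above can be modelled with a toy sketch. It is not
selftest or KVM code: the structures, the TOY_CLEAN_MSR_BITMAP bit
position and the function names are all assumptions made for
illustration. A set clean bit is taken to mean "L1 claims the bitmap is
unchanged", which lets L0 skip rebuilding its merged bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CLEAN_MSR_BITMAP (1u << 0) /* hypothetical bit position */

struct toy_l1_state {
    bool     enlightened_bitmap; /* feature enabled by L1 */
    uint32_t clean_fields;       /* bit set: "unchanged since last run" */
    uint64_t l1_bitmap;          /* intercepts L1 wants, one bit per MSR */
};

struct toy_l0_state {
    uint64_t bitmap02;           /* bitmap actually used while running L2 */
    unsigned rebuilds;           /* how often bitmap02 was recomputed */
};

/* Rebuild bitmap02 unless the feature is on and L1 marked the bitmap clean. */
static void toy_prepare_bitmap02(struct toy_l0_state *l0,
                                 const struct toy_l1_state *l1)
{
    if (l1->enlightened_bitmap && (l1->clean_fields & TOY_CLEAN_MSR_BITMAP))
        return; /* optimization: trust L1 and keep the old bitmap */

    l0->bitmap02 = l1->l1_bitmap;
    l0->rebuilds++;
}

/* Would an L2 access to the MSR with this index be intercepted? */
static bool toy_l2_access_intercepted(const struct toy_l0_state *l0,
                                      unsigned msr_bit)
{
    assert(msr_bit < 64);
    return (l0->bitmap02 >> msr_bit) & 1;
}
```

In this model, changing l1_bitmap while leaving the clean bit set leaves
bitmap02 stale, which is the effect the third test step looks for;
clearing the bit forces the rebuild checked by the fourth step.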
KVM: selftests: nVMX: Properly deal with 'hv_clean_fields'
Instead of just resetting 'hv_clean_fields' to 0 on every enlightened
vmresume, do the expected cleaning of the corresponding bit on enlightened
vmwrite. Avoid direct access to 'current_evmcs' from evmcs_test to support
the change.
KVM: selftests: Adapt hyperv_cpuid test to the newly introduced Enlightened MSR-Bitmap
CPUID 0x40000000.EAX is now always present as it has Enlightened
MSR-Bitmap feature bit set. Adapt the test accordingly. Opportunistically
add a check for the supported eVMCS version range.
Similar to nVMX commit 502d2bf5f2fd ("KVM: nVMX: Implement Enlightened MSR
Bitmap feature"), add support for the feature for nSVM (Hyper-V on KVM).
Notable differences from nVMX implementation:
- As the feature uses SW reserved fields in VMCB control, KVM needs to
make sure it's dealing with a Hyper-V guest (kvm_hv_hypercall_enabled()).
- 'msrpm_base_pa' always needs to be overwritten in
nested_svm_vmrun_msrpm(), even when the update is skipped. As an
optimization, nested_vmcb02_prepare_control() copies it from VMCB01
so when MSR-Bitmap feature for L2 is disabled nothing needs to be done.
- 'struct vmcb_ctrl_area_cached' needs to be extended with clean
fields/sw reserved data and __nested_copy_vmcb_control_to_cache() needs to
copy it so nested_svm_vmrun_msrpm() can use it later.
KVM: nSVM: Split off common definitions for Hyper-V on KVM and KVM on Hyper-V
In preparation for implementing Enlightened MSR-Bitmap feature for Hyper-V
on KVM, split off the required definitions into common 'svm/hyperv.h'
header.
KVM: x86: Make kvm_hv_hypercall_enabled() static inline
In preparation for using kvm_hv_hypercall_enabled() from SVM code, make
it static inline to avoid the need to export it. The function is a
simple check with only two call sites currently.
KVM: nSVM: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt
Similar to nVMX commit ed2a4800ae9d ("KVM: nVMX: Track whether changes in
L0 require MSR bitmap for L2 to be rebuilt"), introduce a flag to keep
track of whether MSR bitmap for L2 needs to be rebuilt due to changes in
MSR bitmap for L1 or switching to a different L2.
David Matlack [Wed, 19 Jan 2022 23:07:39 +0000 (23:07 +0000)]
KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET
Add an option to dirty_log_perf_test.c to disable
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE and KVM_DIRTY_LOG_INITIALLY_SET so
the legacy dirty logging code path can be tested.
David Matlack [Wed, 19 Jan 2022 23:07:37 +0000 (23:07 +0000)]
KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.
Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.
David Matlack [Wed, 19 Jan 2022 23:07:36 +0000 (23:07 +0000)]
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failure to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPU execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has to split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
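
The figures quoted above can be rechecked with a few lines of arithmetic.
The helper names below are made up for this sketch; the inputs are the
numbers from the message itself. (2.94s - 0.14s) * 96 works out to 268.8
CPU-seconds, which the message rounds to ~270, and dividing that by 1.55
CPU-seconds gives a ratio of roughly 173, consistent with "over 100x".

```c
#include <assert.h>

/* CPU time spent splitting on fault: per-vCPU slowdown times vCPU count. */
static double toy_cpu_seconds_on_fault(double before_s, double after_s,
                                       unsigned vcpus)
{
    return (before_s - after_s) * vcpus;
}

/* Relative decrease in percent, e.g. for the 2.94s -> 0.14s write time. */
static double toy_percent_decrease(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* How many times cheaper the eager path is than the on-fault path. */
static double toy_efficiency_ratio(double on_fault_cpu_s, double eager_cpu_s)
{
    assert(eager_cpu_s > 0.0);
    return on_fault_cpu_s / eager_cpu_s;
}
```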
David Matlack [Wed, 19 Jan 2022 23:07:35 +0000 (23:07 +0000)]
KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization
Separate the allocation of shadow pages from their initialization. This
is in preparation for splitting huge pages outside of the vCPU fault
context, which requires a different allocation mechanism.
David Matlack [Wed, 19 Jan 2022 23:07:34 +0000 (23:07 +0000)]
KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
Derive the page role from the parent shadow page, since the only thing
that changes is the level. This is in preparation for splitting huge
pages during VM-ioctls which do not have access to the vCPU MMU context.
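
As a rough illustration of deriving a child's role from its parent, the
sketch below copies the parent's role and lowers only the level. The
toy_page_role layout and toy_child_role() are hypothetical and far
simpler than the kernel's role union; the point is only that no vCPU
context is needed to compute the child's role.

```c
#include <assert.h>

/* Hypothetical, heavily reduced page role; not the kernel's layout. */
struct toy_page_role {
    unsigned level       : 4;
    unsigned direct      : 1;
    unsigned ad_disabled : 1;
    unsigned access      : 3;
};

/* A child page table inherits everything from its parent except the level. */
static struct toy_page_role toy_child_role(struct toy_page_role parent)
{
    struct toy_page_role role = parent;

    assert(parent.level > 1); /* a level-1 table has no child tables */
    role.level = parent.level - 1;
    return role;
}
```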
David Matlack [Wed, 19 Jan 2022 23:07:33 +0000 (23:07 +0000)]
KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
The vCPU's mmu_role already has the correct values for direct,
has_4_byte_gpte, access, and ad_disabled. Remove the code that was
redundantly overwriting these fields with the same values.
David Matlack [Wed, 19 Jan 2022 23:07:32 +0000 (23:07 +0000)]
KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
Instead of passing a pointer to the root page table and the root level
separately, pass in a pointer to the root kvm_mmu_page struct. This
reduces the number of arguments by 1, cutting down on line lengths.
David Matlack [Wed, 19 Jan 2022 23:07:31 +0000 (23:07 +0000)]
KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
restore_acc_track_spte() is pure SPTE bit manipulation, making it a good
fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed,
there isn't any good reason to not inline it.
This move also prepares for a follow-up commit that will need to call
restore_acc_track_spte() from spte.c.
David Matlack [Wed, 19 Jan 2022 23:07:29 +0000 (23:07 +0000)]
KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte()
The warnings in restore_acc_track_spte() can be removed because the only
caller checks is_access_track_spte(), and is_access_track_spte() checks
!spte_ad_enabled(). In other words, the warning can never be triggered.
David Matlack [Wed, 19 Jan 2022 23:07:28 +0000 (23:07 +0000)]
KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table
Consolidate the logic to atomically replace an SPTE with an SPTE that
points to a new page table into a single helper function. This will be
used in a follow-up commit to split huge pages, which involves replacing
each huge page SPTE with an SPTE that points to a page table.
Opportunistically drop the call to trace_kvm_mmu_get_page() in
kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
tdp_mmu_alloc_sp().
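
The "atomically replace an SPTE" step can be sketched in userspace with a
C11 compare-and-exchange. This is an assumption-laden toy, not the helper
introduced by the commit: toy_install_table() and the plain 64-bit entry
are invented here, and the real code also has to deal with accounting
and the surrounding locking rules.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Replace the entry at *sptep with one pointing to a new page table, but
 * only if the entry still holds the value the caller last observed.
 * Returns false if another thread changed the entry in the meantime, in
 * which case the caller would have to retry or give up.
 */
static bool toy_install_table(_Atomic uint64_t *sptep, uint64_t old_spte,
                              uint64_t new_table_spte)
{
    assert(sptep);
    return atomic_compare_exchange_strong(sptep, &old_spte, new_table_spte);
}
```

A caller that loses the race sees the entry unchanged by its own attempt,
so splitting a huge page this way never overwrites a concurrent update.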