Git Repo - linux.git/log

KVM: arm64: Fixup hyp stage-1 refcount

In nVHE-protected mode, the hyp stage-1 page-table refcount is broken
due to the lack of refcount support in the early allocator. Fix-up the
refcount in the finalize walker, once the 'hyp_vmemmap' is up and running.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: Refcount hyp stage-1 pgtable pages

To prepare the ground for allowing hyp stage-1 mappings to be removed at
run-time, update the KVM page-table code to maintain a correct refcount
using the ->{get,put}_page() function callbacks.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: Provide {get,put}_page() stubs for early hyp allocator

In nVHE protected mode, the EL2 code uses a temporary allocator during
boot while re-creating its stage-1 page-table. Unfortunately, the
hyp_vmmemap is not ready to use at this stage, so refcounting pages
is not possible. That is not currently a problem because hyp stage-1
mappings are never removed, which implies refcounting of page-table
pages is unnecessary.

In preparation for allowing hypervisor stage-1 mappings to be removed,
provide stub implementations for {get,put}_page() in the early allocator.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge branch kvm-arm64/vgic-fixes-5.17 into kvmarm-master/next

* kvm-arm64/vgic-fixes-5.17:
  : .
  : A few vgic fixes:
  : - Harden vgic-v3 error handling paths against signed vs unsigned
  :   comparison that will happen once the xarray-based vcpus are in
  : - Demote userspace-triggered console output to kvm_debug()
  : .
  KVM: arm64: vgic: Demote userspace-triggered console prints to kvm_debug()
  KVM: arm64: vgic-v3: Fix vcpu index comparison

Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm64: vgic: Demote userspace-triggered console prints to kvm_debug()

Running the KVM selftests results in these messages being dumped
in the kernel console:

[  188.051073] kvm [469]: VGIC redist and dist frames overlap
[  188.056820] kvm [469]: VGIC redist and dist frames overlap
[  188.076199] kvm [469]: VGIC redist and dist frames overlap

Being amle to trigger this from userspace is definitely not on,
so demote these warnings to kvm_debug().

Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: vgic-v3: Fix vcpu index comparison

When handling an error at the point where we try and register
all the redistributors, we unregister all the previously
registered frames by counting down from the failing index.

However, the way the code is written relies on that index
being a signed value. Which won't be true once we switch to
an xarray-based vcpu set.

Since this code is pretty awkward the first place, and that the
failure mode is hard to spot, rewrite this loop to iterate
over the vcpus upwards rather than downwards.

Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge branch kvm-arm64/pkvm-cleanups-5.17 into kvmarm-master/next

* kvm-arm64/pkvm-cleanups-5.17:
  : .
  : pKVM cleanups from Quentin Perret:
  :
  : This series is a collection of various fixes and cleanups for KVM/arm64
  : when running in nVHE protected mode. The first two patches are real
  : fixes/improvements, the following two are minor cleanups, and the last
  : two help satisfy my paranoia so they're certainly optional.
  : .
  KVM: arm64: pkvm: Make kvm_host_owns_hyp_mappings() robust to VHE
  KVM: arm64: pkvm: Stub io map functions
  KVM: arm64: Make __io_map_base static
  KVM: arm64: Make the hyp memory pool static
  KVM: arm64: pkvm: Disable GICv2 support
  KVM: arm64: pkvm: Fix hyp_pool max order

Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm64: pkvm: Make kvm_host_owns_hyp_mappings() robust to VHE

The kvm_host_owns_hyp_mappings() function should return true if and only
if the host kernel is responsible for creating the hypervisor stage-1
mappings. That is only possible in standard non-VHE mode, or during boot
in protected nVHE mode. But either way, none of this makes sense in VHE,
so make sure to catch this case as well, hence making the function
return sensible values in any context (VHE or not).

Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: pkvm: Stub io map functions

Now that GICv2 is disabled in nVHE protected mode there should be no
other reason for the host to use create_hyp_io_mappings() or
kvm_phys_addr_ioremap(). Add sanity checks to make sure that assumption
remains true looking forward.

Signed-off-by: Quentin Perret <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: Make __io_map_base static

The __io_map_base variable is used at EL2 to track the end of the
hypervisor's "private" VA range in nVHE protected mode. However it
doesn't need to be used outside of mm.c, so let's make it static to keep
all the hyp VA allocation logic in one place.

Signed-off-by: Quentin Perret <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: Make the hyp memory pool static

The hyp memory pool struct is sized to fit exactly the needs of the
hypervisor stage-1 page-table allocator, so it is important it is not
used for anything else. As it is currently used only from setup.c,
reduce its visibility by marking it static.

Signed-off-by: Quentin Perret <[email protected]>
Reviewed-by: Andrew Walbran <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: pkvm: Disable GICv2 support

GICv2 requires having device mappings in guests and the hypervisor,
which is incompatible with the current pKVM EL2 page ownership model
which only covers memory. While it would be desirable to support pKVM
with GICv2, this will require a lot more work, so let's make the
current assumption clear until then.

Co-developed-by: Marc Zyngier <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: pkvm: Fix hyp_pool max order

The EL2 page allocator in protected mode maintains a per-pool max order
value to optimize allocations when the memory region it covers is small.
However, the max order value is currently under-estimated whenever the
number of pages in the region is a power of two. Fix the estimation.

Signed-off-by: Quentin Perret <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Use kvm_arch_vcpu_get_wait() to get rcuwait object

Use kvm_arch_vcpu_get_wait() to get a vCPU's rcuwait object instead of
using vcpu->wait directly in kvmhv_run_single_vcpu(). Functionally, this
is a nop as vcpu->arch.waitp is guaranteed to point at vcpu->wait. But
that is not obvious at first glance, and a future change coming in via
the KVM tree, commit 510958e99721 ("KVM: Force PPC to define its own
rcuwait object"), will hide vcpu->wait from architectures that define
__KVM_HAVE_ARCH_WQP to prevent generic KVM from attepting to wake a vCPU
with the wrong rcuwait object.

Reported-by: Sachin Sant <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

selftests: KVM: Add test to verify KVM doesn't explode on "bad" I/O

Add an x86 selftest to verify that KVM doesn't WARN or otherwise explode
if userspace modifies RCX during a userspace exit to handle string I/O.
This is a regression test for a user-triggerable WARN introduced by
commit 3b27de271839 ("KVM: x86: split the two parts of emulator_pio_in").

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211025201311.1881846 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit

Replace a WARN with a comment to call out that userspace can modify RCX
during an exit to userspace to handle string I/O. KVM doesn't actually
support changing the rep count during an exit, i.e. the scenario can be
ignored, but the WARN needs to go as it's trivial to trigger from
userspace.

Cc: [email protected]
Fixes: 3b27de271839 ("KVM: x86: split the two parts of emulator_pio_in")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211025201311.1881846 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode

In the SDM:
If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an
attempt to clear CR0.PG causes a general-protection exception (#GP).
Software should transition to compatibility mode and clear CR4.PCIDE
before attempting to disable paging.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211207095230 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: KVM: avoid failures due to reserved HyperTransport region

AMD proceessors define an address range that is reserved by HyperTransport
and causes a failure if used for guest physical addresses. Avoid
selftests failures by reserving those guest physical addresses; the
rules are:

- On parts with <40 bits, its fully hidden from software.

- Before Fam17h, it was always 12G just below 1T, even if there was more
RAM above this location. In this case we just not use any RAM above 1T.

- On Fam17h and later, it is variable based on SME, and is either just
below 2^48 (no encryption) or 2^43 (encryption).

Fixes: ef4c9f4f6546 ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()")
Cc: [email protected]
Cc: David Matlack <[email protected]>
Reported-by: Maxim Levitsky <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20210805105423 [email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Tested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req

Do not bail early if there are no bits set in the sparse banks for a
non-sparse, a.k.a. "all CPUs", IPI request.  Per the Hyper-V spec, it is
legal to have a variable length of '0', e.g. VP_SET's BankContents in
this case, if the request can be serviced without the extra info.

  It is possible that for a given invocation of a hypercall that does
  accept variable sized input headers that all the header input fits
  entirely within the fixed size header. In such cases the variable sized
  input header is zero-sized and the corresponding bits in the hypercall
  input should be set to zero.

Bailing early results in KVM failing to send IPIs to all CPUs as expected
by the guest.

Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211207220926 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall

Prior to commit 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use
tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH |
KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs
and KVM_REQ_TLB_FLUSH is/was defined as:

(0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that
"This call guarantees that by the time control returns back to the
caller, the observable effects of all flushes on the specified virtual
processors have occurred." and without KVM_REQUEST_WAIT there's a small
chance that the vCPU making the TLB flush will resume running before
all IPIs get delivered to other vCPUs and a stale mapping can get read
there.

Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST:
kvm_hv_flush_tlb() is the sole caller which uses it for
kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where
KVM_REQUEST_WAIT makes a difference.

Cc: [email protected]
Fixes: 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211209102937 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: arm64: Use Makefile.kvm for common files

Signed-off-by: David Woodhouse <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: powerpc: Use Makefile.kvm for common files

It's all fairly baroque but in the end, I don't think there's any reason
for $(KVM)/irqchip.o to have been handled differently, as they all end
up in $(kvm-y) in the end anyway, regardless of whether they get there
via $(common-objs-y) and the CPU-specific object lists.

Signed-off-by: David Woodhouse <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: RISC-V: Use Makefile.kvm for common files

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: mips: Use Makefile.kvm for common files

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: s390: Use Makefile.kvm for common files

Signed-off-by: David Woodhouse <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Add Makefile.kvm for common files, use it for x86

Splitting kvm_main.c out into smaller and better-organized files is
slightly non-trivial when it involves editing a bunch of per-arch
KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include.

Signed-off-by: David Woodhouse <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING

I'd like to make the build include dirty_ring.c based on whether the
arch wants it or not. That's a whole lot simpler if there's a config
symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET
being set to something non-zero.

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211121125451 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: selftests: svm_int_ctl_test: fix intercept calculation

INTERCEPT_x are bit positions, but the code was using the raw value of
INTERCEPT_VINTR (4) instead of BIT(INTERCEPT_VINTR).
This resulted in masking of bit 2 - that is, SMI instead of VINTR.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
Message-Id: <49b9571d25588870db5380b0be1a41df4bbaaf93.1638486479 [email protected]>

KVM: VMX: Clean up PI pre/post-block WARNs

Move the WARN sanity checks out of the PI descriptor update loop so as
not to spam the kernel log if the condition is violated and the update
takes multiple attempts due to another writer. This also eliminates a
few extra uops from the retry path.

Technically not checking every attempt could mean KVM will now fail to
WARN in a scenario that would have failed before, but any such failure
would be inherently racy as some other agent (CPU or device) would have
to concurrent modify the PI descriptor.

Add a helper to handle the actual write and more importantly to document
why the write may need to be retried.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails

Add a memory barrier between writing vcpu->requests and reading
vcpu->guest_mode to ensure the read is ordered after the write when
(potentially) delivering an IRQ to L2 via nested posted interrupt. If
the request were to be completed after reading vcpu->mode, it would be
possible for the target vCPU to enter the guest without posting the
interrupt and without handling the event request.

Note, the barrier is only for documentation since atomic operations are
serializing on x86.

Suggested-by: Paolo Bonzini <[email protected]>
Fixes: 6b6977117f50 ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2")
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: add a tracepoint for APICv/AVIC interrupt delivery

This allows to see how many interrupts were delivered via the
APICv/AVIC from the host.

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20211209115440 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: arm64: Drop unused workaround_flags vcpu field

workaround_flags is a leftover from our earlier Spectre-v4 workaround
implementation, and now serves no purpose.

Get rid of the field and the corresponding asm-offset definition.

Fixes: 29e8910a566a ("KVM: arm64: Simplify handling of ARCH_WORKAROUND_2")
Signed-off-by: Marc Zyngier <[email protected]>

KVM: nVMX: Implement Enlightened MSR Bitmap feature

Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V
offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to
inform L0 when it changes MSR bitmap, this eliminates the need to examine
L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets
constructed.

Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.

Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
nested hypervisor on top of Hyper-V. The newly introduced feature is going
to be used by Hyper-V guests on KVM.

When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
cycles from a nested vmexit cost (tight cpuid loop test).

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211129094704 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt

Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
rebuilt due to changes in MSR bitmap for L1 or switching to a different
L2. This information will be used for Enlightened MSR Bitmap feature for
Hyper-V guests.

Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is
not really needed for Enlightened MSR Bitmap as the feature can only
be used in conjunction with Enlightened VMCS but let's keep tracking
information complete, it's cheap and in the future similar PV feature can
easily be implemented for KVM on KVM too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211129094704 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Introduce vmx_msr_bitmap_l01_changed() helper

In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V
guests move MSR bitmap update tracking to a dedicated helper.

Note: vmx_msr_bitmap_l01_changed() is called when MSR bitmap might be
updated. KVM doesn't check if the bit we're trying to set is already set
(or the bit it's trying to clear is already cleared). Such situations
should not be common and a few false positives should not be a problem.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <20211129094704 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'kvm-on-hv-msrbm-fix' into HEAD

Merge bugfix for enlightened MSR Bitmap, before adding support
to KVM for exposing the feature to nested guests.

KVM: x86: Exit to userspace if emulation prepared a completion callback

em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if MSR accesses
required an exit to userspace. However, x86_emulate_insn() doesn't return
X86EMUL_*, so x86_emulate_instruction() doesn't directly act on
X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate
between PIO, MMIO, etc. causing RDMSR/WRMSR emulation not to
exit to userspace now.

Nevertheless, if the userspace_msr_exit_test testcase in selftests
is changed to test RDMSR/WRMSR with a forced emulation prefix,
the test passes. What happens is that first userspace exit
information is filled but the userspace exit does not happen.
Because x86_emulate_instruction() returns 1, the guest retries
the instruction---but this time RIP has already been adjusted
past the forced emulation prefix, so the guest executes RDMSR/WRMSR
and the userspace exit finally happens.

Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io
callback, x86_emulate_instruction() can just return 0 if the
callback is not NULL. Then RDMSR/WRMSR instruction emulation will
exit to userspace directly, without the RDMSR/WRMSR vmexit.

Fixes: 1ae099540e8c7 ("KVM: x86: Allow deflecting unknown MSR accesses to user space")
Signed-off-by: Hou Wenlong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679 [email protected]>

KVM: nVMX: Don't use Enlightened MSR Bitmap for L3

When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened
VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which
are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is
updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from
clean fields to make Hyper-V aware of the change. For KVM's L1s, this is
done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr().
MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending
MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't
check if the resulting bitmap is different and never cleans
HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and
may result in Hyper-V missing the update.

The issue could've been solved by calling evmcs_touch_msr_bitmap() for
eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so
would not give any performance benefits (compared to not using Enlightened
MSR Bitmap at all). 3-level nesting is also not a very common setup
nowadays.

Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for
now.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211129094704 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Use different callback if msr access comes from the emulator

If msr access triggers an exit to userspace, the
complete_userspace_io callback would skip instruction by vendor
callback for kvm_skip_emulated_instruction(). However, when msr
access comes from the emulator, e.g. if kvm.force_emulation_prefix
is enabled and the guest uses rdmsr/wrmsr with kvm prefix,
VM_EXIT_INSTRUCTION_LEN in vmcs is invalid and
kvm_emulate_instruction() should be used to skip instruction
instead.

As Sean noted, unlike the previous case, there's no #UD if
unrestricted guest is disabled and the guest accesses an MSR in
Big RM. So the correct way to fix this is to attach a different
callback when the msr access comes from the emulator.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679 [email protected]>

KVM: x86: Add an emulation type to handle completion of user exits

The next patch would use kvm_emulate_instruction() with
EMULTYPE_SKIP in complete_userspace_io callback to fix a
problem in msr access emulation. However, EMULTYPE_SKIP
only updates RIP, more things like updating interruptibility
state and injecting single-step #DBs would be done in the
callback. Since the emulator also does those things after
x86_emulate_insn(), add a new emulation type to pair with
EMULTYPE_SKIP to do those things for completion of user exits
within the emulator.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679 [email protected]>

KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code seg

Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the
decode phase does not truncate _eip. Wrapping the 32-bit boundary is
legal if and only if CS is a flat code segment, but that check is
implicitly handled in the form of limit checks in the decode phase.

Opportunstically prepare for a future fix by storing the result of any
truncation in "eip" instead of "_eip".

Fixes: 1957aa63be53 ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679 [email protected]>

KVM: Clear pv eoi pending bit only when it is set

merge pv_eoi_get_pending and pv_eoi_clr_pending into a single
function pv_eoi_test_and_clear_pending, which returns and clear
the value of the pending bit.

This makes it possible to clear the pending bit only if the guest
had set it, and otherwise skip the call to pv_eoi_put_user().
This can save up to 300 nsec on AMD EPYC processors.

Suggested-by: Vitaly Kuznetsov <[email protected]>
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Li RongQing <[email protected]>
Message-Id: <1636026974 [email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: don't print when fail to read/write pv eoi memory

If guest gives MSR_KVM_PV_EOI_EN a wrong value, this printk() will
be trigged, and kernel log is spammed with the useless message

Fixes: 0d88800d5472 ("kvm: x86: ioapic and apic debug macros cleanup")
Reported-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Li RongQing <[email protected]>
Cc: [email protected]
Message-Id: <1636026974 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Remove mmu parameter from load_pdptrs()

It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
and nested NPT treats them like all other non-leaf page table levels
instead of caching them.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Rename gpte_is_8_bytes to has_4_byte_gpte and invert the direction

This bit is very close to mean "role.quadrant is not in use", except that
it is false also when the MMU is mapping guest physical addresses
directly. In that case, role.quadrant is indeed not in use, but there
are no guest PTEs at all.

Changing the name and direction of the bit removes the special case,
since a guest with paging disabled, or not considering guest paging
structures as is the case for two-dimensional paging, does not have
to deal with 4-byte guest PTEs.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Use ept_caps_to_lpage_level() in hardware_setup()

Using ept_caps_to_lpage_level is simpler.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu()

The level of supported large page on nEPT affects the rsvds_bits_mask.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add huge_page_level to __reset_rsvds_bits_mask_ept()

Bit 7 on pte depends on the level of supported large page.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Remove mmu->translate_gpa

Reduce an indirect function call (retpoline) and some intialization
code.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add parameter struct kvm_mmu *mmu into mmu->gva_to_gpa()

The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra
FNAME(gva_to_gpa_nested) is needed.

Add the parameter can simplify the code. And it makes it explicit that
the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2
gpa in translate_nested_gpa() via the new parameter.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211124122055 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Calculate quadrant when !role.gpte_is_8_bytes

role.quadrant is only valid when gpte size is 4 bytes and only be
calculated when gpte size is 4 bytes.

Although "vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL" also means
gpte size is 4 bytes, but using "!role.gpte_is_8_bytes" is clearer

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Remove useless code to set role.gpte_is_8_bytes when role.direct

role.gpte_is_8_bytes is unused when role.direct; there is no
point in changing a bit in the role, the value that was set
when the MMU is initialized is just fine.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Remove unused declaration of __kvm_mmu_free_some_pages()

The body of __kvm_mmu_free_some_pages() has been removed.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Fix comment in __kvm_mmu_create()

The allocation of special roots is moved to mmu_alloc_special_roots().

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Skip allocating pae_root for vcpu->arch.guest_mmu when !tdp_enabled

It is never used.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Allocate sd->save_area with __GFP_ZERO

And remove clear_page() on it.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Rename get_max_npt_level() to get_npt_level()

It returns the only proper NPT level, so the "max" in the name
is not appropriate.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Change comments about vmx_get_msr()

The variable name is changed in the code.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Use kvm_set_msr_common() for MSR_IA32_TSC_ADJUST in the default way

MSR_IA32_TSC_ADJUST can be left to the default way which also uese
kvm_set_msr_common().

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()

The host CR3 in the vcpu thread can only be changed when scheduling.
Moving the code in vmx_prepare_switch_to_guest() makes the code
simpler.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Update msr value after kvm_set_user_return_msr() succeeds

Aoid earlier modification.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP)

The value of host MSR_IA32_SYSENTER_ESP is known to be constant for
each CPU: (cpu_entry_stack(cpu) + 1) when 32 bit syscall is enabled or
NULL is 32 bit syscall is not enabled.

So rdmsrl() can be avoided for the first case and both rdmsrl() and
vmcs_writel() can be avoided for the second case.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211118110814 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Update mmu->pdptrs only when it is changed

It is unchanged in most cases.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211111144527 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Remove kvm_register_clear_available()

It has no user.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: vmx, svm: clean up mass updates to regs_avail/regs_dirty bits

Document the meaning of the three combinations of regs_avail and
regs_dirty. Update regs_dirty just after writeback instead of
doing it later after vmexit. After vmexit, instead, we clear the
regs_avail bits corresponding to lazily-loaded registers.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty

When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed

When vcpu->arch.cr3 is changed, it should be marked dirty unless it
is being updated to the value of the architecture guest CR3 (i.e.
VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled).

This patch has no functionality changed because
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional
change to vcpu->arch.regs_dirty, but no code uses regs_dirty for
VCPU_EXREG_CR3. (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead
to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses
regs_avail for VCPU_EXREG_CR3 dirty information.)

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Remove references to VCPU_EXREG_CR3

VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or
vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is
merely a NOP, and testing it will likewise always succeed.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Remove outdated comment in svm_load_mmu_pgd()

The comment had been added in the commit 689f3bf21628 ("KVM: x86: unify
callbacks to load paging root") and its related code was removed later,
and it has nothing to do with the next line of code.

So the comment should be removed too.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS

Not functionality changed.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Add and use X86_CR4_PDPTR_BITS when !enable_ept

In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be
intercepted in an unclear way.

Add X86_CR4_PDPTR_BITS to make it clear and self-documented.

No functionality changed.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Add and use X86_CR4_TLBFLUSH_BITS when !enable_ept

In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when
!enable_ept just because X86_CR4_PGE is the only bit that is
responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS.

It is clearer and self-documented to use X86_CR4_TLBFLUSH_BITS instead.

No functionality changed.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Track dirtiness of PDPTRs even if NPT is disabled

Use the same logic to handle the availability of VCPU_EXREG_PDPTR
as VMX, also removing a branch in svm_vcpu_run().

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs()

mmu->pdptrs[] and vmcs.GUEST_PDPTR[0-3] are synced, so mmu->pdptrs is
available and GUEST_PDPTR[0-3] is not dirty.

Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Ensure that dirty PDPTRs are loaded

For VMX with EPT, dirty PDPTRs need to be loaded before the next vmentry
via vmx_load_mmu_pgd()

But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd()
to be invoked. Normally, kvm_mmu_reset_context() is used to cause
KVM_REQ_LOAD_MMU_PGD, but sometimes it is skipped:

* commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.

* commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after
load_pdptrs() when rewriting the CR3 with the same value.

* commit a91a7c709600("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.

Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <20211108124407 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/svm: Add module param to control PMU virtualization

For Intel, the guest PMU can be disabled via clearing the PMU CPUID.
For AMD, all hw implementations support the base set of four
performance counters, with current mainstream hardware indicating
the presence of two additional counters via X86_FEATURE_PERFCTR_CORE.

In the virtualized world, the AMD guest driver may detect
the presence of at least one counter MSR. Most hypervisor
vendors would introduce a module param (like lbrv for svm)
to disable PMU for all guests.

Another control proposal per-VM is to pass PMU disable information
via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00].
Both of methods require some guest-side changes, so a module
parameter may not be sufficiently granular, but practical enough.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <20211117080304 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Remove vCPU from PI wakeup list before updating PID.NV

Remove the vCPU from the wakeup list before updating the notification
vector in the posted interrupt post-block helper.  There is no need to
wake the current vCPU as it is by definition not blocking.  Practically
speaking this is a nop as it only shaves a few meager cycles in the
unlikely case that the vCPU was migrated and the previous pCPU gets a
wakeup IRQ right before PID.NV is updated.  The real motivation is to
allow for more readable code in the future, when post-block is merged
with vmx_vcpu_pi_load(), at which point removal from the list will be
conditional on the old notification vector.

Opportunistically add comments to document why KVM has a per-CPU spinlock
that, at first glance, appears to be taken only on the owning CPU.
Explicitly call out that the spinlock must be taken with IRQs disabled, a
detail that was "lost" when KVM switched from spin_lock_irqsave() to
spin_lock(), with IRQs disabled for the entirety of the relevant path.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Move Posted Interrupt ndst computation out of write loop

Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor
out of the loop to write the descriptor, preemption is disabled so the
CPU won't change, and if the APIC ID changes KVM has bigger problems.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Read Posted Interrupt "control" exactly once per loop iteration

Use READ_ONCE() when loading the posted interrupt descriptor control
field to ensure "old" and "new" have the same base value. If the
compiler emits separate loads, and loads into "new" before "old", KVM
could theoretically drop the ON bit if it were set between the loads.

Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Save/restore IRQs (instead of CLI/STI) during PI pre/post block

Save/restore IRQs when disabling IRQs in posted interrupt pre/post block
in preparation for moving the code into vcpu_put/load(), where it would be
called with IRQs already disabled.

No functional changed intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Drop pointless PI.NDST update when blocking

Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the
pre-block path, as NDST is guaranteed to be up-to-date.  The comment
about the vCPU being preempted during the update is simply wrong, as the
update path runs with IRQs disabled (from before snapshotting vcpu->cpu,
until after the update completes).

Since commit 8b306e2f3c41 ("KVM: VMX: avoid double list add with VT-d
posted interrupts", 2017-09-27) The vCPU can get preempted _before_
the update starts, but not during.  And if the vCPU is preempted before,
vmx_vcpu_pi_load() is responsible for updating NDST when the vCPU is
scheduled back in.  In that case, the check against the wakeup vector in
vmx_vcpu_pi_load() cannot be true as that would require the notification
vector to have been set to the wakeup vector _before_ blocking.

Opportunistically switch to using vcpu->cpu for the list/lock lookups,
which do not need pre_pcpu since the same commit.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Use boolean returns for Posted Interrupt "test" helpers

Return bools instead of ints for the posted interrupt "test" helpers.
The bit position of the flag being test does not matter to the callers,
and is in fact lost by virtue of test_bit() itself returning a bool.

Returning ints is potentially dangerous, e.g. "pi_test_on(pi_desc) == 1"
is safe-ish because ON is bit 0 and thus any sane implementation of
pi_test_on() will work, but for SN (bit 1), checking "== 1" would rely on
pi_test_on() to return 0 or 1, a.k.a. bools, as opposed to 0 or 2 (the
positive bit position).

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Drop unnecessary PI logic to handle impossible conditions

Drop sanity checks on the validity of the previous pCPU when handling
vCPU block/unlock for posted interrupts. The intention behind the sanity
checks is to avoid memory corruption in case of a race or incorrect locking,
but the code has been stable for a few years now and the checks get in
the way of eliminating kvm_vcpu.pre_cpu.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Skip Posted Interrupt updates if APICv is hard disabled

Explicitly skip posted interrupt updates if APICv is disabled in all of
KVM, or if the guest doesn't have an in-kernel APIC. The PI descriptor
is kept up-to-date if APICv is inhibited, e.g. so that re-enabling APICv
doesn't require a bunch of updates, but neither the module param nor the
APIC type can be changed on-the-fly.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Add helpers to wake/query blocking vCPU

Add helpers to wake and query a blocking vCPU. In addition to providing
nice names, the helpers reduce the probability of KVM neglecting to use
kvm_arch_vcpu_get_wait().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Invoke kvm_vcpu_block() directly for non-HALTED wait states

Call kvm_vcpu_block() directly for all wait states except HALTED so that
kvm_vcpu_halt() is no longer a misnomer on x86.

Functionally, this means KVM will never attempt halt-polling or adjust
vcpu->halt_poll_ns for INIT_RECEIVED (a.k.a. Wait-For-SIPI (WFS)) or
AP_RESET_HOLD; UNINITIALIZED is handled in kvm_arch_vcpu_ioctl_run(),
and x86 doesn't use any other "wait" states.

As mentioned above, the motivation of this is purely so that "halt" isn't
overloaded on x86, e.g. in KVM's stats. Skipping halt-polling for WFS
(and RESET_HOLD) has no meaningful effect on guest performance as there
are typically single-digit numbers of INIT-SIPI sequences per AP vCPU,
per boot, versus thousands of HLTs just to boot to console.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Directly block (instead of "halting") UNINITIALIZED vCPUs

Go directly to kvm_vcpu_block() when handling the case where userspace
attempts to run an UNINITIALIZED vCPU. The vCPU is not halted, nor is it
likely that halt-polling will be successful in this case.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Don't redo ktime_get() when calculating halt-polling stop/deadline

Calculate the halt-polling "stop" time using "start" instead of redoing
ktime_get(). In practice, the numbers involved are in the noise (e.g.,
in the happy case where hardware correctly predicts do_halt_poll and
there are no interrupts, "start" is probably only a few cycles old)
and either approach is perfectly ok. But it's more precise to count
any extra latency toward the halt-polling time.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: stats: Add stat to detect if vcpu is currently blocking

Add a "blocking" stat that userspace can use to detect the case where a
vCPU is not being run because of an vCPU/guest action, e.g. HLT or WFS on
x86, WFI on arm64, etc... Current guest/host/halt stats don't show this
well, e.g. if a guest halts for a long period of time then the vCPU could
could appear pathologically blocked due to a host condition, when in
reality the vCPU has been put into a not-runnable state by the guest.

Originally-by: Cannon Matthews <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
[sean: renamed stat to "blocking", massaged changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Split out a kvm_vcpu_block() helper from kvm_vcpu_halt()

Factor out the "block" part of kvm_vcpu_halt() so that x86 can emulate
non-halt wait/sleep/block conditions that should not be subjected to
halt-polling.

No functional change intended.

Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Rename kvm_vcpu_block() => kvm_vcpu_halt()

Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting
the actual "block" sequences into a separate helper (to be named
kvm_vcpu_block()). x86 will use the standalone block-only path to handle
non-halt cases where the vCPU is not runnable.

Rename block_ns to halt_ns to match the new function name.

No functional change intended.

Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Drop obsolete kvm_arch_vcpu_block_finish()

Drop kvm_arch_vcpu_block_finish() now that all arch implementations are
nops.

No functional change intended.

Acked-by: Christian Borntraeger <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt()

Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".

No functional change intended.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Don't block+unblock when halt-polling is successful

Invoke the arch hooks for block+unblock if and only if KVM actually
attempts to block the vCPU.  The only non-nop implementation is on x86,
specifically SVM's AVIC, and there is no need to put the AVIC prior to
halt-polling; KVM x86's kvm_vcpu_has_events() will scour the full vIRR
to find pending IRQs regardless of whether the AVIC is loaded/"running".

The primary motivation is to allow future cleanup to split out "block"
from "halt", but this is also likely a small performance boost on x86 SVM
when halt-polling is successful.

Adjust the post-block path to update "cur" after unblocking, i.e. include
AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
is consistent.  Moving just the pre-block arch hook would result in only
the AVIC put latency being included in the halt_wait stats.  There is no
obvious evidence that one way or the other is correct, so just ensure KVM
is consistent.

Note, x86 has two separate paths for handling APICv with respect to vCPU
blocking.  VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
hooks in kvm_vcpu_block().  Prior to this path, the two paths were more
or less functionally identical.  That is very much not the case after
this patch, as the hooks used by VMX _must_ fire before halt-polling.
x86's entire mess will be cleaned up in future patches.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: arm64: Move vGIC v4 handling for WFI out arch callback hook

Move the put and reload of the vGIC out of the block/unblock callbacks
and into a dedicated WFI helper.  Functionally, this is nearly a nop as
the block hook is called at the very beginning of kvm_vcpu_block(), and
the only code in kvm_vcpu_block() after the unblock hook is to update the
halt-polling controls, i.e. can only affect the next WFI.

Back when the arch (un)blocking hooks were added by commits 3217f7c25bca
("KVM: Add kvm_arch_vcpu_{un}blocking callbacks) and d35268da6687
("arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block"),
the hooks were invoked only when KVM was about to "block", i.e. schedule
out the vCPU.  The use case at the time was to schedule a timer in the
host based on the earliest timer in the guest in order to wake the
blocking vCPU when the emulated guest timer fired.  Commit accb99bcd0ca
("KVM: arm/arm64: Simplify bg_timer programming") reworked the timer
logic to be even more precise, by waiting until the vCPU was actually
scheduled out, and so move the timer logic from the (un)blocking hooks to
vcpu_load/put.

In the meantime, the hooks gained usage for enabling vGIC v4 doorbells in
commit df9ba95993b9 ("KVM: arm/arm64: GICv4: Use the doorbell interrupt
as an unblocking source"), and added related logic for the VMCR in commit
5eeaf10eec39 ("KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block").

Finally, commit 07ab0f8d9a12 ("KVM: Call kvm_arch_vcpu_blocking early
into the blocking sequence") hoisted the (un)blocking hooks so that they
wrapped KVM's halt-polling logic in addition to the core "block" logic.

In other words, the original need for arch hooks to take action _only_
in the block path is long since gone.

Cc: Oliver Upton <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: s390: Clear valid_wakeup in kvm_s390_handle_wait(), not in arch hook

Move the clearing of valid_wakeup from kvm_arch_vcpu_block_finish() so
that a future patch can drop said arch hook. Unlike the other blocking-
related arch hooks, vcpu_blocking/unblocking(), vcpu_block_finish() needs
to be called even if the KVM doesn't actually block the vCPU. This will
allow future patches to differentiate between truly blocking the vCPU and
emulating a halt condition without introducing a contradiction.

Alternatively, the hook could be renamed to kvm_arch_vcpu_halt_finish(),
but there's literally one call site in s390, and future cleanup can also
be done to handle valid_wakeup fully within kvm_s390_handle_wait() and
allow generic KVM to drop vcpu_valid_wakeup().

No functional change intended.

Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Reconcile discrepancies in halt-polling stats

Move the halt-polling "success" and histogram stats update into the
dedicated helper to fix a discrepancy where the success/fail "time" stats
consider polling successful so long as the wait is avoided, but the main
"success" and histogram stats consider polling successful if and only if
a wake event was detected by the halt-polling loop.

Move halt_attempted_poll to the helper as well so that all the stats are
updated in a single location.  While it's a bit odd to update the stat
well after the fact, practically speaking there's no meaningful advantage
to updating before polling.

Note, there is a functional change in addition to the success vs. fail
change.  The histogram updates previously called ktime_get() instead of
using "cur".  But that change is desirable as it means all the stats are
now updated with the same polling time, and avoids the extra ktime_get(),
which isn't expensive but isn't free either.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Refactor and document halt-polling stats update helper

Add a comment to document that halt-polling is considered successful even
if the polling loop itself didn't detect a wake event, i.e. if a wake
event was detect in the final kvm_vcpu_check_block(). Invert the param
to update helper so that the helper is a dumb function that is "told"
whether or not polling was successful, as opposed to determining success
based on blocking behavior.

Opportunistically tweak the params to the update helper to reduce the
line length for the call site so that it fits on a single line, and so
that the prototype conforms to the more traditional kernel style.

No functional change intended.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Update halt-polling stats if and only if halt-polling was attempted

Don't update halt-polling stats if halt-polling wasn't attempted. This is
a nop as @poll_ns is guaranteed to be '0' (poll_end == start); in a future
patch (to move the histogram stats into the helper), it will avoid to
avoid a discrepancy in what is considered a "successful" halt-poll.

No functional change intended.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Force PPC to define its own rcuwait object

Do not define/reference kvm_vcpu.wait if __KVM_HAVE_ARCH_WQP is true, and
instead force the architecture (PPC) to define its own rcuwait object.
Allowing common KVM to directly access vcpu->wait without a guard makes
it all too easy to introduce potential bugs, e.g. kvm_vcpu_block(),
kvm_vcpu_on_spin(), and async_pf_execute() all operate on vcpu->wait, not
the result of kvm_arch_vcpu_get_wait(), and so may do the wrong thing for
PPC.

Due to PPC's shenanigans with respect to callbacks and waits (it switches
to the virtual core's wait object at KVM_RUN!?!?), it's not clear whether
or not this fixes any bugs.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211009021236.4122790 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>