Git Repo - linux.git/log

KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_ENFORCE_CPUID

Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
for limiting Hyper-V features to those exposed to the guest in Hyper-V
CPUIDs (0x40000003, 0x40000004, ...).

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20210521095204.2161214 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

asm-generic/hyperv: add HV_STATUS_ACCESS_DENIED definition

From TLFSv6.0b, this status means: "The caller did not possess sufficient
access rights to perform the requested operation."

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Acked-by: Wei Liu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20210521095204.2161214 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: hyper-v: Direct Virtual Flush support

From Hyper-V TLFS:
"The hypervisor exposes hypercalls (HvFlushVirtualAddressSpace,
  HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressList, and
  HvFlushVirtualAddressListEx) that allow operating systems to more
  efficiently manage the virtual TLB. The L1 hypervisor can choose to
  allow its guest to use those hypercalls and delegate the responsibility
  to handle them to the L0 hypervisor. This requires the use of a
  partition assist page."

Add the Direct Virtual Flush support for SVM.

Related VMX changes:
commit 6f6a657c9998 ("KVM/Hyper-V/VMX: Add direct tlb flush support")

Signed-off-by: Vineeth Pillai <[email protected]>
Message-Id: <fc8d24d8eb7017266bb961e39a171b0caf298d7f.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: hyper-v: Enlightened MSR-Bitmap support

Enlightened MSR-Bitmap as per TLFS:

"The L1 hypervisor may collaborate with the L0 hypervisor to make MSR
  accesses more efficient. It can enable enlightened MSR bitmaps by setting
  the corresponding field in the enlightened VMCS to 1. When enabled, L0
  hypervisor does not monitor the MSR bitmaps for changes. Instead, the L1
  hypervisor must invalidate the corresponding clean field after making
  changes to one of the MSR bitmaps."

Enable this for SVM.

Related VMX changes:
commit ceef7d10dfb6 ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support")

Signed-off-by: Vineeth Pillai <[email protected]>
Message-Id: <87df0710f95d28b91cc4ea014fc4d71056eebbee.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: hyper-v: Remote TLB flush for SVM

Enable remote TLB flush for SVM.

Signed-off-by: Vineeth Pillai <[email protected]>
Message-Id: <1ee364e397e142aed662d2920d198cd03772f1a5.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Software reserved fields

SVM added support for certain reserved fields to be used by
software or hypervisor. Add the following reserved fields:
  - VMCB offset 0x3e0 - 0x3ff
  - Clean bit 31
  - SVM intercept exit code 0xf0000000

Later patches will make use of this for supporting Hyper-V
nested virtualization enhancements.

Signed-off-by: Vineeth Pillai <[email protected]>
Message-Id: <a1f17a43a8e9e751a1a9cc0281649d71bdbf721b.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx

Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.

Signed-off-by: Vineeth Pillai <[email protected]>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

hyperv: SVM enlightened TLB flush support flag

Bit 22 of HYPERV_CPUID_FEATURES.EDX is specific to SVM and specifies
support for enlightened TLB flush. With this enlightenment enabled,
ASID invalidations flushes only gva->hpa entries. To flush TLB entries
derived from NPT, hypercalls should be used
(HvFlushGuestPhysicalAddressSpace or HvFlushGuestPhysicalAddressList)

Signed-off-by: Vineeth Pillai <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Message-Id: <a060f872d0df1955e52e30b877b3300485edb27c.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

hyperv: Detect Nested virtualization support for SVM

Previously, to detect nested virtualization enlightenment support,
we were using HV_X64_ENLIGHTENED_VMCS_RECOMMENDED feature bit of
HYPERV_CPUID_ENLIGHTMENT_INFO.EAX CPUID as docuemented in TLFS:
"Bit 14: Recommend a nested hypervisor using the enlightened VMCS
interface. Also indicates that additional nested enlightenments
may be available (see leaf 0x4000000A)".

Enlightened VMCS, however, is an Intel only feature so the above
detection method doesn't work for AMD. So, use the
HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS.EAX CPUID information ("The
maximum input value for hypervisor CPUID information.") and this
works for both AMD and Intel.

Signed-off-by: Vineeth Pillai <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Message-Id: <43b25ff21cd2d9a51582033c9bdd895afefac056.1622730232 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: nSVM: Add a new VCPU statistic to show if VCPU is in guest mode

Add the following per-VCPU statistic to KVM debugfs to show if a given
VCPU is in guest mode:

guest_mode

Also add this as a per-VM statistic to KVM debugfs to show the total number
of VCPUs that are in guest mode in a given VM.

Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <20210609180340 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: nSVM: 'nested_run' should count guest-entry attempts that make it to guest code

Currently, the 'nested_run' statistic counts all guest-entry attempts,
including those that fail during vmentry checks on Intel and during
consistency checks on AMD. Convert this statistic to count only those
guest-entries that make it past these state checks and make it to guest
code. This will tell us the number of guest-entries that actually executed
or tried to execute guest code.

Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <20210609180340 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Drop "pre_" from enter/leave_smm() helpers

Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers. The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Drop .post_leave_smm(), i.e. the manual post-RSM MMU reset

Drop the .post_leave_smm() emulator callback, which at this point is just
a wrapper to kvm_mmu_reset_context(). The manual context reset is
unnecessary, because unlike enter_smm() which calls vendor MSR/CR helpers
directly, em_rsm() bounces through the KVM helpers, e.g. kvm_set_cr4(),
which are responsible for processing side effects. em_rsm() is already
subtly relying on this behavior as it doesn't manually do
kvm_update_cpuid_runtime(), e.g. to recognize CR4.OSXSAVE changes.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Rename SMM tracepoint to make it reflect reality

Rename the SMM tracepoint, which handles both entering and exiting SMM,
from kvm_enter_smm to kvm_smm_transition.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Move "entering SMM" tracepoint into kvm_smm_changed()

Invoke the "entering SMM" tracepoint from kvm_smm_changed() instead of
enter_smm(), effectively moving it from before reading vCPU state to
after reading state (but still before writing it to SMRAM!). The primary
motivation is to consolidate code, but calling the tracepoint from
kvm_smm_changed() also makes its invocation consistent with respect to
SMI and RSM, and with respect to KVM_SET_VCPU_EVENTS (which previously
only invoked the tracepoint when forcing the vCPU out of SMM).

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Move (most) SMM hflags modifications into kvm_smm_changed()

Move the core of SMM hflags modifications into kvm_smm_changed() and use
kvm_smm_changed() in enter_smm().  Clear HF_SMM_INSIDE_NMI_MASK for
leaving SMM but do not set it for entering SMM.  If the vCPU is executing
outside of SMM, the flag should unequivocally be cleared, e.g. this
technically fixes a benign bug where the flag could be left set after
KVM_SET_VCPU_EVENTS, but the reverse is not true as NMI blocking depends
on pre-SMM state or userspace input.

Note, this adds an extra kvm_mmu_reset_context() to enter_smm().  The
extra/early reset isn't strictly necessary, and in a way can never be
necessary since the vCPU/MMU context is in a half-baked state until the
final context reset at the end of the function.  But, enter_smm() is not
a hot path, and exploding on an invalid root_hpa is probably better than
having a stale SMM flag in the MMU role; it's at least no worse.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Invoke kvm_smm_changed() immediately after clearing SMM flag

Move RSM emulation's call to kvm_smm_changed() from .post_leave_smm() to
.exiting_smm(), leaving behind the MMU context reset.  The primary
motivation is to allow for future cleanup, but this also fixes a bug of
sorts by queueing KVM_REQ_EVENT even if RSM causes shutdown, e.g. to let
an INIT wake the vCPU from shutdown.  Of course, KVM doesn't properly
emulate a shutdown state, e.g. KVM doesn't block SMIs after shutdown, and
immediately exits to userspace, so the event request is a moot point in
practice.

Moving kvm_smm_changed() also moves the RSM tracepoint.  This isn't
strictly necessary, but will allow consolidating the SMI and RSM
tracepoints in a future commit (by also moving the SMI tracepoint).
Invoking the tracepoint before loading SMRAM state also means the SMBASE
that reported in the tracepoint will point that the state that will be
used for RSM, as opposed to the SMBASE _after_ RSM completes, which is
arguably a good thing if the tracepoint is being used to debug a RSM/SMM
issue.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Replace .set_hflags() with dedicated .exiting_smm() helper

Replace the .set_hflags() emulator hook with a dedicated .exiting_smm(),
moving the SMM and SMM_INSIDE_NMI flag handling out of the emulator in
the process. This is a step towards consolidating much of the logic in
kvm_smm_changed(), including the SMM hflags updates.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Emulate triple fault shutdown if RSM emulation fails

Use the recently introduced KVM_REQ_TRIPLE_FAULT to properly emulate
shutdown if RSM from SMM fails.

Note, entering shutdown after clearing the SMM flag and restoring NMI
blocking is architecturally correct with respect to AMD's APM, which KVM
also uses for SMRAM layout and RSM NMI blocking behavior.  The APM says:

  An RSM causes a processor shutdown if an invalid-state condition is
  found in the SMRAM state-save area. Only an external reset, external
  processor-initialization, or non-maskable external interrupt (NMI) can
  cause the processor to leave the shutdown state.

Of note is processor-initialization (INIT) as a valid shutdown wake
event, as INIT is blocked by SMM, implying that entering shutdown also
forces the CPU out of SMM.

For recent Intel CPUs, restoring NMI blocking is technically wrong, but
so is restoring NMI blocking in the first place, and Intel's RSM
"architecture" is such a mess that just about anything is allowed and can
be justified as micro-architectural behavior.

Per the SDM:

  On Pentium 4 and later processors, shutdown will inhibit INTR and A20M
  but will not change any of the other inhibits. On these processors,
  NMIs will be inhibited if no action is taken in the SMI handler to
  uninhibit them (see Section 34.8).

where Section 34.8 says:

  When the processor enters SMM while executing an NMI handler, the
  processor saves the SMRAM state save map but does not save the
  attribute to keep NMI interrupts disabled. Potentially, an NMI could be
  latched (while in SMM or upon exit) and serviced upon exit of SMM even
  though the previous NMI handler has still not completed.

I.e. RSM unconditionally unblocks NMI, but shutdown on RSM does not,
which is in direct contradiction of KVM's behavior.  But, as mentioned
above, KVM follows AMD architecture and restores NMI blocking on RSM, so
that micro-architectural detail is already lost.

And for Pentium era CPUs, SMI# can break shutdown, meaning that at least
some Intel CPUs fully leave SMM when entering shutdown:

  In the shutdown state, Intel processors stop executing instructions
  until a RESET#, INIT# or NMI# is asserted.  While Pentium family
  processors recognize the SMI# signal in shutdown state, P6 family and
  Intel486 processors do not.

In other words, the fact that Intel CPUs have implemented the two
extremes gives KVM carte blanche when it comes to honoring Intel's
architecture for handling shutdown during RSM.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
[Return X86EMUL_CONTINUE after triple fault. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Drop vendor specific functions for APICv/AVIC enablement

Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.

No functional change intended.

Reviewed-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20210609150911.1471882 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Use common 'enable_apicv' variable for both APICv and AVIC

Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.

No functional change intended.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20210609150911.1471882 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: implement KVM PM-notifier

Implement PM hibernation/suspend prepare notifiers so that KVM
can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly
suspend VMs.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Message-Id: <20210606021045 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: add PM-notifier

Add KVM PM-notifier so that architectures can have arch-specific
VM suspend/resume routines. Such architectures need to select
CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier().

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Message-Id: <20210606021045 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Introduce x2APIC register manipulation functions

Standardize reads and writes of the x2APIC MSRs.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Hoist APIC functions out of individual tests

Move the APIC functions into the library to encourage code reuse and
to avoid unintended deviations.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Move APIC definitions into a separate file

Processor.h is a hodgepodge of definitions. Though the local APIC is
technically built into the CPU these days, move the APIC definitions
into a new header file: apic.h.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable

Don't allow posted interrupts to modify a stale posted interrupt
descriptor (including the initial value of 0).

Empirical tests on real hardware reveal that a posted interrupt
descriptor referencing an unbacked address has PCI bus error semantics
(reads as all 1's; writes are ignored). However, kvm can't distinguish
unbacked addresses from device-backed (MMIO) addresses, so it should
really ask userspace for an MMIO completion. That's overly
complicated, so just punt with KVM_INTERNAL_ERROR.

Don't return the error until the posted interrupt descriptor is
actually accessed. We don't want to break the existing kvm-unit-tests
that assume they can launch an L2 VM with a posted interrupt
descriptor that references MMIO space in L1.

Fixes: 6beb7bd52e48 ("kvm: nVMX: Refactor nested_get_vmcs12_pages()")
Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Fail on MMIO completion for nested posted interrupts

When the kernel has no mapping for the vmcs02 virtual APIC page,
userspace MMIO completion is necessary to process nested posted
interrupts. This is not a configuration that KVM supports. Rather than
silently ignoring the problem, try to exit to userspace with
KVM_INTERNAL_ERROR.

Note that the event that triggers this error is consumed as a
side-effect of a call to kvm_check_nested_events. On some paths
(notably through kvm_vcpu_check_block), the error is dropped. In any
case, this is an incremental improvement over always ignoring the
error.

Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Add a return code to kvm_apic_accept_events

No functional change intended. At present, the only negative value
returned by kvm_check_nested_events is -EBUSY.

Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Add a return code to inject_pending_event

No functional change intended. At present, 'r' will always be -EBUSY
on a control transfer to the 'out' label.

Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Add a return code to vmx_complete_nested_posted_interrupt

No functional change intended.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Remove guest mode check from kvm_check_nested_events

A survey of the callsites reveals that they all ensure the vCPU is in
guest mode before calling kvm_check_nested_events. Remove this dead
code so that the only negative value this function returns (at the
moment) is -EBUSY.

Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <20210604172611 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: x86: Add vmx_nested_tsc_scaling_test

Test that nested TSC scaling works as expected with both L1 and L2
scaled.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Enable nested TSC scaling

Calculate the TSC offset and multiplier on nested transitions and expose
the TSC scaling feature to L1.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add vendor callbacks for writing the TSC multiplier

Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.

Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Ilias Stamatis <[email protected]>
Message-Id: <20210607105438 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Move write_l1_tsc_offset() logic to common code and rename it

The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.

Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add functions that calculate the nested TSC fields

When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 and 12 together.

The merging is done using the following equations:
offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
mult_02 = (mult_01 * mult_12) >> shift_bits

Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add functions for retrieving L2 TSC fields from common code

In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Add a TSC multiplier field in VMCS12

This is required for supporting nested TSC scaling.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add a ratio parameter to kvm_scale_tsc()

Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Rename kvm_compute_tsc_offset() to kvm_compute_l1_tsc_offset()

All existing code uses kvm_compute_tsc_offset() passing L1 TSC values to
it. Let's document this by renaming it to kvm_compute_l1_tsc_offset().

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'

Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

math64.h: Add mul_s64_u64_shr()

This function is needed for KVM's nested virtualization. The nested TSC
scaling implementation requires multiplying the signed TSC offset with
the unsigned TSC multiplier.

Signed-off-by: Ilias Stamatis <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20210526184418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Lazily allocate memslot rmaps

If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used. (i.e. a nested VM is launched.) This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.

Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Skip rmap operations if rmaps not allocated

If only the TDP MMU is being used to manage the memory mappings for a VM,
then many rmap operations can be skipped as they are guaranteed to be
no-ops. This saves some time which would be spent on the rmap operation.
It also avoids acquiring the MMU lock in write mode for many operations.

This makes it safe to run the VM without rmaps allocated, when only
using the TDP MMU and sets the stage for waiting to allocate the rmaps
until they're needed.

Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Add a field to control memslot rmap allocation

Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
oprtation.

No functional change expected.

Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: mmu: Add slots_arch_lock for memslot arch fields

Add a new lock to protect the arch-specific fields of memslots if they
need to be modified in a kvm->srcu read critical section. A future
commit will use this lock to lazily allocate memslot rmaps for x86.

Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
[Add Documentation/ hunk. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: mmu: Refactor memslot copy

Factor out copying kvm_memslots from allocating the memory for new ones
in preparation for adding a new lock to protect the arch-specific fields
of the memslots.

No functional change intended.

Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Factor out allocating memslot rmap

Small refactor to facilitate allocating rmaps for all memslots at once.

No functional change expected.

Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Deduplicate rmap freeing

Small code deduplication. No functional change expected.

Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Message-Id: <20210518173414 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Do not write protect huge page in initially-all-set mode

Currently, when dirty logging is started in initially-all-set mode,
we write protect huge pages to prepare for splitting them into
4K pages, and leave normal pages untouched as the logging will
be enabled lazily as dirty bits are cleared.

However, enabling dirty logging lazily is also feasible for huge pages.
This not only reduces the time of start dirty logging, but it also
greatly reduces side-effect on guest when there is high dirty rate.

Signed-off-by: Keqian Zhu <[email protected]>
Message-Id: <20210429034115 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Support write protecting only large pages

Prepare for write protecting large page lazily during dirty log tracking,
for which we will only need to write protect gfns at large page
granularity.

No functional or performance change expected.

Signed-off-by: Keqian Zhu <[email protected]>
Message-Id: <20210429034115 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: hyper-v: Advertise support for fast XMM hypercalls

Now that kvm_hv_flush_tlb() has been patched to support XMM hypercall
inputs, we can start advertising this feature to guests.

Cc: Alexander Graf <[email protected]>
Cc: Evgeny Iakovlev <[email protected]>
Signed-off-by: Siddharth Chandrasekaran <[email protected]>
Message-Id: <e63fc1c61dd2efecbefef239f4f0a598bd552750.1622019134 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: kvm_hv_flush_tlb use inputs from XMM registers

Hyper-V supports the use of XMM registers to perform fast hypercalls.
This allows guests to take advantage of the improved performance of the
fast hypercall interface even though a hypercall may require more than
(the current maximum of) two input registers.

The XMM fast hypercall interface uses six additional XMM registers (XMM0
to XMM5) to allow the guest to pass an input parameter block of up to
112 bytes.

Add framework to read from XMM registers in kvm_hv_hypercall() and use
the additional hypercall inputs from XMM registers in kvm_hv_flush_tlb()
when possible.

Cc: Alexander Graf <[email protected]>
Co-developed-by: Evgeny Iakovlev <[email protected]>
Signed-off-by: Evgeny Iakovlev <[email protected]>
Signed-off-by: Siddharth Chandrasekaran <[email protected]>
Message-Id: <fc62edad33f1920fe5c74dde47d7d0b4275a9012.1622019134 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: hyper-v: Collect hypercall params into struct

As of now there are 7 parameters (and flags) that are used in various
hyper-v hypercall handlers. There are 6 more input/output parameters
passed from XMM registers which are to be added in an upcoming patch.

To make passing arguments to the handlers more readable, capture all
these parameters into a single structure.

Cc: Alexander Graf <[email protected]>
Cc: Evgeny Iakovlev <[email protected]>
Signed-off-by: Siddharth Chandrasekaran <[email protected]>
Message-Id: <273f7ed510a1f6ba177e61b73a5c7bfbee4a4a87.1622019133 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Move FPU register accessors into fpu.h

Hyper-v XMM fast hypercalls use XMM registers to pass input/output
parameters. To access these, hyperv.c can reuse some FPU register
accessors defined in emulator.c. Move them to a common location so both
can access them.

While at it, reorder the parameters of these accessor methods to make
them more readable.

Cc: Alexander Graf <[email protected]>
Cc: Evgeny Iakovlev <[email protected]>
Signed-off-by: Siddharth Chandrasekaran <[email protected]>
Message-Id: <01a85a6560714d4d3637d3d86e5eba65073318fa.1622019133 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Make is_nx_huge_page_enabled an inline function

Function 'is_nx_huge_page_enabled' is called only by kvm/mmu, so make
it as inline fucntion and remove the unnecessary declaration.

Cc: Ben Gardon <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
Message-Id: <1622102271 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Fix kvm_check_cap() assertion

KVM_CHECK_EXTENSION ioctl can return any negative value on error,
and not necessarily -1. Change the assertion to reflect that.

Signed-off-by: Fuad Tabba <[email protected]>
Message-Id: <20210615150443.1183365 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU

Calculate and check the full mmu_role when initializing the MMU context
for the nested MMU, where "full" means the bits and pieces of the role
that aren't handled by kvm_calc_mmu_role_common().  While the nested MMU
isn't used for shadow paging, things like the number of levels in the
guest's page tables are surprisingly important when walking the guest
page tables.  Failure to reinitialize the nested MMU context if L2's
paging mode changes can result in unexpected and/or missed page faults,
and likely other explosions.

E.g. if an L1 vCPU is running both a 32-bit PAE L2 and a 64-bit L2, the
"common" role calculation will yield the same role for both L2s.  If the
64-bit L2 is run after the 32-bit PAE L2, L0 will fail to reinitialize
the nested MMU context, ultimately resulting in a bad walk of L2's page
tables as the MMU will still have a guest root_level of PT32E_ROOT_LEVEL.

  WARNING: CPU: 4 PID: 167334 at arch/x86/kvm/vmx/vmx.c:3075 ept_save_pdptrs+0x15/0xe0 [kvm_intel]
  Modules linked in: kvm_intel]
  CPU: 4 PID: 167334 Comm: CPU 3/KVM Not tainted 5.13.0-rc1-d849817d5673-reqs #185
  Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
  RIP: 0010:ept_save_pdptrs+0x15/0xe0 [kvm_intel]
  Code: <0f> 0b c3 f6 87 d8 02 00f
  RSP: 0018:ffffbba702dbba00 EFLAGS: 00010202
  RAX: 0000000000000011 RBX: 0000000000000002 RCX: ffffffff810a2c08
  RDX: ffff91d7bc30acc0 RSI: 0000000000000011 RDI: ffff91d7bc30a600
  RBP: ffff91d7bc30a600 R08: 0000000000000010 R09: 0000000000000007
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff91d7bc30a600
  R13: ffff91d7bc30acc0 R14: ffff91d67c123460 R15: 0000000115d7e005
  FS:  00007fe8e9ffb700(0000) GS:ffff91d90fb00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000029f15a001 CR4: 00000000001726e0
  Call Trace:
   kvm_pdptr_read+0x3a/0x40 [kvm]
   paging64_walk_addr_generic+0x327/0x6a0 [kvm]
   paging64_gva_to_gpa_nested+0x3f/0xb0 [kvm]
   kvm_fetch_guest_virt+0x4c/0xb0 [kvm]
   __do_insn_fetch_bytes+0x11a/0x1f0 [kvm]
   x86_decode_insn+0x787/0x1490 [kvm]
   x86_decode_emulated_instruction+0x58/0x1e0 [kvm]
   x86_emulate_instruction+0x122/0x4f0 [kvm]
   vmx_handle_exit+0x120/0x660 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xe25/0x1cb0 [kvm]
   kvm_vcpu_ioctl+0x211/0x5a0 [kvm]
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x40/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Vitaly Kuznetsov <[email protected]>
Cc: [email protected]
Fixes: bf627a928837 ("x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210610220026.1364486 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Fix x86_emulator slab cache leak

Commit c9b8b07cded58 (KVM: x86: Dynamically allocate per-vCPU emulation context)
tries to allocate per-vCPU emulation context dynamically, however, the
x86_emulator slab cache is still exiting after the kvm module is unload
as below after destroying the VM and unloading the kvm module.

grep x86_emulator /proc/slabinfo
x86_emulator 36 36 2672 12 8 : tunables 0 0 0 : slabdata 3 3 0

This patch fixes this slab cache leak by destroying the x86_emulator slab cache
when the kvm module is unloaded.

Fixes: c9b8b07cded58 (KVM: x86: Dynamically allocate per-vCPU emulation context)
Cc: [email protected]
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1623387573 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Call SEV Guest Decommission if ASID binding fails

Send SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding
fails. If a failure happens after a successful LAUNCH_START command,
a decommission command should be executed. Otherwise, guest context
will be unfreed inside the AMD SP. After the firmware will not have
memory to allocate more SEV guest context, LAUNCH_START command will
begin to fail with SEV_RET_RESOURCE_LIMIT error.

The existing code calls decommission inside sev_unbind_asid, but it is
not called if a failure happens before guest activation succeeds. If
sev_bind_asid fails, decommission is never called. PSP firmware has a
limit for the number of guests. If sev_asid_binding fails many times,
PSP firmware will not have resources to create another guest context.

Cc: [email protected]
Fixes: 59414c989220 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
Reported-by: Peter Gonda <[email protected]>
Signed-off-by: Alper Gun <[email protected]>
Reviewed-by: Marc Orr <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20210610174604.2554090 [email protected]>

KVM: x86: Immediately reset the MMU context when the SMM flag is cleared

Immediately reset the MMU context when the vCPU's SMM flag is cleared so
that the SMM flag in the MMU role is always synchronized with the vCPU's
flag.  If RSM fails (which isn't correctly emulated), KVM will bail
without calling post_leave_smm() and leave the MMU in a bad state.

The bad MMU role can lead to a NULL pointer dereference when grabbing a
shadow page's rmap for a page fault as the initial lookups for the gfn
will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
an entirely different set of memslots, and so the initial lookup can find
a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).

  general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
  RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
  Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
  RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
  RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
  R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
  R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
  FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
   mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
   __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
   direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
   kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
   kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
   vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
   vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
   vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
   kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
   kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:1069 [inline]
   __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
   do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x440ce9

Cc: [email protected]
Reported-by: [email protected]
Fixes: 9ec19493fb86 ("KVM: x86: clear SMM flags before loading state while leaving SMM")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210609185619 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: PPC: Book3S HV: remove ISA v3.0 and v3.1 support from P7/8 path

POWER9 and later processors always go via the P9 guest entry path now.
Remove the remaining support from the P7/8 path.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: implement hash host / hash guest support

Implement support for hash guests under hash host. This has to save and
restore the host SLB, and ensure that the MMU is off while switching
into the guest SLB.

POWER9 and later CPUs now always go via the P9 path. The "fast" guest
mode is now renamed to the P9 mode, which is consistent with its
functionality and the rest of the naming.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: implement hash guest support

Implement hash guest support. Guest entry/exit has to restore and
save/clear the SLB, plus several other bits to accommodate hash guests
in the P9 path. Radix host, hash guest support is removed from the P7/8
path.

The HPT hcalls and faults are not handled in real mode, which is a
performance regression. A worst-case fork/exit microbenchmark takes 3x
longer after this patch. kbuild benchmark performance is in the noise,
but the slowdown is likely to be noticed somewhere.

For now, accept this penalty for the benefit of simplifying the P7/8
paths and unifying P9 hash with the new code, because hash is a less
important configuration than radix on processors that support it. Hash
will benefit from future optimisations to this path, including possibly
a faster path to handle such hcalls and interrupts without doing a full
exit.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Reflect userspace hcalls to hash guests to support PR KVM

The reflection of sc 1 interrupts from guest PR=1 to the guest kernel is
required to support a hash guest running PR KVM where its guest is
making hcalls with sc 1.

In preparation for hash guest support, add this hcall reflection to the
P9 path. The P7/8 path does this in its realmode hcall handler
(sc_1_fast_return).

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: add virtual mode handlers for HPT hcalls and page faults

In order to support hash guests in the P9 path (which does not do real
mode hcalls or page fault handling), these real-mode hash specific
interrupts need to be implemented in virt mode.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: small pseries_do_hcall cleanup

Functionality should not be changed.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Allow all P9 processors to enable nested HV

All radix guests go via the P9 path now, so there is no need to limit
nested HV to processors that support "mixed mode" MMU. Remove the
restriction.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: Remove unused nested HV tests in XICS emulation

Commit f3c18e9342a44 ("KVM: PPC: Book3S HV: Use XICS hypercalls when
running as a nested hypervisor") added nested HV tests in XICS
hypercalls, but not all are required.

* icp_eoi is only called by kvmppc_deliver_irq_passthru which is only
  called by kvmppc_check_passthru which is only caled by
  kvmppc_read_one_intr.

* kvmppc_read_one_intr is only called by kvmppc_read_intr which is only
  called by the L0 HV rmhandlers code.

* kvmhv_rm_send_ipi is called by:
  - kvmhv_interrupt_vcore which is only called by kvmhv_commence_exit
    which is only called by the L0 HV rmhandlers code.
  - icp_send_hcore_msg which is only called by icp_rm_set_vcpu_irq.
  - icp_rm_set_vcpu_irq which is only called by icp_rm_try_update
  - icp_rm_set_vcpu_irq is not nested HV safe because it writes to
    LPCR directly without a kvmhv_on_pseries test. Nested handlers
    should not in general be using the rm handlers.

The important test seems to be in kvmppc_ipi_thread, which sends the
virt-mode H_IPI handler kick to use smp_call_function rather than
msgsnd.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: Remove virt mode checks from real mode handlers

Now that the P7/8 path no longer supports radix, real-mode handlers
do not need to deal with being called in virt mode.

This change effectively reverts commit acde25726bc6 ("KVM: PPC: Book3S
HV: Add radix checks in real-mode hypercall handlers").

It removes a few more real-mode tests in rm hcall handlers, which
allows the indirect ops for the xive module to be removed from the
built-in xics rm handlers.

kvmppc_h_random is renamed to kvmppc_rm_h_random to be a bit more
descriptive and consistent with other rm handlers.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: Remove radix guest support from P7/8 path

The P9 path now runs all supported radix guest combinations, so
remove radix guest support from the P7/8 path.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: Remove support for dependent threads mode on P9

Dependent-threads mode is the normal KVM mode for pre-POWER9 SMT
processors, where all threads in a core (or subcore) would run the same
partition at the same time, or they would run the host.

This design was mandated by MMU state that is shared between threads in
a processor, so the synchronisation point is in hypervisor real-mode
that has essentially no shared state, so it's safe for multiple threads
to gather and switch to the correct mode.

It is implemented by having the host unplug all secondary threads and
always run in SMT1 mode, and host QEMU threads essentially represent
virtual cores that wake these secondary threads out of unplug when the
ioctl is called to run the guest. This happens via a side-path that is
mostly invisible to the rest of the Linux host and the secondary threads
still appear to be unplugged.

POWER9 / ISA v3.0 has a more flexible MMU design that is independent
per-thread and allows a much simpler KVM implementation. Before the new
"P9 fast path" was added that began to take advantage of this, POWER9
support was implemented in the existing path which has support to run
in the dependent threads mode. So it was not much work to add support to
run POWER9 in this dependent threads mode.

The mode is not required by the POWER9 MMU (although "mixed-mode" hash /
radix MMU limitations of early processors were worked around using this
mode). But it is one way to run SMT guests without running different
guests or guest and host on different threads of the same core, so it
could avoid or reduce some SMT attack surfaces without turning off SMT
entirely.

This security feature has some real, if indeterminate, value. However
the old path is lagging in features (nested HV), and with this series
the new P9 path adds remaining missing features (radix prefetch bug
and hash support, in later patches), so POWER9 dependent threads mode
support would be the only remaining reason to keep that code in and keep
supporting POWER9/POWER10 in the old path. So here we make the call to
drop this feature.

Remove dependent threads mode support for POWER9 and above processors.
Systems can still achieve this security by disabling SMT entirely, but
that would generally come at a larger performance cost for guests.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV: Implement radix prefetch workaround by disabling MMU

Rather than partition the guest PID space + flush a rogue guest PID to
work around this problem, instead fix it by always disabling the MMU when
switching in or out of guest MMU context in HV mode.

This may be a bit less efficient, but it is a lot less complicated and
allows the P9 path to trivally implement the workaround too. Newer CPUs
are not subject to this issue.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Switch to guest MMU context as late as possible

Move MMU context switch as late as reasonably possible to minimise code
running with guest context switched in. This becomes more important when
this code may run in real-mode, with later changes.

Move WARN_ON as early as possible so program check interrupts are less
likely to tangle everything up.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Add helpers for OS SPR handling

This is a first step to wrapping supervisor and user SPR saving and
loading up into helpers, which will then be called independently in
bare metal and nested HV cases in order to optimise SPR access.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Move SPR loading after expiry time check

This is wasted work if the time limit is exceeded.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Improve exit timing accounting coverage

The C conversion caused exit timing to become a bit cramped. Expand it
to cover more of the entry and exit code.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Read machine check registers while MSR[RI] is 0

SRR0/1, DAR, DSISR must all be protected from machine check which can
clobber them. Ensure MSR[RI] is clear while they are live.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: inline kvmhv_load_hv_regs_and_go into __kvmhv_vcpu_entry_p9

Now the initial C implementation is done, inline more HV code to make
rearranging things easier.

And rename __kvmhv_vcpu_entry_p9 to drop the leading underscores as it's
now C, and is now a more complete vcpu entry.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

Almost all logic is moved to C, by introducing a new in_guest mode for
the P9 path that branches very early in the KVM interrupt handler to P9
exit code.

The main P9 entry and exit assembly is now only about 160 lines of low
level stack setup and register save/restore, plus a bad-interrupt
handler.

There are two motivations for this, the first is just make the code more
maintainable being in C. The second is to reduce the amount of code
running in a special KVM mode, "realmode". In quotes because with radix
it is no longer necessarily real-mode in the MMU, but it still has to be
treated specially because it may be in real-mode, and has various
important registers like PID, DEC, TB, etc set to guest. This is hostile
to the rest of Linux and can't use arbitrary kernel functionality or be
instrumented well.

This initial patch is a reasonably faithful conversion of the asm code,
but it does lack any loop to return quickly back into the guest without
switching out of realmode in the case of unimportant or easily handled
interrupts. As explained in previous changes, handling HV interrupts
very quickly in this low level realmode is not so important for P9
performance, and are important to avoid for security, observability,
debugability reasons.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

In the interest of minimising the amount of code that is run in
"real-mode", don't handle hcalls in real mode in the P9 path. This
requires some new handlers for H_CEDE and xics-on-xive to be added
before xive is pulled or cede logic is checked.

This introduces a change in radix guest behaviour where radix guests
that execute 'sc 1' in userspace now get a privilege fault whereas
previously the 'sc 1' would be reflected as a syscall interrupt to the
guest kernel. That reflection is only required for hash guests that run
PR KVM.

Background:

In POWER8 and earlier processors, it is very expensive to exit from the
HV real mode context of a guest hypervisor interrupt, and switch to host
virtual mode. On those processors, guest->HV interrupts reach the
hypervisor with the MMU off because the MMU is loaded with guest context
(LPCR, SDR1, SLB), and the other threads in the sub-core need to be
pulled out of the guest too. Then the primary must save off guest state,
invalidate SLB and ERAT, and load up host state before the MMU can be
enabled to run in host virtual mode (~= regular Linux mode).

Hash guests also require a lot of hcalls to run due to the nature of the
MMU architecture and paravirtualisation design. The XICS interrupt
controller requires hcalls to run.

So KVM traditionally tries hard to avoid the full exit, by handling
hcalls and other interrupts in real mode as much as possible.

By contrast, POWER9 has independent MMU context per-thread, and in radix
mode the hypervisor is in host virtual memory mode when the HV interrupt
is taken. Radix guests do not require significant hcalls to manage their
translations, and xive guests don't need hcalls to handle interrupts. So
it's much less important for performance to handle hcalls in real mode on
POWER9.

One caveat is that the TCE hcalls are performance critical, real-mode
variants introduced for POWER8 in order to achieve 10GbE performance.
Real mode TCE hcalls were found to be less important on POWER9, which
was able to drive 40GBe networking without them (using the virt mode
hcalls) but performance is still important. These hcalls will benefit
from subsequent guest entry/exit optimisation including possibly a
faster "partial exit" that does not entirely switch to host context to
handle the hcall.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Move radix MMU switching instructions together

Switching the MMU from radix<->radix mode is tricky particularly as the
MMU can remain enabled and requires a certain sequence of SPR updates.
Move these together into their own functions.

This also includes the radix TLB check / flush because it's tied in to
MMU switching due to tlbiel getting LPID from LPIDR.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Move xive vcpu context management into kvmhv_p9_guest_entry

Move the xive management up so the low level register switching can be
pushed further down in a later patch. XIVE MMIO CI operations can run in
higher level code with machine checks, tracing, etc., available.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

irq_work's use of the DEC SPR is racy with guest<->host switch and guest
entry which flips the DEC interrupt to guest, which could lose a host
work interrupt.

This patch closes one race, and attempts to comment another class of
races.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest LPCR

LPCR[HDICE]=0 suppresses hypervisor decrementer exceptions on some
processors, so it must be enabled before HDEC is set.

Rather than set it in the host LPCR then setting HDEC, move the HDEC
update to after the guest MMU context (including LPCR) is loaded.
There shouldn't be much concern with delaying HDEC by some 10s or 100s
of nanoseconds by setting it a bit later.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C

This is more symmetric with kvmppc_xive_push_vcpu, and has the advantage
that it runs with the MMU on.

The extra test added to the asm will go away with a future change.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: Minimise hcall handler calling convention differences

This sets up the same calling convention from interrupt entry to
KVM interrupt handler for system calls as exists for other interrupt
types.

This is a better API, it uses a save area rather than SPR, and it has
more registers free to use. Using a single common API helps maintain
it, and it becomes easier to use in C in a later patch.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: move bad_host_intr check to HV handler

The bad_host_intr check will never be true with PR KVM, move
it to HV code.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: Move interrupt early register setup to KVM

Like the earlier patch for hcalls, KVM interrupt entry requires a
different calling convention than the Linux interrupt handlers
set up. Move the code that converts from one to the other into KVM.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: Move hcall early register setup to KVM

System calls / hcalls have a different calling convention than
other interrupts, so there is code in the KVMTEST to massage these
into the same form as other interrupt handlers.

Move this work into the KVM hcall handler. This means teaching KVM
a little more about the low level interrupt handler setup, PACA save
areas, etc., although that's not obviously worse than the current
approach of coming up with an entirely different interrupt register
/ save convention.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: add hcall interrupt handler

Add a separate hcall entry point. This can be used to deal with the
different calling convention.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Daniel Axtens <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

Move the GUEST_MODE_SKIP logic into KVM code. This is quite a KVM
internal detail that has no real need to be in common handlers.

Add a comment explaining the what and why of KVM "skip" interrupts.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Daniel Axtens <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point

Rather than bifurcate the call depending on whether or not HV is
possible, and have the HV entry test for PR, just make a single
common point which does the demultiplexing. This makes it simpler
to add another type of exit handler.

Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Daniel Axtens <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Acked-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: x86: Fix fall-through warnings for Clang

In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding break statements instead of just letting
the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Message-Id: <20210528200756.GA39320@embeddedor>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: fix doc warnings

Fix kernel-doc warnings:

arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'activate' not described in 'avic_update_access_page'
arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'kvm' not described in 'avic_update_access_page'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'e' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'kvm' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'svm' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'vcpu_info' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:1009: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

Signed-off-by: ChenXiaoSong <[email protected]>
Message-Id: <20210609122217.2967131 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Fix compiling errors when initializing the static structure

Errors like below were produced from test_util.c when compiling the KVM
selftests on my local platform.

lib/test_util.c: In function 'vm_mem_backing_src_alias':
lib/test_util.c:177:12: error: initializer element is not constant
.flag = anon_flags,
^~~~~~~~~~
lib/test_util.c:177:12: note: (near initialization for 'aliases[0].flag')

The reason is that we are using non-const expressions to initialize the
static structure, which will probably trigger a compiling error/warning
on stricter GCC versions. Fix it by converting the two const variables
"anon_flags" and "anon_huge_flags" into more stable macros.

Fixes: b3784bc28ccc0 ("KVM: selftests: refactor vm_mem_backing_src_type flags")
Reported-by: Zenghui Yu <[email protected]>
Signed-off-by: Yanan Wang <[email protected]>
Message-Id: <20210610085418 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: LAPIC: Restore guard to prevent illegal APIC register access

Per the SDM, "any access that touches bytes 4 through 15 of an APIC
register may cause undefined behavior and must not be executed."
Worse, such an access in kvm_lapic_reg_read can result in a leak of
kernel stack contents. Prior to commit 01402cf81051 ("kvm: LAPIC:
write down valid APIC registers"), such an access was explicitly
disallowed. Restore the guard that was removed in that commit.

Fixes: 01402cf81051 ("kvm: LAPIC: write down valid APIC registers")
Signed-off-by: Jim Mattson <[email protected]>
Reported-by: syzbot <[email protected]>
Message-Id: <20210602205224.3189316 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: fix previous commit for 32-bit builds

array_index_nospec does not work for uint64_t on 32-bit builds.
However, the size of a memory slot must be less than 20 bits wide
on those system, since the memory slot must fit in the user
address space. So just store it in an unsigned long.

Signed-off-by: Paolo Bonzini <[email protected]>

kvm: avoid speculation-based attacks from out-of-range memslot accesses

KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot.  The translation is
performed in __gfn_to_hva_memslot using the following formula:

      hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE

It is expected that gfn falls within the boundaries of the guest's
physical memory.  However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.

__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot.  While __gfn_to_memslot
does check that the gfn falls within the boundaries of the guest's
physical memory or not, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva.  If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.

Right now it's not clear if there are any cases in which this is
exploitable.  One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86.  Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier.  However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space.  Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadgets.  Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.

Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached.  This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.

Reported-by: Artemiy Margaritov <[email protected]>
Co-developed-by: Artemiy Margaritov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>