4 years agoKVM: x86/mmu: Refactor the zap loop for recovering NX lpages
Sean Christopherson [Wed, 23 Sep 2020 18:37:29 +0000 (11:37 -0700)]
KVM: x86/mmu: Refactor the zap loop for recovering NX lpages

Refactor the zap loop in kvm_recover_nx_lpages() to be a for loop that
iterates on to_zap and drop the !to_zap check that leads to the in-loop
calling of kvm_mmu_commit_zap_page().  The in-loop commit when to_zap
hits zero is superfluous now that there's an unconditional commit after
the loop to handle the case where lpage_disallowed_mmu_pages is emptied.
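
As a rough sketch (names per kvm_recover_nx_lpages() at the time, not the
verbatim diff), the combined result of this patch and the fix in the next
entry is a bounded for loop with a single unconditional commit at the end:

    for ( ; to_zap; --to_zap) {
        if (list_empty(&kvm->arch.lpage_disallowed_mmu_pages))
            break;

        sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
                              struct kvm_mmu_page, lpage_disallowed_link);
        kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

        if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
            kvm_mmu_commit_zap_page(kvm, &invalid_list);
            cond_resched_lock(&kvm->mmu_lock);
        }
    }
    kvm_mmu_commit_zap_page(kvm, &invalid_list);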

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923183735[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Commit zap of remaining invalid pages when recovering lpages
Sean Christopherson [Wed, 23 Sep 2020 18:37:28 +0000 (11:37 -0700)]
KVM: x86/mmu: Commit zap of remaining invalid pages when recovering lpages

Call kvm_mmu_commit_zap_page() after exiting the "prepare zap" loop in
kvm_recover_nx_lpages() to finish zapping pages in the unlikely event
that the loop exited due to lpage_disallowed_mmu_pages being empty.
Because the recovery thread drops mmu_lock when rescheduling, it's
possible that lpage_disallowed_mmu_pages could be emptied by a different
thread without to_zap reaching zero despite to_zap being derived from
the number of disallowed lpages.

Fixes: 1aa9b9572b105 ("kvm: x86: mmu: Recovery of shattered NX large pages")
Cc: Junaid Shahid <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923183735[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Rename ops.h to vmx_ops.h
Sean Christopherson [Wed, 23 Sep 2020 18:31:12 +0000 (11:31 -0700)]
KVM: VMX: Rename ops.h to vmx_ops.h

Rename ops.h to vmx_ops.h to allow adding a tdx_ops.h in the future
without causing massive confusion.

Trust Domain Extensions (TDX) is built on VMX, but KVM cannot directly
access the VMCS(es) for a TDX guest, thus TDX will need its own "ops"
implementation for wrapping the low level operations.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923183112[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Extract posted interrupt support to separate files
Xiaoyao Li [Wed, 23 Sep 2020 18:31:11 +0000 (11:31 -0700)]
KVM: VMX: Extract posted interrupt support to separate files

Extract the posted interrupt code so that it can be reused for Trust
Domain Extensions (TDX), which requires posted interrupts and can use
KVM VMX's implementation almost verbatim.  TDX is different enough from
raw VMX that it is highly desirable to implement the guts of TDX in a
separate file, i.e. reusing posted interrupt code by shoving TDX support
into vmx.c would be a mess.

Signed-off-by: Xiaoyao Li <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923183112[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Bail early from final #PF handling on spurious faults
Sean Christopherson [Wed, 23 Sep 2020 22:04:25 +0000 (15:04 -0700)]
KVM: x86/mmu: Bail early from final #PF handling on spurious faults

Detect spurious page faults, e.g. page faults that occur when multiple
vCPUs simultaneously access a not-present page, and skip the SPTE write,
prefetch, and stats update for spurious faults.

Note, the performance benefits of skipping the write and prefetch are
likely negligible, and the false positive stats adjustment is probably
lost in the noise.  The primary motivation is to play nice with TDX's
SEPT in the long term.  SEAMCALLs (to program SEPT entries) are quite
costly, e.g. thousands of cycles, and a spurious SEPT update will result
in a SEAMCALL error (which KVM will ideally treat as fatal).

Reported-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923220425[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed
Sean Christopherson [Wed, 23 Sep 2020 22:04:24 +0000 (15:04 -0700)]
KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed

Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return
values instead of overloading RET_PF_RETRY.  In the short term, the
unique values add clarity to the code and RET_PF_SPURIOUS will be used
by set_spte() to avoid unnecessary work for spurious faults.

In the long term, TDX will use RET_PF_FIXED to deterministically map
memory during pre-boot.  The page fault flow may bail early for benign
reasons, e.g. if the mmu_notifier fires for an unrelated address.  With
only RET_PF_RETRY, it's impossible for the caller to distinguish between
"cool, page is mapped" and "darn, need to try again", and thus cannot
handle benign cases like the mmu_notifier retry.
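
For reference, a plausible shape of the resulting return-code set (a
sketch; the exact ordering and comments are illustrative):

    enum {
        RET_PF_RETRY = 0,   /* let the vCPU retry the faulting instruction */
        RET_PF_EMULATE,     /* the instruction must be emulated */
        RET_PF_INVALID,     /* the SPTE is invalid, replace it and retry */
        RET_PF_FIXED,       /* the fault was fixed by installing a new SPTE */
        RET_PF_SPURIOUS,    /* the desired SPTE was already set */
    };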

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923220425[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Invert RET_PF_* check when falling through to emulation
Sean Christopherson [Wed, 23 Sep 2020 22:04:23 +0000 (15:04 -0700)]
KVM: x86/mmu: Invert RET_PF_* check when falling through to emulation

Explicitly check for RET_PF_EMULATE instead of implicitly doing the same
by checking for !RET_PF_RETRY (RET_PF_INVALID is handled earlier).  This
will adding new RET_PF_ types in future patches without breaking the
emulation path.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923220425[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Return -EIO if page fault returns RET_PF_INVALID
Sean Christopherson [Wed, 23 Sep 2020 22:04:22 +0000 (15:04 -0700)]
KVM: x86/mmu: Return -EIO if page fault returns RET_PF_INVALID

Exit to userspace with an error if the MMU is buggy and returns
RET_PF_INVALID when servicing a page fault.  This will allow a future
patch to invert the emulation path, i.e. emulate only on RET_PF_EMULATE
instead of emulating on anything but RET_PF_RETRY.  This technically
means that KVM will exit to userspace instead of emulating on
RET_PF_INVALID, but practically speaking it's a nop as the MMU never
returns RET_PF_INVALID.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923220425[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent
Ben Gardon [Wed, 23 Sep 2020 22:14:06 +0000 (15:14 -0700)]
KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent

Recursively zap all to-be-orphaned children, unsynced or otherwise, when
zapping a shadow page for a nested TDP MMU.  KVM currently only zaps the
unsynced child pages, but not the synced ones.  This can create problems
over time when running many nested guests because it leaves unlinked
pages which will not be freed until the page quota is hit. With the
default page quota of 20 shadow pages per 1000 guest pages, this looks
like a memory leak and can degrade MMU performance.

In a recent benchmark, substantial performance degradation was observed:
An L1 guest was booted with 64G memory.
2G nested Windows guests were booted, 10 at a time for 20
iterations. (200 total boots)
Windows was used in this benchmark because it touches all of its
memory on startup.
By the end of the benchmark, the nested guests were taking ~10% longer
to boot. With this patch there is no degradation in boot time.
Without this patch the benchmark ends with hundreds of thousands of
stale EPT02 pages cluttering up rmaps and the page hash map. As a
result, VM shutdown is also much slower: deleting memslot 0 was
observed to take over a minute. With this patch it takes just a
few milliseconds.

Cc: Peter Shier <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923221406[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to FNAME(invlpg)
Sean Christopherson [Wed, 23 Sep 2020 22:14:05 +0000 (15:14 -0700)]
KVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to FNAME(invlpg)

Move the logic that controls whether or not FNAME(invlpg) needs to flush
fully into FNAME(invlpg) so that mmu_page_zap_pte() doesn't return a
value.  This allows a future patch to redefine the return semantics for
mmu_page_zap_pte() so that it can recursively zap orphaned child shadow
pages for nested TDP MMUs.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923221406[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: hyper-v: disallow configuring SynIC timers with no SynIC
Vitaly Kuznetsov [Thu, 24 Sep 2020 14:57:52 +0000 (16:57 +0200)]
KVM: x86: hyper-v: disallow configuring SynIC timers with no SynIC

Hyper-V Synthetic timers require SynIC but we don't seem to check that
upon HV_X64_MSR_STIMER[X]_CONFIG/HV_X64_MSR_STIMER0_COUNT writes. Make
the behavior match synic_set_msr().

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200924145757.1035782[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: hyper-v: Mention SynDBG CPUID leaves in api.rst
Vitaly Kuznetsov [Thu, 24 Sep 2020 14:57:51 +0000 (16:57 +0200)]
KVM: x86: hyper-v: Mention SynDBG CPUID leaves in api.rst

We forgot to update KVM_GET_SUPPORTED_HV_CPUID's documentation in api.rst
when SynDBG leaves were added.

While at it, fix a 'KVM_GET_SUPPORTED_CPUID' copy-paste error.

Fixes: f97f5a56f597 ("x86/kvm/hyper-v: Add support for synthetic debugger interface")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200924145757.1035782[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Stash 'kvm' in a local variable in kvm_mmu_free_roots()
Sean Christopherson [Wed, 23 Sep 2020 19:12:04 +0000 (12:12 -0700)]
KVM: x86/mmu: Stash 'kvm' in a local variable in kvm_mmu_free_roots()

To make kvm_mmu_free_roots() a bit more readable, capture 'kvm' in a
local variable instead of doing vcpu->kvm over and over (and over).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923191204[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Add a helper and macros to reduce boilerplate for sec exec ctls
Sean Christopherson [Fri, 25 Sep 2020 00:30:11 +0000 (17:30 -0700)]
KVM: VMX: Add a helper and macros to reduce boilerplate for sec exec ctls

Add a helper function and several wrapping macros to consolidate the
copy-paste code in vmx_compute_secondary_exec_control() for adjusting
controls that are dependent on guest CPUID bits.

No functional change intended.
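
A hedged sketch of the helper/macro pair being described (simplified; the
real helper also keeps the nested VMX MSRs in sync, and the macro name
here is illustrative):

    static void vmx_adjust_secondary_exec_control(struct vcpu_vmx *vmx,
                                                  u32 *exec_control,
                                                  u32 control,
                                                  bool enabled, bool exiting)
    {
        /*
         * "Exiting" controls have inverted polarity: they are set when
         * the feature is *hidden* from the guest.
         */
        if (enabled == exiting)
            *exec_control &= ~control;
        else
            *exec_control |= control;
    }

    #define vmx_adjust_sec_exec_feature(vmx, exec_control, lname, uname)  \
        vmx_adjust_secondary_exec_control(vmx, exec_control,              \
            SECONDARY_EXEC_ENABLE_##uname,                                \
            guest_cpuid_has(&(vmx)->vcpu, X86_FEATURE_##uname), false)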

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200925003011[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Rename RDTSCP secondary exec control name to insert "ENABLE"
Sean Christopherson [Wed, 23 Sep 2020 16:50:47 +0000 (09:50 -0700)]
KVM: VMX: Rename RDTSCP secondary exec control name to insert "ENABLE"

Rename SECONDARY_EXEC_RDTSCP to SECONDARY_EXEC_ENABLE_RDTSCP in
preparation for consolidating the logic for adjusting secondary exec
controls based on the guest CPUID model.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923165048[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Unconditionally clear CPUID.INVPCID if !CPUID.PCID
Sean Christopherson [Wed, 23 Sep 2020 16:50:46 +0000 (09:50 -0700)]
KVM: VMX: Unconditionally clear CPUID.INVPCID if !CPUID.PCID

If PCID is not exposed to the guest, clear INVPCID in the guest's CPUID
even if the VMCS INVPCID enable is not supported.  This will allow
consolidating the secondary execution control adjustment code without
having to special case INVPCID.

Technically, this fixes a bug where !CPUID.PCID && CPUID.INVPCID would
result in unexpected guest behavior (#UD instead of #GP/#PF), but KVM
doesn't support exposing INVPCID if it's not supported in the VMCS, i.e.
such a config is broken/bogus no matter what.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923165048[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Rename vmx_*_supported() helpers to cpu_has_vmx_*()
Sean Christopherson [Wed, 23 Sep 2020 16:50:45 +0000 (09:50 -0700)]
KVM: VMX: Rename vmx_*_supported() helpers to cpu_has_vmx_*()

Rename helpers for a few controls to conform to the more prevalent style
of cpu_has_vmx_<feature>().  Consistent names will allow adding macros
to consolidate the boilerplate code for adjusting secondary execution
controls.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200923165048[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agocpuidle-haltpoll: fix error comments in arch_haltpoll_disable
Li Qiang [Thu, 24 Sep 2020 15:58:00 +0000 (08:58 -0700)]
cpuidle-haltpoll: fix error comments in arch_haltpoll_disable

The 'arch_haltpoll_disable' is used to disable guest halt poll.
Correct the comments.

Fixes: a1c4423b02b21 ("cpuidle-haltpoll: disable host side polling when kvm virtualized")
Signed-off-by: Li Qiang <[email protected]>
Message-Id: <20200924155800[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Use "illegal GPA" helper for PT/RTIT output base check
Sean Christopherson [Thu, 24 Sep 2020 19:42:50 +0000 (12:42 -0700)]
KVM: VMX: Use "illegal GPA" helper for PT/RTIT output base check

Use kvm_vcpu_is_illegal_gpa() to check for a legal GPA when validating a
PT output base instead of open coding a clever, but difficult to read,
variant.  Code readability is far more important than shaving a few uops
in a slow path.

No functional change intended.
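
A sketch of the resulting check (assuming the kvm_vcpu_is_illegal_gpa()
helper from the adjacent patch in this series):

    static bool pt_output_base_valid(struct kvm_vcpu *vcpu, u64 base)
    {
        /* The base must be 128-byte aligned and a legal physical address. */
        return !kvm_vcpu_is_illegal_gpa(vcpu, base) && !(base & 0x7f);
    }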

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200924194250[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: Move illegal GPA helper out of the MMU code
Sean Christopherson [Thu, 24 Sep 2020 19:42:49 +0000 (12:42 -0700)]
KVM: x86: Move illegal GPA helper out of the MMU code

Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it
to cpuid.h so that it's colocated with cpuid_maxphyaddr().  The helper
is not MMU specific and will gain a user that is completely unrelated to
the MMU in a future patch.

No functional change intended.
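
One plausible shape for the moved helper (illustrative; the point is that
it depends only on the precomputed MAXPHYADDR):

    static inline bool kvm_vcpu_is_illegal_gpa(struct kvm_vcpu *vcpu,
                                               gpa_t gpa)
    {
        return gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu));
    }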

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200924194250[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Replace MSR_IA32_RTIT_OUTPUT_BASE_MASK with helper function
Sean Christopherson [Thu, 24 Sep 2020 19:42:48 +0000 (12:42 -0700)]
KVM: VMX: Replace MSR_IA32_RTIT_OUTPUT_BASE_MASK with helper function

Replace the subtly not-a-constant MSR_IA32_RTIT_OUTPUT_BASE_MASK with a
proper helper function to check whether or not the specified base is
valid.  Blindly referencing the local 'vcpu' is especially nasty.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200924194250[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: Unexport cpuid_query_maxphyaddr()
Sean Christopherson [Thu, 24 Sep 2020 19:42:47 +0000 (12:42 -0700)]
KVM: x86: Unexport cpuid_query_maxphyaddr()

Stop exporting cpuid_query_maxphyaddr() now that it's not being abused
by VMX.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200924194250[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Use precomputed MAXPHYADDR for RTIT base MSR check
Sean Christopherson [Thu, 24 Sep 2020 19:42:46 +0000 (12:42 -0700)]
KVM: VMX: Use precomputed MAXPHYADDR for RTIT base MSR check

Use cpuid_maxphyaddr() instead of cpuid_query_maxphyaddr() for the
RTIT base MSR check.  There is no reason to recompute MAXPHYADDR as the
precomputed version is synchronized with CPUID updates, and
MSR_IA32_RTIT_OUTPUT_BASE is not written between stuffing CPUID and
refreshing vcpu->arch.maxphyaddr.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200924194250[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Do not perform emulation for INVD intercept
Tom Lendacky [Thu, 24 Sep 2020 18:41:58 +0000 (13:41 -0500)]
KVM: VMX: Do not perform emulation for INVD intercept

The INVD instruction is emulated as a NOP; just skip the instruction
instead.
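
Roughly (a sketch of the resulting handler):

    static int handle_invd(struct kvm_vcpu *vcpu)
    {
        /* Treat the INVD instruction as a NOP and just skip it. */
        return kvm_skip_emulated_instruction(vcpu);
    }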

Signed-off-by: Tom Lendacky <[email protected]>
Message-Id: <addd41be2fbf50f5f4059e990a2a0cff182d2136.1600972918[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SEV: shorten comments around sev_clflush_pages
Paolo Bonzini [Wed, 23 Sep 2020 17:01:33 +0000 (13:01 -0400)]
KVM: SEV: shorten comments around sev_clflush_pages

Very similar content is present in four comments in sev.c.  Unfortunately
there are small differences that make it harder to place the comment
in sev_clflush_pages itself, but at least we can make it more concise.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Mark SEV launch secret pages as dirty.
Cfir Cohen [Sat, 8 Aug 2020 00:37:46 +0000 (17:37 -0700)]
KVM: SVM: Mark SEV launch secret pages as dirty.

The LAUNCH_SECRET command performs encryption of the launch secret
memory contents. Mark the pinned pages as dirty before unpinning them.
This matches the logic in sev_launch_update_data().

Signed-off-by: Cfir Cohen <[email protected]>
Message-Id: <20200808003746[email protected]>
Reviewed-by: Brijesh Singh <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nVMX: Add VM-Enter failed tracepoints for super early checks
Sean Christopherson [Wed, 12 Aug 2020 18:06:15 +0000 (11:06 -0700)]
KVM: nVMX: Add VM-Enter failed tracepoints for super early checks

Add tracepoints for the early consistency checks in nested_vmx_run().
The "VMLAUNCH vs. VMRESUME" check in particular is useful to trace, as
there is no architectural way to check VMCS.LAUNCH_STATE, and subtle
bugs such as VMCLEAR on the wrong HPA can lead to confusing errors in
the L1 VMM.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200812180615[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nSVM: CR3 MBZ bits are only 63:52
Krish Sadhukhan [Sat, 29 Aug 2020 00:48:22 +0000 (00:48 +0000)]
KVM: nSVM: CR3 MBZ bits are only 63:52

Commit 761e4169346553c180bbd4a383aedd72f905bc9a created a wrong mask for the
CR3 MBZ bits. According to APM vol 2, only the upper 12 bits are MBZ.

Fixes: 761e41693465 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests", 2020-07-08)
Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <20200829004824[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: emulating RDPID failure shall return #UD rather than #GP
Robert Hoo [Fri, 28 Aug 2020 02:23:42 +0000 (10:23 +0800)]
KVM: x86: emulating RDPID failure shall return #UD rather than #GP

Per Intel's SDM, RDPID takes a #UD if it is unsupported, which is more or
less what KVM is emulating when MSR_TSC_AUX is not available.  In fact,
there are no scenarios in which RDPID is supposed to #GP.

Fixes: fb6d4d340e ("KVM: x86: emulate RDPID")
Signed-off-by: Robert Hoo <[email protected]>
Message-Id: <1598581422[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nVMX: Morph notification vector IRQ on nested VM-Enter to pending PI
Sean Christopherson [Wed, 12 Aug 2020 17:51:29 +0000 (10:51 -0700)]
KVM: nVMX: Morph notification vector IRQ on nested VM-Enter to pending PI

On successful nested VM-Enter, check for pending interrupts and convert
the highest priority interrupt to a pending posted interrupt if it
matches L2's notification vector.  If the vCPU receives a notification
interrupt before nested VM-Enter (assuming L1 disables IRQs before doing
VM-Enter), the pending interrupt (for L1) should be recognized and
processed as a posted interrupt when interrupts become unblocked after
VM-Enter to L2.

This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is
trying to inject an interrupt into L2 by setting the appropriate bit in
L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's
method of manually moving the vector from PIR->vIRR/RVI).  KVM will
observe the IPI while the vCPU is in L1 context and so won't immediately
morph it to a posted interrupt for L2.  The pending interrupt will be
seen by vmx_check_nested_events(), cause KVM to force an immediate exit
after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit.
After handling the VM-Exit, L1 will see that L2 has a pending interrupt
in PIR, send another IPI, and repeat until L2 is killed.

Note, posted interrupts require virtual interrupt delivery, and virtual
interrupt delivery requires exit-on-interrupt, ergo interrupts will be
unconditionally unmasked on VM-Enter if posted interrupts are enabled.

Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Cc: [email protected]
Cc: Liran Alon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200812175129[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: MIPS: clean up redundant kvm_run parameters in assembly
Tianjia Zhang [Tue, 23 Jun 2020 13:14:18 +0000 (21:14 +0800)]
KVM: MIPS: clean up redundant kvm_run parameters in assembly

In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang <[email protected]>
Reviewed-by: Huacai Chen <[email protected]>
Tested-by: Jiaxun Yang <[email protected]>
Message-Id: <20200623131418[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Add tracepoint for cr_interception
Haiwei Li [Fri, 4 Sep 2020 11:25:29 +0000 (19:25 +0800)]
KVM: SVM: Add tracepoint for cr_interception

Add trace_kvm_cr_write and trace_kvm_cr_read for svm.

Signed-off-by: Haiwei Li <[email protected]>
Message-Id: <f3031602-db3b-c4fe-b719-d402663b0a2b@gmail.com>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Analyze is_guest_mode() in svm_vcpu_run()
Wanpeng Li [Mon, 14 Sep 2020 06:55:48 +0000 (14:55 +0800)]
KVM: SVM: Analyze is_guest_mode() in svm_vcpu_run()

Analyze is_guest_mode() in svm_vcpu_run() instead of svm_exit_handlers_fastpath()
in conformity with the VMX version.

Suggested-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1600066548[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Invoke NMI handler via indirect call instead of INTn
Sean Christopherson [Tue, 15 Sep 2020 19:15:05 +0000 (12:15 -0700)]
KVM: VMX: Invoke NMI handler via indirect call instead of INTn

Rework NMI VM-Exit handling to invoke the kernel handler by function
call instead of INTn.  INTn microcode is relatively expensive, and
aligning the IRQ and NMI handling will make it easier to update KVM
should some newfangled method for invoking the handlers come along.

Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200915191505[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: VMX: Move IRQ invocation to assembly subroutine
Sean Christopherson [Tue, 15 Sep 2020 19:15:04 +0000 (12:15 -0700)]
KVM: VMX: Move IRQ invocation to assembly subroutine

Move the asm blob that invokes the appropriate IRQ handler after VM-Exit
into a proper subroutine.  Unconditionally create a stack frame in the
subroutine so that, as objtool sees things, the function has standard
stack behavior.  The dynamic stack adjustment makes using unwind hints
problematic.

Suggested-by: Josh Poimboeuf <[email protected]>
Cc: Uros Bizjak <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200915191505[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: Add kvm_x86_ops hook to short circuit emulation
Sean Christopherson [Tue, 15 Sep 2020 23:27:02 +0000 (16:27 -0700)]
KVM: x86: Add kvm_x86_ops hook to short circuit emulation

Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
more generic is_emulatable(), and unconditionally call the new function
in x86_emulate_instruction().

KVM will use the generic hook to support multiple security related
technologies that prevent emulation in one way or another.  Similar to
the existing AMD #NPF case where emulation of the current instruction is
not possible due to lack of information, AMD's SEV-ES and Intel's SGX
and TDX will introduce scenarios where emulation is impossible due to
the guest's register state being inaccessible.  And again similar to the
existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
i.e. outside of the control of vendor-specific code.

While the cause and architecturally visible behavior of the various
cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
resume or complete shutdown, and SEV-ES and TDX "return" an error, the
impact on the common emulation code is identical: KVM must stop
emulation immediately and resume the guest.

Query is_emulatable() in handle_ud() as well so that the
force_emulation_prefix code doesn't incorrectly modify RIP before
calling emulate_instruction() in the absurdly unlikely scenario that
KVM encounters forced emulation in conjunction with "do not emulate".

Cc: Tom Lendacky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200915232702[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: use __GFP_ZERO instead of clear_page()
Haiwei Li [Wed, 16 Sep 2020 08:36:21 +0000 (16:36 +0800)]
KVM: SVM: use __GFP_ZERO instead of clear_page()

Use __GFP_ZERO when calling alloc_page().
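
i.e. (an illustrative before/after, not the exact diff):

    /* Before: allocate a page, then zero it separately. */
    page = alloc_page(GFP_KERNEL);
    clear_page(page_address(page));

    /* After: have the allocator return an already-zeroed page. */
    page = alloc_page(GFP_KERNEL | __GFP_ZERO);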

Signed-off-by: Haiwei Li <[email protected]>
Message-Id: <20200916083621[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02...
Krish Sadhukhan [Mon, 21 Sep 2020 08:10:25 +0000 (08:10 +0000)]
KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it

Currently, prepare_vmcs02_early() does not check if the "unrestricted guest"
VM-execution control in vmcs12 is turned off and leaves the corresponding
bit on in vmcs02. Due to this setting, vmentry checks that are supposed to
render the nested guest state invalid when this VM-execution control is
not set are passing in hardware.

This patch turns off the "unrestricted guest" VM-execution control in vmcs02
if vmcs12 has turned it off.

Suggested-by: Jim Mattson <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <20200921081027[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: fix MSR_IA32_TSC read for nested migration
Maxim Levitsky [Mon, 21 Sep 2020 10:38:05 +0000 (13:38 +0300)]
KVM: x86: fix MSR_IA32_TSC read for nested migration

MSR reads/writes should always access the L1 state, since the (nested)
hypervisor should intercept all the MSRs it wants to adjust, and those
that it doesn't intercept should be read by the guest as if the host had
read them.

However, IA32_TSC is an exception. Even when not intercepted, the guest
still reads the value + TSC offset; the write, however, does not take
any TSC offset into account.

This is documented in Intel's SDM and seems to happen on AMD as well.

This creates a problem when userspace wants to read the IA32_TSC value
and then write it (e.g. for migration): it reads the L2 value, but the
write is interpreted as an L1 value. To fix this, make userspace-initiated
reads of IA32_TSC return the L1 value as well.

Huge thanks to Dave Gilbert for helping me understand this very confusing
semantic of MSR writes.
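
A sketch of the resulting read path in kvm_get_msr_common() (field names
are best-effort, not verbatim):

    case MSR_IA32_TSC: {
        u64 tsc_offset = msr_info->host_initiated ?
                         vcpu->arch.l1_tsc_offset :
                         vcpu->arch.tsc_offset;

        msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
        break;
    }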

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20200921103805[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: use struct_size() and flex_array_size() helpers in kvm_io_bus_unregister_dev()
Rustam Kovhaev [Fri, 18 Sep 2020 12:05:00 +0000 (05:05 -0700)]
KVM: use struct_size() and flex_array_size() helpers in kvm_io_bus_unregister_dev()

Make use of the struct_size() helper to avoid any potential type
mistakes and protect against potential integer overflows.

Make use of the flex_array_size() helper to calculate the size of a
flexible array member within an enclosing structure.
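
An illustrative use based on the kvm_io_bus layout (not the exact diff):

    /* Size the allocation from the enclosing struct plus its flexible
     * 'range' array, instead of open-coded sizeof arithmetic.
     */
    new_bus = kmalloc(struct_size(bus, range, bus->dev_count - 1),
                      GFP_KERNEL);

    /* Copy the flexible array member with a self-documenting size. */
    memcpy(new_bus->range, bus->range, flex_array_size(new_bus, range, i));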

Suggested-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Rustam Kovhaev <[email protected]>
Message-Id: <20200918120500[email protected]>
Reviewed-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Enable INVPCID feature on AMD
Babu Moger [Fri, 11 Sep 2020 19:29:19 +0000 (14:29 -0500)]
KVM: SVM: Enable INVPCID feature on AMD

The following intercept bit has been added to support VMEXIT
for INVPCID instruction:
Code    Name            Cause
A2h     VMEXIT_INVPCID  INVPCID instruction

The following bit has been added to the VMCB layout control area
to control intercept of INVPCID:
Byte Offset     Bit(s)    Function
14h             2         intercept INVPCID

Enable the intercepts when the guest is running with shadow
page tables enabled and handle the TLB flush based on the INVPCID
instruction type.

For guests with nested page table (NPT) support, the INVPCID
feature works as if running natively. KVM does not need to do any
special handling in this case.

AMD documentation for INVPCID feature is available at "AMD64
Architecture Programmer’s Manual Volume 2: System Programming,
Pub. 24593 Rev. 3.34(or later)"

The documentation can be obtained at the links below:
Link: https://www.amd.com/system/files/TechDocs/24593.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985255929.11252.17346684135277453258.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: X86: Move handling of INVPCID types to x86
Babu Moger [Fri, 11 Sep 2020 19:29:12 +0000 (14:29 -0500)]
KVM: X86: Move handling of INVPCID types to x86

INVPCID instruction handling is mostly same across both VMX and
SVM. So, move the code to common x86.c.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c
Babu Moger [Fri, 11 Sep 2020 19:29:05 +0000 (14:29 -0500)]
KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c

Handling of kvm_read/write_guest_virt*() errors can be moved to common
code. The same code can be used by both VMX and SVM.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Remove set_cr_intercept, clr_cr_intercept and is_cr_intercept
Babu Moger [Fri, 11 Sep 2020 19:28:50 +0000 (14:28 -0500)]
KVM: SVM: Remove set_cr_intercept, clr_cr_intercept and is_cr_intercept

Remove set_cr_intercept, clr_cr_intercept and is_cr_intercept. Instead
call the generic svm_set_intercept, svm_clr_intercept and svm_is_intercept
for all cr intercepts.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985253016.11252.16945893859439811480.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Add new intercept word in vmcb_control_area
Babu Moger [Fri, 11 Sep 2020 19:28:35 +0000 (14:28 -0500)]
KVM: SVM: Add new intercept word in vmcb_control_area

The new intercept bits have been added in the vmcb control area to support
a few more intercepts. Here are some of them:
 - INTERCEPT_INVLPGB,
 - INTERCEPT_INVLPGB_ILLEGAL,
 - INTERCEPT_INVPCID,
 - INTERCEPT_MCOMMIT,
 - INTERCEPT_TLBSYNC,

Add a new intercept word in vmcb_control_area to support these instructions.
Also update kvm_nested_vmrun trace function to support the new addition.

AMD documentation for these instructions is available at "AMD64
Architecture Programmer’s Manual Volume 2: System Programming, Pub. 24593
Rev. 3.34(or later)"

The documentation can be obtained at the links below:
Link: https://www.amd.com/system/files/TechDocs/24593.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985251547.11252.16994139329949066945.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Modify 64 bit intercept field to two 32 bit vectors
Babu Moger [Fri, 11 Sep 2020 19:28:28 +0000 (14:28 -0500)]
KVM: SVM: Modify 64 bit intercept field to two 32 bit vectors

Convert all the intercepts to one array of 32 bit vectors in
vmcb_control_area. This makes it easy for future intercept vector
additions. Also update trace functions.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985250813.11252.5736581193881040525.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Modify intercept_exceptions to generic intercepts
Babu Moger [Fri, 11 Sep 2020 19:28:20 +0000 (14:28 -0500)]
KVM: SVM: Modify intercept_exceptions to generic intercepts

Modify intercept_exceptions to generic intercepts in vmcb_control_area. Use
the generic vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept to
set/clear/test the intercept_exceptions bits.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985250037.11252.1361972528657052410.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Change intercept_dr to generic intercepts
Babu Moger [Fri, 11 Sep 2020 19:28:12 +0000 (14:28 -0500)]
KVM: SVM: Change intercept_dr to generic intercepts

Modify intercept_dr to generic intercepts in vmcb_control_area. Use
the generic vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
to set/clear/test the intercept_dr bits.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985249255.11252.10000868032136333355.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Change intercept_cr to generic intercepts
Babu Moger [Fri, 11 Sep 2020 19:28:05 +0000 (14:28 -0500)]
KVM: SVM: Change intercept_cr to generic intercepts

Change intercept_cr to generic intercepts in vmcb_control_area.
Use the new vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
where applicable.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985248506.11252.9081085950784508671.stgit@bmoger-ubuntu>
[Change constant names. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: Introduce vmcb_(set_intercept/clr_intercept/_is_intercept)
Babu Moger [Fri, 11 Sep 2020 19:27:58 +0000 (14:27 -0500)]
KVM: SVM: Introduce vmcb_(set_intercept/clr_intercept/_is_intercept)

This is in preparation for the future intercept vector additions.

Add new functions vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
using the kernel APIs __set_bit, __clear_bit and test_bit respectively.
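
A sketch of the helpers (MAX_INTERCEPT here stands for the number of
32-bit intercept vectors; illustrative, not the verbatim patch):

    static inline void vmcb_set_intercept(struct vmcb_control_area *control,
                                          u32 bit)
    {
        WARN_ON_ONCE(bit >= 32 * MAX_INTERCEPT);
        __set_bit(bit, (unsigned long *)&control->intercepts);
    }

    static inline bool vmcb_is_intercept(struct vmcb_control_area *control,
                                         u32 bit)
    {
        return test_bit(bit, (unsigned long *)&control->intercepts);
    }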

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985247876.11252.16039238014239824460.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nSVM: Remove unused field
Babu Moger [Fri, 11 Sep 2020 19:28:42 +0000 (14:28 -0500)]
KVM: nSVM: Remove unused field

host_intercept_exceptions is not used anywhere. Clean it up.

Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <159985252277.11252.8819848322175521354.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: refactor exit labels in svm_create_vcpu
Maxim Levitsky [Thu, 27 Aug 2020 17:11:42 +0000 (20:11 +0300)]
KVM: SVM: refactor exit labels in svm_create_vcpu

Kernel coding style suggests not using labels like error1, error2.

Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20200827171145[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: use __GFP_ZERO instead of clear_page
Maxim Levitsky [Thu, 27 Aug 2020 17:11:41 +0000 (20:11 +0300)]
KVM: SVM: use __GFP_ZERO instead of clear_page

Another small refactoring.

Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20200827171145[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: refactor msr permission bitmap allocation
Maxim Levitsky [Thu, 27 Aug 2020 17:11:40 +0000 (20:11 +0300)]
KVM: SVM: refactor msr permission bitmap allocation

Replace svm_vcpu_init_msrpm with svm_vcpu_alloc_msrpm, which also allocates
the msr bitmap, and add svm_vcpu_free_msrpm to free it.

This will be used later to move the nested msr permission bitmap allocation
to nested.c

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20200827171145[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nSVM: rename nested vmcb to vmcb12
Maxim Levitsky [Thu, 27 Aug 2020 17:11:39 +0000 (20:11 +0300)]
KVM: nSVM: rename nested vmcb to vmcb12

This is to be more consistent with VMX, and to support the upcoming
addition of vmcb02.

Hopefully no functional changes.

Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20200827171145[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: SVM: rename a variable in the svm_create_vcpu
Maxim Levitsky [Thu, 27 Aug 2020 17:11:38 +0000 (20:11 +0300)]
KVM: SVM: rename a variable in the svm_create_vcpu

The 'page' variable holds the vcpu's vmcb, so name it as such to
avoid confusion.

Signed-off-by: Maxim Levitsky <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <20200827171145[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: LAPIC: Reduce world switch latency caused by timer_advance_ns
Wanpeng Li [Thu, 10 Sep 2020 09:50:41 +0000 (17:50 +0800)]
KVM: LAPIC: Reduce world switch latency caused by timer_advance_ns

All the checks in lapic_timer_int_injected() and __kvm_wait_lapic_expire(),
and the function calls themselves, waste CPU cycles when the timer mode is
not tscdeadline. We can observe ~1.3% world switch time overhead with
kvm-unit-tests/vmexit.flat vmcall testing on an AMD server. This patch
reduces the world switch latency caused by the timer_advance_ns feature
when the timer mode is not tscdeadline by simply moving the check against
apic->lapic_timer.expired_tscdeadline much earlier.
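
A sketch of the reordered gate (cheap field checks first, so
non-tscdeadline timers bail before any function call; illustrative):

    void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
    {
        if (lapic_in_kernel(vcpu) &&
            vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
            vcpu->arch.apic->lapic_timer.timer_advance_ns &&
            lapic_timer_int_injected(vcpu))
                __kvm_wait_lapic_expire(vcpu);
    }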

Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1599731444[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: LAPIC: Narrow down the kick target vCPU
Wanpeng Li [Thu, 10 Sep 2020 09:50:40 +0000 (17:50 +0800)]
KVM: LAPIC: Narrow down the kick target vCPU

The kick after setting KVM_REQ_PENDING_TIMER is used to handle the case
where the timer fires on a different pCPU than the one the vCPU is
running on.  This kick costs about
1000 clock cycles and we don't need this when injecting already-expired
timer or when using the VMX preemption timer because
kvm_lapic_expired_hv_timer() is called from the target vCPU.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1599731444[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: LAPIC: Guarantee the timer is in tsc-deadline mode when setting
Wanpeng Li [Thu, 10 Sep 2020 09:50:37 +0000 (17:50 +0800)]
KVM: LAPIC: Guarantee the timer is in tsc-deadline mode when setting

Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot()
and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when
wrmsr MSR_IA32_TSCDEADLINE.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1599731444[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: LAPIC: Return 0 when getting the tscdeadline timer if the lapic is hw disabled
Wanpeng Li [Thu, 10 Sep 2020 09:50:36 +0000 (17:50 +0800)]
KVM: LAPIC: Return 0 when getting the tscdeadline timer if the lapic is hw disabled

Return 0 when getting the tscdeadline timer if the lapic is hw disabled.

Suggested-by: Paolo Bonzini <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1599731444[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: LAPIC: Fix updating DFR missing apic map recalculation
Wanpeng Li [Wed, 19 Aug 2020 08:55:26 +0000 (16:55 +0800)]
KVM: LAPIC: Fix updating DFR missing apic map recalculation

The apic map recalculation is missing after updating DFR during INIT
RESET, in x2apic mode, when the local apic was software enabled before.
Fix this by introducing the function kvm_apic_set_dfr() and calling it
in the INIT RESET handling path.
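
A plausible shape for the new helper (a sketch; the dirty-map mechanism
follows the existing lapic.c pattern):

    static void kvm_apic_set_dfr(struct kvm_lapic *apic, u32 val)
    {
        kvm_lapic_set_reg(apic, APIC_DFR, val);
        /* DFR affects logical addressing, so the apic map must be rebuilt. */
        atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
    }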

Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1597827327[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agokvm/eventfd: move wildcard calculation outside loop
Yi Li [Fri, 11 Sep 2020 05:56:52 +0000 (13:56 +0800)]
kvm/eventfd: move wildcard calculation outside loop

There is no need to calculate wildcard in each iteration
since wildcard is not changed.
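
i.e. hoist the loop-invariant computation out of the list walk (a sketch):

    /* args does not change across iterations, so compute wildcard once. */
    bool wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);

    list_for_each_entry_safe(p, tmp, &kvm->ioeventfds, list) {
        /* ... match p against args using 'wildcard' ... */
    }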

Signed-off-by: Yi Li <[email protected]>
Message-Id: <20200911055652.3041762[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nVMX: Simplify the initialization of nested_vmx_msrs
Chenyi Qiang [Fri, 28 Aug 2020 08:56:22 +0000 (16:56 +0800)]
KVM: nVMX: Simplify the initialization of nested_vmx_msrs

The nested VMX controls MSRs can be initialized by the global capability
values stored in vmcs_config.

Signed-off-by: Chenyi Qiang <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Message-Id: <20200828085622[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled
Chenyi Qiang [Fri, 28 Aug 2020 08:56:18 +0000 (16:56 +0800)]
KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled

KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the
system ioctl KVM_GET_MSR.  Add them to the setup of nested VMX controls MSR.

Signed-off-by: Chenyi Qiang <[email protected]>
Message-Id: <20200828085622[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: nSVM: Avoid freeing uninitialized pointers in svm_set_nested_state()
Vitaly Kuznetsov [Mon, 14 Sep 2020 13:37:25 +0000 (15:37 +0200)]
KVM: nSVM: Avoid freeing uninitialized pointers in svm_set_nested_state()

The save and ctl pointers are passed uninitialized to kfree() when
svm_set_nested_state() follows the 'goto out_set_gif' path. While the
issue could've been fixed by initializing these on-stack variables to
NULL, it seems preferable to eliminate 'out_set_gif' label completely as
it is not actually a failure path and duplicating a single svm_set_gif()
call doesn't look too bad.

 [ bp: Drop obscure Addresses-Coverity: tag. ]

Fixes: 6ccbd29ade0d ("KVM: SVM: nested: Don't allocate VMCB structures on stack")
Reported-by: Dan Carpenter <[email protected]>
Reported-by: Joerg Roedel <[email protected]>
Reported-by: Colin King <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Acked-by: Joerg Roedel <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoMerge tag 'kvm-ppc-next-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Tue, 22 Sep 2020 12:18:16 +0000 (08:18 -0400)]
Merge tag 'kvm-ppc-next-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

PPC KVM update for 5.10

- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes

4 years agoMerge branch 'x86-seves-for-paolo' of https://git.kernel.org/pub/scm/linux/kernel...
Paolo Bonzini [Tue, 22 Sep 2020 10:43:17 +0000 (06:43 -0400)]
Merge branch 'x86-seves-for-paolo' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD

4 years agoKVM: PPC: Book3S: Fix symbol undeclared warnings
Wang Wensheng [Mon, 21 Sep 2020 11:22:11 +0000 (11:22 +0000)]
KVM: PPC: Book3S: Fix symbol undeclared warnings

Build the kernel with `C=2`:
arch/powerpc/kvm/book3s_hv_nested.c:572:25: warning: symbol
'kvmhv_alloc_nested' was not declared. Should it be static?
arch/powerpc/kvm/book3s_64_mmu_radix.c:350:6: warning: symbol
'kvmppc_radix_set_pte_at' was not declared. Should it be static?
arch/powerpc/kvm/book3s_hv.c:3568:5: warning: symbol
'kvmhv_p9_guest_entry' was not declared. Should it be static?
arch/powerpc/kvm/book3s_hv_rm_xics.c:767:15: warning: symbol 'eoi_rc'
was not declared. Should it be static?
arch/powerpc/kvm/book3s_64_vio_hv.c:240:13: warning: symbol
'iommu_tce_kill_rm' was not declared. Should it be static?
arch/powerpc/kvm/book3s_64_vio.c:492:6: warning: symbol
'kvmppc_tce_iommu_do_map' was not declared. Should it be static?
arch/powerpc/kvm/book3s_pr.c:572:6: warning: symbol 'kvmppc_set_pvr_pr'
was not declared. Should it be static?

Those symbols are used only in the files that define them so make them
static to fix the warnings.

Signed-off-by: Wang Wensheng <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
4 years agoKVM: PPC: Book3S: Remove redundant initialization of variable ret
Jing Xiangfeng [Sat, 19 Sep 2020 07:12:30 +0000 (15:12 +0800)]
KVM: PPC: Book3S: Remove redundant initialization of variable ret

The variable ret is being initialized with '-ENOMEM', which is
meaningless. So remove it.

Signed-off-by: Jing Xiangfeng <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
4 years agoKVM: PPC: Book3S HV: XIVE: Convert to DEFINE_SHOW_ATTRIBUTE
Qinglang Miao [Sat, 19 Sep 2020 01:29:25 +0000 (09:29 +0800)]
KVM: PPC: Book3S HV: XIVE: Convert to DEFINE_SHOW_ATTRIBUTE

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
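
For context, the macro collapses the usual seq_file open/fops boilerplate;
a generic usage example ('foo' is hypothetical):

    static int foo_show(struct seq_file *m, void *private)
    {
        seq_puts(m, "hello\n");
        return 0;
    }
    DEFINE_SHOW_ATTRIBUTE(foo);    /* generates foo_open() and foo_fops */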

Signed-off-by: Qinglang Miao <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
4 years agoarm64: Run ARCH_WORKAROUND_1 enabling code on all CPUs
Marc Zyngier [Thu, 16 Jul 2020 16:11:09 +0000 (17:11 +0100)]
arm64: Run ARCH_WORKAROUND_1 enabling code on all CPUs

Commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack
thereof") changed the way we deal with ARCH_WORKAROUND_1, by moving most
of the enabling code to the .matches() callback.

This has the unfortunate effect that the workaround gets only enabled on
the first affected CPU, and no other.

In order to address this, forcefully call the .matches() callback from a
.cpu_enable() callback, which brings us back to the original behaviour.

Fixes: 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof")
Cc: <[email protected]>
Reviewed-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
4 years agoarm64: Make use of ARCH_WORKAROUND_1 even when KVM is not enabled
Marc Zyngier [Thu, 16 Jul 2020 16:11:08 +0000 (17:11 +0100)]
arm64: Make use of ARCH_WORKAROUND_1 even when KVM is not enabled

We seem to be pretending that we don't have any firmware mitigation
when KVM is not compiled in, which is not quite expected.

Bring back the mitigation in this case.

Fixes: 4db61fef16a1 ("arm64: kvm: Modernize __smccc_workaround_1_smc_start annotations")
Cc: <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
4 years agoMerge tag 'kvm-s390-master-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Sun, 20 Sep 2020 21:31:15 +0000 (17:31 -0400)]
Merge tag 'kvm-s390-master-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master

KVM: s390: add documentation for KVM_CAP_S390_DIAG318

diag318 code was merged in 5.9-rc1, let us add some
missing documentation

4 years agoMerge tag 'kvmarm-fixes-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmar...
Paolo Bonzini [Sun, 20 Sep 2020 21:31:07 +0000 (17:31 -0400)]
Merge tag 'kvmarm-fixes-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm64 fixes for 5.9, take #2

- Fix handling of S1 Page Table Walk permission fault at S2
  on instruction fetch
- Cleanup kvm_vcpu_dabt_iswrite()

4 years agoRevert "KVM: Check the allocation of pv cpu mask"
Vitaly Kuznetsov [Sun, 20 Sep 2020 13:10:12 +0000 (09:10 -0400)]
Revert "KVM: Check the allocation of pv cpu mask"

The commit 0f990222108d ("KVM: Check the allocation of pv cpu mask") we
have in 5.9-rc5 has two issues:
1) Compilation fails for !CONFIG_SMP, see:
   https://bugzilla.kernel.org/show_bug.cgi?id=209285

2) This commit completely disables PV TLB flush, see
   https://lore.kernel.org/kvm/[email protected]/

The allocation problem is likely a theoretical one: if we don't
have memory that early in the boot process, we're likely doomed anyway.
Let's solve it properly later.

This reverts commit 0f990222108d214a0924d920e6095b58107d7b59.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: arm64: Remove S1PTW check from kvm_vcpu_dabt_iswrite()
Marc Zyngier [Tue, 15 Sep 2020 10:42:18 +0000 (11:42 +0100)]
KVM: arm64: Remove S1PTW check from kvm_vcpu_dabt_iswrite()

Now that kvm_vcpu_trap_is_write_fault() checks for S1PTW, there
is no need for kvm_vcpu_dabt_iswrite() to do the same thing, as
we already check for this condition on all existing paths.

Drop the check and add a comment instead.

Signed-off-by: Marc Zyngier <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch
Marc Zyngier [Tue, 15 Sep 2020 10:42:17 +0000 (11:42 +0100)]
KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch

KVM currently assumes that an instruction abort can never be a write.
This is in general true, except when the abort is triggered by
a S1PTW on instruction fetch that tries to update the S1 page tables
(to set AF, for example).

This can happen if the page tables have been paged out and brought
back in without seeing a direct write to them (they are thus marked
read only), and the fault handling code will make the PT executable(!)
instead of writable. The guest gets stuck forever.

In these conditions, the permission fault must be considered as
a write so that the Stage-1 update can take place. This is essentially
the I-side equivalent of the problem fixed by 60e21a0ef54c ("arm64: KVM:
Take S1 walks into account when determining S2 write faults").

Update kvm_is_write_fault() to return true on IABT+S1PTW, and introduce
kvm_vcpu_trap_is_exec_fault() that only returns true when not faulting
on an S1 page table walk. Additionally, kvm_vcpu_dabt_iss1tw() is renamed to
kvm_vcpu_abt_iss1tw(), as the above makes it plain that it isn't
specific to data abort.

Signed-off-by: Marc Zyngier <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
4 years agoMerge branch 'kvm-arm64/misc-5.10' into kvmarm-master/next
Marc Zyngier [Fri, 18 Sep 2020 15:22:28 +0000 (16:22 +0100)]
Merge branch 'kvm-arm64/misc-5.10' into kvmarm-master/next

Signed-off-by: Marc Zyngier <[email protected]>
4 years agoMerge branch 'kvm-arm64/pt-new' into kvmarm-master/next
Marc Zyngier [Fri, 18 Sep 2020 15:22:18 +0000 (16:22 +0100)]
Merge branch 'kvm-arm64/pt-new' into kvmarm-master/next

Signed-off-by: Marc Zyngier <[email protected]>
# Conflicts:
# arch/arm64/kvm/mmu.c

4 years agoKVM: arm64: Fix doc warnings in mmu code
Xiaofei Tan [Thu, 17 Sep 2020 01:47:49 +0000 (09:47 +0800)]
KVM: arm64: Fix doc warnings in mmu code

Fix the following warnings caused by mismatches between function parameters
and comments.
arch/arm64/kvm/mmu.c:128: warning: Function parameter or member 'mmu' not described in '__unmap_stage2_range'
arch/arm64/kvm/mmu.c:128: warning: Function parameter or member 'may_block' not described in '__unmap_stage2_range'
arch/arm64/kvm/mmu.c:128: warning: Excess function parameter 'kvm' description in '__unmap_stage2_range'
arch/arm64/kvm/mmu.c:499: warning: Function parameter or member 'writable' not described in 'kvm_phys_addr_ioremap'
arch/arm64/kvm/mmu.c:538: warning: Function parameter or member 'mmu' not described in 'stage2_wp_range'
arch/arm64/kvm/mmu.c:538: warning: Excess function parameter 'kvm' description in 'stage2_wp_range'

Signed-off-by: Xiaofei Tan <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Do not flush memslot if FWB is supported
Alexandru Elisei [Tue, 15 Sep 2020 17:04:42 +0000 (18:04 +0100)]
KVM: arm64: Do not flush memslot if FWB is supported

As a result of a KVM_SET_USER_MEMORY_REGION ioctl, KVM flushes the
dcache for the memslot being changed to ensure a consistent view of memory
between the host and the guest: the host runs with caches enabled, and
it is possible for the data written by the hypervisor to still be in the
caches, but the guest is running with stage 1 disabled, meaning data
accesses are to Device-nGnRnE memory, bypassing the caches entirely.

Flushing the dcache is not necessary when KVM enables FWB, because it
forces the guest to use cacheable memory accesses.

The current behaviour does not change, as the dcache flush helpers execute
the cache operation only if FWB is not enabled, but walking the stage 2
table is avoided.

Signed-off-by: Alexandru Elisei <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: vgic-debug: Convert to use DEFINE_SEQ_ATTRIBUTE macro
Liu Shixin [Wed, 16 Sep 2020 02:50:23 +0000 (10:50 +0800)]
KVM: arm64: vgic-debug: Convert to use DEFINE_SEQ_ATTRIBUTE macro

Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Signed-off-by: Liu Shixin <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Fix inject_fault.c kernel-doc warnings
Tian Tao [Tue, 15 Sep 2020 07:21:52 +0000 (15:21 +0800)]
KVM: arm64: Fix inject_fault.c kernel-doc warnings

Fix kernel-doc warnings.
arch/arm64/kvm/inject_fault.c:210: warning: Function parameter or member
'vcpu' not described in 'kvm_inject_undefined'

Signed-off-by: Tian Tao <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Try PMD block mappings if PUD mappings are not supported
Alexandru Elisei [Thu, 10 Sep 2020 13:33:51 +0000 (14:33 +0100)]
KVM: arm64: Try PMD block mappings if PUD mappings are not supported

When userspace uses hugetlbfs for the VM memory, user_mem_abort() tries to
use the same block size to map the faulting IPA in stage 2. If stage 2
cannot use the same block mapping because the block size doesn't fit in the
memslot or the memslot is not properly aligned, user_mem_abort() will fall
back to a page mapping, regardless of the block size. We can do better for
PUD backed hugetlbfs by checking if a PMD block mapping is supported before
deciding to use a page.

vma_pagesize is an unsigned long; use 1UL instead of 1ULL when assigning
its value.

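A sketch of the resulting fallback order, stepping down one block size at a
time instead of jumping straight to pages
(fault_supports_stage2_huge_mapping() is the existing suitability check;
the switch shape is illustrative):

  switch (vma_shift) {
  case PUD_SHIFT:
          if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
                  break;          /* PUD block mapping fits, keep it */
          fallthrough;
  case PMD_SHIFT:
          if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
                  vma_shift = PMD_SHIFT;  /* try the smaller block first */
          else
                  vma_shift = PAGE_SHIFT; /* only now fall back to a page */
          break;
  }
  vma_pagesize = 1UL << vma_shift;
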
Signed-off-by: Alexandru Elisei <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: PPC: Book3S HV: Set LPCR[HDICE] before writing HDEC
Paul Mackerras [Thu, 3 Sep 2020 07:55:40 +0000 (17:55 +1000)]
KVM: PPC: Book3S HV: Set LPCR[HDICE] before writing HDEC

POWER8 and POWER9 machines have a hardware deviation where generation
of a hypervisor decrementer exception is suppressed if the HDICE bit
in the LPCR register is 0 at the time when the HDEC register
decrements from 0 to -1.  When entering a guest, KVM first writes the
HDEC register with the time until it wants the CPU to exit the guest,
and then writes the LPCR with the guest value, which includes
HDICE = 1.  If HDEC decrements from 0 to -1 during the interval
between those two events, it is possible that we can enter the guest
with HDEC already negative but no HDEC exception pending, meaning that
no HDEC interrupt will occur while the CPU is in the guest, or at
least not until HDEC wraps around.  Thus it is possible for the CPU to
keep executing in the guest for a long time; up to about 4 seconds on
POWER8, or about 4.46 years on POWER9 (except that the host kernel
hard lockup detector will fire first).

To fix this, we set the LPCR[HDICE] bit before writing HDEC on guest
entry.

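Expressed as a sketch of the entry sequence (the real change touches both
the POWER8 assembly and the POWER9 C entry paths; variable names are
illustrative):

  /* Enable HDICE first so a 0 -> -1 HDEC transition can latch an
   * exception before the CPU enters the guest. */
  mtspr(SPRN_LPCR, guest_lpcr);   /* guest LPCR has LPCR_HDICE set */
  isync();
  mtspr(SPRN_HDEC, hdec);         /* time until the wanted guest exit */
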
Signed-off-by: Paul Mackerras <[email protected]>
4 years agoKVM: PPC: Book3S HV: Do not allocate HPT for a nested guest
Fabiano Rosas [Fri, 11 Sep 2020 04:16:07 +0000 (01:16 -0300)]
KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

The current nested KVM code does not support HPT guests. This is
informed/enforced in several ways:

- Hosts < P9 will not be able to enable the nested HV feature;

- The nested hypervisor MMU capabilities will not contain
  KVM_CAP_PPC_MMU_HASH_V3;

- QEMU reflects the MMU capabilities in the
  'ibm,arch-vec-5-platform-support' device-tree property;

- The nested guest, at 'prom_parse_mmu_model', ignores the
  'disable_radix' kernel command line option if HPT is not supported;

- The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.

There is, however, still a way to start an HPT guest: using
max-cpu-compat=power8 in the QEMU machine options. This leads to the
guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
ioctl.

With the guest set to hash, the nested hypervisor goes through the
entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
crashes when it tries to execute a hypervisor-privileged (mtspr
HDEC) instruction at __kvmppc_vcore_entry:

root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...

<snip>
[  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
[  538.543355] NIP:  c00800000753f388 LR: c00800000753f368 CTR: c0000000001e5ec0
[  538.543417] REGS: c0000013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
[  538.543470] MSR:  8000000002843033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 22422882  XER: 20040000
[  538.543546] CFAR: c00800000753f4b0 IRQMASK: 3
               GPR00: c0080000075397a0 c0000013e91e3640 c00800000755e600 0000000080000000
               GPR04: 0000000000000000 c0000013eab19800 c000001394de0000 00000043a054db72
               GPR08: 00000000003b1652 0000000000000000 0000000000000000 c0080000075502e0
               GPR12: c0000000001e5ec0 c0000007ffa74200 c0000013eab19800 0000000000000008
               GPR16: 0000000000000000 c00000139676c6c0 c000000001d23948 c0000013e91e38b8
               GPR20: 0000000000000053 0000000000000000 0000000000000001 0000000000000000
               GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000001
               GPR28: 0000000000000001 0000000000000053 c0000013eab19800 0000000000000001
[  538.544067] NIP [c00800000753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
[  538.544121] LR [c00800000753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
[  538.544173] Call Trace:
[  538.544196] [c0000013e91e3640] [c0000013e91e3680] 0xc0000013e91e3680 (unreliable)
[  538.544260] [c0000013e91e3820] [c0080000075397a0] kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
[  538.544325] [c0000013e91e39e0] [c00800000753d99c] kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
[  538.544394] [c0000013e91e3ad0] [c0080000072da4fc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[  538.544472] [c0000013e91e3af0] [c0080000072d61b8] kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
[  538.544539] [c0000013e91e3b80] [c0080000072c7450] kvm_vcpu_ioctl+0x298/0x778 [kvm]
[  538.544605] [c0000013e91e3ce0] [c0000000004b8c2c] sys_ioctl+0x1dc/0xc90
[  538.544662] [c0000013e91e3dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0
[  538.544726] [c0000013e91e3e20] [c00000000000d140] system_call_common+0xf0/0x27c
[  538.544787] Instruction dump:
[  538.544821] f86d1098 60000000 60000000 48000099 e8ad0fe8 e8c500a0 e9264140 75290002
[  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 f90d10a0 480104fd
[  538.544953] ---[ end trace 74423e2b948c2e0c ]---

This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
the nested hypervisor, causing QEMU to abort.

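A minimal sketch of the check, assuming the existing kvmhv_on_pseries()
helper is the nesting test (the fragment below sits in the
KVM_PPC_ALLOCATE_HTAB ioctl path):

  if (kvmhv_on_pseries()) {       /* we are ourselves a nested hypervisor */
          r = -EOPNOTSUPP;        /* no HPT support, QEMU aborts cleanly */
          break;
  }
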
Reported-by: Satheesh Rajendran <[email protected]>
Signed-off-by: Fabiano Rosas <[email protected]>
Reviewed-by: Greg Kurz <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
4 years agoKVM: PPC: Don't return -ENOTSUPP to userspace in ioctls
Greg Kurz [Fri, 11 Sep 2020 10:53:45 +0000 (12:53 +0200)]
KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

ENOTSUPP is a Linux-only errno whose value is unknown to userspace. It is
not to be confused with ENOTSUP, which Linux maps to EOPNOTSUPP, as
permitted by POSIX [1]:

[EOPNOTSUPP]
Operation not supported on socket. The type of socket (address family
or protocol) does not support the requested operation. A conforming
implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].

Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
- KVM_GET_FPU for Book3s and BookE
- KVM_SET_FPU for Book3s and BookE
- KVM_GET_DIRTY_LOG for BookE

This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
KVM_SET_FPU ioctls on POWER anyway since they are not supported,
and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

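The change itself is a one-line errno substitution per ioctl; as a sketch,
for the Book3s FPU case:

  int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
  {
          /* was -ENOTSUPP, whose value userspace cannot interpret */
          return -EOPNOTSUPP;
  }
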
Signed-off-by: Greg Kurz <[email protected]>
Acked-by: Thadeu Lima de Souza Cascardo <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
4 years agoMerge branch 'kvm-arm64/nvhe-hyp-context' into kvmarm-master/next
Marc Zyngier [Wed, 16 Sep 2020 09:59:17 +0000 (10:59 +0100)]
Merge branch 'kvm-arm64/nvhe-hyp-context' into kvmarm-master/next

Signed-off-by: Marc Zyngier <[email protected]>
4 years agoKVM: arm64: nVHE: Fix pointers during SMCCC conversion
Andrew Scull [Tue, 15 Sep 2020 10:46:43 +0000 (11:46 +0100)]
KVM: arm64: nVHE: Fix pointers during SMCCC conversion

The host need not concern itself with the pointer differences for the
hyp interfaces that are shared between VHE and nVHE, so leave it to the
hyp to handle.

As the SMCCC function IDs are converted into function calls, it is a
suitable place to also convert any pointer arguments into hyp pointers.
This, additionally, eases the reuse of the handlers in different
contexts.

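A sketch of the pattern in the nVHE host call handler, with the pointer
fixed up via kern_hyp_va() as the SMCCC ID is decoded (the handler shape is
illustrative):

  case KVM_HOST_SMCCC_FUNC(__kvm_tlb_flush_vmid): {
          struct kvm_s2_mmu *mmu =
                  (struct kvm_s2_mmu *)host_ctxt->regs.regs[1];

          /* hyp, not the host, converts the pointer to a hyp VA */
          __kvm_tlb_flush_vmid(kern_hyp_va(mmu));
          break;
  }
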
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: nVHE: Migrate hyp-init to SMCCC
Andrew Scull [Tue, 15 Sep 2020 10:46:42 +0000 (11:46 +0100)]
KVM: arm64: nVHE: Migrate hyp-init to SMCCC

To complete the transition to SMCCC, the hyp initialization is given a
function ID. This looks neater than comparing the hyp stub function IDs
to the page table physical address.

Some care is taken to only clobber x0-3 before the host context is saved
as only those registers can be clobbered according to SMCCC. Fortunately,
only a few acrobatics are needed. The possible new tpidr_el2 is moved to
the argument in x2 so that it can be stashed in tpidr_el2 early to free
up a scratch register. The page table configuration then makes use of
x0-2.

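From the host's side, hyp initialization then looks like any other SMCCC
call rather than a magic physical-address compare (a sketch following the
series' conventions; argument order per the commit, with tpidr_el2 in x2):

  struct arm_smccc_res res;

  arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init),
                    pgd_ptr, tpidr_el2, hyp_stack_ptr, vector_ptr, &res);
  WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
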
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: nVHE: Migrate hyp interface to SMCCC
Andrew Scull [Tue, 15 Sep 2020 10:46:41 +0000 (11:46 +0100)]
KVM: arm64: nVHE: Migrate hyp interface to SMCCC

Rather than passing arbitrary function pointers to run at hyp, define
an equivalent set of SMCCC functions.

Since the SMCCC functions are strongly tied to the original function
prototypes, the host is not expected to ever call an invalid ID, but a
warning is raised if this does ever occur.

As __kvm_vcpu_run is used for every switch between the host and a guest,
it is explicitly singled out to be identified before the other function
IDs to improve the performance of the hot path.

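A sketch of the dispatch with the hot path singled out (an illustrative
shape for the hyp-side handler, not its exact text):

  /* Test the hottest ID before falling through to the others. */
  if (id == KVM_HOST_SMCCC_FUNC(__kvm_vcpu_run)) {
          struct kvm_vcpu *vcpu =
                  (struct kvm_vcpu *)host_ctxt->regs.regs[1];

          ret = __kvm_vcpu_run(kern_hyp_va(vcpu));
  } else {
          switch (id) {
          /* ... the remaining, colder hyp interface functions ... */
          default:
                  /* invalid ID: flag it so the host side can warn */
                  host_ctxt->regs.regs[0] = SMCCC_RET_NOT_SUPPORTED;
                  return;
          }
  }
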
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agosmccc: Use separate variables for args and results
Andrew Scull [Tue, 15 Sep 2020 10:46:40 +0000 (11:46 +0100)]
smccc: Use separate variables for args and results

Using the same register-bound variable for both arguments and results
means these values share a type. That type must allow the arguments to
be assigned to it and must also be assignable to the unsigned long
fields of struct arm_smccc_res.

This restriction on types causes compiler warnings when the argument
cannot be implicitly assigned to an unsigned long, for example the
pointers that are used in the KVM hyp interface.

By separating the arguments and results into their own variables, the
type constraint is lifted, allowing the arguments to avoid the need for
any type conversion.

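The idea in miniature (a simplified sketch, not the kernel's actual macro
text, which also binds x0-x7 and the function ID):

  register typeof(arg) arg1 asm("x1") = arg;  /* keeps its own type */
  register unsigned long res1 asm("x1");      /* result: unsigned long */

  asm volatile("hvc #0"
               : "=r" (res1)
               : "r" (arg1)
               : "memory");
  res->a1 = res1;   /* assigns cleanly regardless of arg's type */
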
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Cc: Sudeep Holla <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agosmccc: Define vendor hyp owned service call region
Andrew Scull [Tue, 15 Sep 2020 10:46:39 +0000 (11:46 +0100)]
smccc: Define vendor hyp owned service call region

Vendor specific hypervisor services have their own region of function
identifiers reserved by SMCCC. Extend the list of owners to include this
case.

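The owner lives in bits [29:24] of an SMCCC function ID; a sketch of the
corresponding define alongside its existing neighbour:

  /* Owning entity numbers, per the SMC Calling Convention. */
  #define ARM_SMCCC_OWNER_STANDARD_HYP    5
  #define ARM_SMCCC_OWNER_VENDOR_HYP      6   /* new: vendor hyp services */
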
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Cc: Sudeep Holla <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: nVHE: Pass pointers consistently to hyp-init
Andrew Scull [Tue, 15 Sep 2020 10:46:38 +0000 (11:46 +0100)]
KVM: arm64: nVHE: Pass pointers consistently to hyp-init

Rather than some being kernel pointers and others being hyp pointers,
standardize on all pointers being hyp pointers.

Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: nVHE: Handle hyp panics
Andrew Scull [Tue, 15 Sep 2020 10:46:37 +0000 (11:46 +0100)]
KVM: arm64: nVHE: Handle hyp panics

Restore the host context when panicking from hyp to give the best chance
of the panic being clean.

The host requires that registers be preserved such as x18 for the shadow
callstack. If the panic is caused by an exception from EL1, the host
context is still valid so the panic can return straight back to the
host. If the panic comes from EL2 then it's most likely that the hyp
context is active and the host context needs to be restored.

There are windows, before the host context is saved and after it is
restored, in which restoration would be attempted incorrectly and the
panic won't be clean.

Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: nVHE: Switch to hyp context for EL2
Andrew Scull [Tue, 15 Sep 2020 10:46:36 +0000 (11:46 +0100)]
KVM: arm64: nVHE: Switch to hyp context for EL2

Save and restore the host context when switching to and from hyp. This
gives hyp its own context that the host will not see as a step towards a
full trust boundary between the two.

SP_EL0 and pointer authentication keys are currently shared between the
host and hyp so don't need to be switched yet.

Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Share context save and restore macros
Andrew Scull [Tue, 15 Sep 2020 10:46:35 +0000 (11:46 +0100)]
KVM: arm64: Share context save and restore macros

To avoid duplicating the context save and restore macros, move them into
a shareable header.

Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Restore hyp when panicking in guest context
Andrew Scull [Tue, 15 Sep 2020 10:46:34 +0000 (11:46 +0100)]
KVM: arm64: Restore hyp when panicking in guest context

If the guest context is loaded when a panic is triggered, restore the
hyp context so e.g. the shadow call stack works when hyp_panic() is
called and SP_EL0 is valid when the host's panic() is called.

Use the hyp context's __hyp_running_vcpu field to track when hyp
transitions to and from the guest vcpu so the exception handlers know
whether the context needs to be restored.

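As an illustrative sketch only (the helper name is hypothetical; the real
logic lives in the exception vectors):

  /* __hyp_running_vcpu is non-NULL only between guest enter and exit. */
  if (host_ctxt->__hyp_running_vcpu) {
          /* The exception came from the guest: put hyp's own context
           * back so the shadow call stack and SP_EL0 are usable. */
          restore_hyp_context();          /* hypothetical helper */
  }
  hyp_panic();
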
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Update context references from host to hyp
Andrew Scull [Tue, 15 Sep 2020 10:46:33 +0000 (11:46 +0100)]
KVM: arm64: Update context references from host to hyp

Hyp now has its own nominal context for saving and restoring its state
when switching to and from a guest. Update the related comments and
utilities to match the new name.

Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoKVM: arm64: Introduce hyp context
Andrew Scull [Tue, 15 Sep 2020 10:46:32 +0000 (11:46 +0100)]
KVM: arm64: Introduce hyp context

During __guest_enter, save and restore from a new hyp context rather
than the host context. This is preparation for separation of the hyp and
host context in nVHE.

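A sketch of the new state (the per-CPU variable name follows the series;
the usage note is illustrative):

  /* hyp's own register context, distinct from the host's. */
  DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);

  /* __guest_enter() now saves and restores EL2 state against
   * this_cpu_ptr(&kvm_hyp_ctxt) instead of the host context. */
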
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]