Git Repo - linux.git/log

x86/kvm/svm: Add hardirq tracing on guest enter/exit

Entering guest mode is more or less the same as returning to user
space. From an instrumentation point of view both leave kernel mode and the
transition to guest or user mode reenables interrupts on the host. In user
mode an interrupt is served directly and in guest mode it causes a VM exit
which then handles or reinjects the interrupt.

The transition from guest mode or user mode to kernel mode disables
interrupts, which needs to be recorded in instrumentation to set the
correct state again.

This is important for e.g. latency analysis because otherwise the execution
time in guest or user mode would be wrongly accounted as interrupt disabled
and could trigger false positives.

Add hardirq tracing to guest enter/exit functions in the same way as it
is done in the user mode enter/exit code, respecting the RCU requirements.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Message-Id: <20200708195321.934715094@linutronix.de>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm/vmx: Add hardirq tracing to guest enter/exit

Entering guest mode is more or less the same as returning to user
space. From an instrumentation point of view both leave kernel mode and the
transition to guest or user mode reenables interrupts on the host. In user
mode an interrupt is served directly and in guest mode it causes a VM exit
which then handles or reinjects the interrupt.

The transition from guest mode or user mode to kernel mode disables
interrupts, which needs to be recorded in instrumentation to set the
correct state again.

This is important for e.g. latency analysis because otherwise the execution
time in guest or user mode would be wrongly accounted as interrupt disabled
and could trigger false positives.

Add hardirq tracing to guest enter/exit functions in the same way as it
is done in the user mode enter/exit code, respecting the RCU requirements.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Message-Id: <20200708195321.822002354@linutronix.de>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm: Move context tracking where it belongs

Context tracking for KVM happens way too early in the vcpu_run()
code. Anything after guest_enter_irqoff() and before guest_exit_irqoff()
cannot use RCU and should also be not instrumented.

The current way of doing this covers way too much code. Move it closer to
the actual vmenter/exit code.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Message-Id: <20200708195321.724574345@linutronix.de>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host

To avoid complex and in some cases incorrect logic in
kvm_spec_ctrl_test_value, just try the guest's given value on the host
processor instead, and if it doesn't #GP, allow the guest to set it.

One such case is when host CPU supports STIBP mitigation
but doesn't support IBRS (as is the case with some Zen2 AMD cpus),
and in this case we were giving guest #GP when it tried to use STIBP

The reason why can can do the host test is that IA32_SPEC_CTRL msr is
passed to the guest, after the guest sets it to a non zero value
for the first time (due to performance reasons),
and as as result of this, it is pointless to emulate #GP condition on
this first access, in a different way than what the host CPU does.

This is based on a patch from Sean Christopherson, who suggested this idea.

Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL")
Cc: [email protected]
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20200708115731 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM/x86: pmu: Fix #GP condition check for RDPMC emulation

In guest protected mode, if the current privilege level
is not 0 and the PCE flag in the CR4 register is cleared,
we will inject a #GP for RDPMC usage.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <20200708074409 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: take as_id into account when checking PGD

OVMF booted guest running on shadow pages crashes on TRIPLE FAULT after
enabling paging from SMM. The crash is triggered from mmu_check_root() and
is caused by kvm_is_visible_gfn() searching through memslots with as_id = 0
while vCPU may be in a different context (address space).

Introduce kvm_vcpu_is_visible_gfn() and use it from mmu_check_root().

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200708140023.1476020 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Move kvm_x86_ops.vcpu_after_set_cpuid() into kvm_vcpu_after_set_cpuid()

kvm_x86_ops.vcpu_after_set_cpuid() is used to update vmx/svm specific
vcpu settings based on updated CPUID settings. So it's supposed to be
called after CPUIDs are updated, i.e., kvm_update_cpuid_runtime().

Currently, kvm_update_cpuid_runtime() only updates CPUID bits of OSXSAVE,
APIC, OSPKE, MWAIT, KVM_FEATURE_PV_UNHALT and CPUID(0xD,0).ebx and
CPUID(0xD, 1).ebx. None of them is consumed by vmx/svm's
update_vcpu_after_set_cpuid(). So there is no dependency between them.

Move kvm_x86_ops.vcpu_after_set_cpuid() into kvm_vcpu_after_set_cpuid() is
obviously more reasonable.

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200709043426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid()

The name of callback cpuid_update() is misleading that it's not about
updating CPUID settings of vcpu but updating the configurations of vcpu
based on the CPUIDs. So rename it to vcpu_after_set_cpuid().

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200709043426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Rename kvm_update_cpuid() to kvm_vcpu_after_set_cpuid()

Now there is no updating CPUID bits behavior in kvm_update_cpuid(),
rename it to kvm_vcpu_after_set_cpuid().

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200709043426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Extract kvm_update_cpuid_runtime() from kvm_update_cpuid()

Beside called in kvm_vcpu_ioctl_set_cpuid*(), kvm_update_cpuid() is also
called 5 places else in x86.c and 1 place else in lapic.c. All those 6
places only need the part of updating guest CPUIDs (OSXSAVE, OSPKE, APIC,
KVM_FEATURE_PV_UNHALT, ...) based on the runtime vcpu state, so extract
them as a separate kvm_update_cpuid_runtime().

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200709043426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Introduce kvm_check_cpuid()

Use kvm_check_cpuid() to validate if userspace provides legal cpuid
settings and call it before KVM take any action to update CPUID or
update vcpu states based on given CPUID settings.

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200709043426 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Move kvm_apic_set_version() to kvm_update_cpuid()

There is no dependencies between kvm_apic_set_version() and
kvm_update_cpuid() because kvm_apic_set_version() queries X2APIC CPUID bit,
which is not touched/changed by kvm_update_cpuid().

Obviously, kvm_apic_set_version() belongs to the category of updating
vcpu model.

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200708065054 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: lapic: Use guest_cpuid_has() in kvm_apic_set_version()

Only code cleanup and no functional change.

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200708065054 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent

As handling of bits out of leaf 1 added over time, kvm_update_cpuid()
should not return directly if leaf 1 is absent, but should go on
updateing other CPUID leaves.

Keep the update of apic->lapic_timer.timer_mode_mask in a separate
wrapper, to minimize churn for code since it will be moved out of this
function in a future patch.

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200708065054 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Reset vcpu->arch.cpuid_nent to 0 if SET_CPUID* fails

Current implementation keeps userspace input of CPUID configuration and
cpuid->nent even if kvm_update_cpuid() fails. Reset vcpu->arch.cpuid_nent
to 0 for the case of failure as a simple fix.

Besides, update the doc to explicitly state that if IOCTL SET_CPUID*
fail KVM gives no gurantee that previous valid CPUID configuration is
kept.

Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200708065054 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: limit the maximum number of vPMU fixed counters to 3

Some new Intel platforms (such as TGL) already have the
fourth fixed counter TOPDOWN.SLOTS, but it has not been
fully enabled on KVM and the host.

Therefore, we limit edx.split.num_counters_fixed to 3,
so that it does not break the kvm-unit-tests PMU test
case and bad-handled userspace.

Signed-off-by: Like Xu <[email protected]>
Message-Id: <20200624015928 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests

According to section "Canonicalization and Consistency Checks" in APM vol. 2
the following guest state is illegal:

"Any MBZ bit of CR3 is set."
"Any MBZ bit of CR4 is set."

Suggeted-by: Paolo Bonzini <[email protected]>
Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <1594168797 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Make CR4.VMXE reserved for the guest

CR4.VMXE is reserved unless the VMX CPUID bit is set. On Intel,
it is also tested by vmx_set_cr4, but AMD relies on kvm_valid_cr4,
so fix it.

Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Create mask for guest CR4 reserved bits in kvm_update_cpuid()

Instead of creating the mask for guest CR4 reserved bits in kvm_valid_cr4(),
do it in kvm_update_cpuid() so that it can be reused instead of creating it
each time kvm_valid_cr4() is called.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <1594168797 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes

According to the SDM, when PAE paging would be in use following a
MOV-to-CR0 that modifies any of CR0.CD, CR0.NW, or CR0.PG, then the
PDPTEs are loaded from the address in CR3. Previously, kvm only loaded
the PDPTEs when PAE paging would be in use following a MOV-to-CR0 that
modified CR0.PG.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Message-Id: <20200707223630 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

xen: Mark "xen_nopvspin" parameter obsolete

Map "xen_nopvspin" to "nopvspin", fix stale description of "xen_nopvspin"
as we use qspinlock now.

Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm: Add "nopvspin" parameter to disable PV spinlocks

There are cases where a guest tries to switch spinlocks to bare metal
behavior (e.g. by setting "xen_nopvspin" on XEN platform and
"hv_nopvspin" on HYPER_V).

That feature is missed on KVM, add a new parameter "nopvspin" to disable
PV spinlocks for KVM guest.

The new 'nopvspin' parameter will also replace Xen and Hyper-V specific
parameters in future patches.

Define variable nopvsin as global because it will be used in future
patches as above.

Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm: Change print code to use pr_*() format

pr_*() is preferred than printk(KERN_* ...), after change all the print
in arch/x86/kernel/kvm.c will have "kvm-guest: xxx" style.

No functional change.

Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Revert "KVM: X86: Fix setup the virt_spin_lock_key before static key get initialized"

This reverts commit 34226b6b70980a8f81fff3c09a2c889f77edeeff.

Commit 8990cac6e5ea ("x86/jump_label: Initialize static branching
early") adds jump_label_init() call in setup_arch() to make static
keys initialized early, so we could use the original simpler code
again.

The similar change for XEN is in commit 090d54bcbc54 ("Revert
"x86/paravirt: Set up the virt_spin_lock_key after static keys get
initialized"")

Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Rename page_header() to to_shadow_page()

Rename KVM's accessor for retrieving a 'struct kvm_mmu_page' from the
associated host physical address to better convey what the function is
doing.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622202034 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Add sptep_to_sp() helper to wrap shadow page lookup

Introduce sptep_to_sp() to reduce the boilerplate code needed to get the
shadow page associated with a spte pointer, and to improve readability
as it's not immediately obvious that "page_header" is a KVM-specific
accessor for retrieving a shadow page.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622202034 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Make kvm_mmu_page definition and accessor internal-only

Make 'struct kvm_mmu_page' MMU-only, nothing outside of the MMU should
be poking into the gory details of shadow pages.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622202034 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Add MMU-internal header

Add mmu/mmu_internal.h to hold declarations and definitions that need
to be shared between various mmu/ files, but should not be used by
anything outside of the MMU.

Begin populating mmu_internal.h with declarations of the helpers used by
page_track.c.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622202034 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Move kvm_mmu_available_pages() into mmu.c

Move kvm_mmu_available_pages() from mmu.h to mmu.c, it has a single
caller and has no business being exposed via mmu.h.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622202034 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Move mmu_audit.c and mmutrace.h into the mmu/ sub-directory

Move mmu_audit.c and mmutrace.h under mmu/ where they belong.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622202034 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Exit to userspace on make_mmu_pages_available() error

Propagate any error returned by make_mmu_pages_available() out to
userspace instead of resuming the guest if the error occurs while
handling a page fault. Now that zapping the oldest MMU pages skips
active roots, i.e. fails if and only if there are no zappable pages,
there is no chance for a false positive, i.e. no chance of returning a
spurious error to userspace.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623193542 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Batch zap MMU pages when shrinking the slab

Use the recently introduced kvm_mmu_zap_oldest_mmu_pages() to batch zap
MMU pages when shrinking a slab.  This fixes a long standing issue where
KVM's shrinker implementation is completely ineffective due to zapping
only a single page.  E.g. without batch zapping, forcing a scan via
drop_caches basically has no impact on a VM with ~2k shadow pages.  With
batch zapping, the number of shadow pages can be reduced to a few
hundred pages in one or two runs of drop_caches.

Note, if the default batch size (currently 128) is problematic, e.g.
zapping 128 pages holds mmu_lock for too long, KVM can bound the batch
size by setting @batch in mmu_shrinker.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623193542 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Batch zap MMU pages when recycling oldest pages

Collect MMU pages for zapping in a loop when making MMU pages available,
and skip over active roots when doing so as zapping an active root can
never immediately free up a page. Batching the zapping avoids multiple
remote TLB flushes and remedies the issue where the loop would bail
early if an active root was encountered.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623193542 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Don't put invalid SPs back on the list of active pages

Delete a shadow page from the invalidation list instead of throwing it
back on the list of active pages when it's a root shadow page with
active users. Invalid active root pages will be explicitly freed by
mmu_free_root_page() when the root_count hits zero, i.e. they don't need
to be put on the active list to avoid leakage.

Use sp->role.invalid to detect that a shadow page has already been
zapped, i.e. is not on a list.

WARN if an invalid page is encountered when zapping pages, as it should
now be impossible.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623193542 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Optimize MMU page cache lookup for fully direct MMUs

Skip the unsync checks and the write flooding clearing for fully direct
MMUs, which are guaranteed to not have unsync'd or indirect pages (write
flooding detection only applies to indirect pages). For TDP, this
avoids unnecessary memory reads and writes, and for the write flooding
count will also avoid dirtying a cache line (unsync_child_bitmap itself
consumes a cache line, i.e. write_flooding_count is guaranteed to be in
a different cache line than parent_ptes).

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623194027 [email protected]>
Reviewed-By: Jon Cargille <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Avoid multiple hash lookups in kvm_get_mmu_page()

Refactor for_each_valid_sp() to take the list of shadow pages instead of
retrieving it from a gfn to avoid doing the gfn->list hash and lookup
multiple times during kvm_get_mmu_page().

Cc: Peter Feiner <[email protected]>
Cc: Jon Cargille <[email protected]>
Cc: Jim Mattson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623194027 [email protected]>
Reviewed-By: Jon Cargille <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Use VMCALL and VMMCALL mnemonics in kvm_para.h

Current minimum required version of binutils is 2.23,
which supports VMCALL and VMMCALL instruction mnemonics.

Replace the byte-wise specification of VMCALL and
VMMCALL with these proper mnemonics.

Signed-off-by: Uros Bizjak <[email protected]>
CC: Paolo Bonzini <[email protected]>
Message-Id: <20200623183439 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Rename svm_nested_virtualize_tpr() to nested_svm_virtualize_tpr()

Match the naming with other nested svm functions.

No functional changes.

Signed-off-by: Joerg Roedel <[email protected]>
Message-Id: <20200625080325 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Add svm_ prefix to set/clr/is_intercept()

Make clear the symbols belong to the SVM code when they are built-in.

No functional changes.

Signed-off-by: Joerg Roedel <[email protected]>
Message-Id: <20200625080325 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Add vmcb_ prefix to mark_*() functions

Make it more clear what data structure these functions operate on.

No functional changes.

Signed-off-by: Joerg Roedel <[email protected]>
Message-Id: <20200625080325 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Rename struct nested_state to svm_nested_state

Renaming is only needed in the svm.h header file.

No functional changes.

Signed-off-by: Joerg Roedel <[email protected]>
Message-Id: <20200625080325 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Wrap VM-Fail valid path in generic VM-Fail helper

Add nested_vmx_fail() to wrap VM-Fail paths that _may_ result in VM-Fail
Valid to make it clear at the call sites that the Valid flavor isn't
guaranteed.

Suggested-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200609015607 [email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Set last_vmentry_cpu in vcpu_enter_guest

Since this field is now in kvm_vcpu_arch, clean things up a little by
setting it in vendor-agnostic code: vcpu_enter_guest. Note that it
must be set after the call to kvm_x86_ops.run(), since it can't be
updated before pre_sev_run().

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Message-Id: <20200603235623 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Move last_cpu into kvm_vcpu_arch as last_vmentry_cpu

Both the vcpu_vmx structure and the vcpu_svm structure have a
'last_cpu' field. Move the common field into the kvm_vcpu_arch
structure. For clarity, rename it to 'last_vmentry_cpu.'

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Message-Id: <20200603235623 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Add "last CPU" to some KVM_EXIT information

More often than not, a failed VM-entry in an x86 production
environment is induced by a defective CPU. To help identify the bad
hardware, include the id of the last logical CPU to run a vCPU in the
information provided to userspace on a KVM exit for failed VM-entry or
for KVM internal errors not associated with emulation. The presence of
this additional information is indicated by a new capability,
KVM_CAP_LAST_CPU.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Message-Id: <20200603235623 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: vmx: Add last_cpu to struct vcpu_vmx

As we already do in svm, record the last logical processor on which a
vCPU has run, so that it can be communicated to userspace for
potential hardware errors.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Message-Id: <20200603235623 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: svm: Always set svm->last_cpu on VMRUN

Previously, this field was only set when using SEV. Set it for all
vCPU configurations, so that it can be communicated to userspace for
diagnosing potential hardware errors.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Message-Id: <20200603235623 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: svm: Prefer vcpu->cpu to raw_smp_processor_id()

The current logical processor id is cached in vcpu->cpu. Use it
instead of raw_smp_processor_id() when a kvm_vcpu struct is available.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <20200603235623 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: report sev_pin_memory errors with PTR_ERR

Callers of sev_pin_memory() treat
NULL differently:

sev_launch_secret()/svm_register_enc_region() return -ENOMEM
sev_dbg_crypt() returns -EFAULT.

Switching to ERR_PTR() preserves the error and enables cleaner reporting of
different kinds of failures.

Suggested-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: convert get_user_pages() --> pin_user_pages()

This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: John Hubbard <[email protected]>
Message-Id: <20200526062207.1360225 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: fix svn_pin_memory()'s use of get_user_pages_fast()

There are two problems in svn_pin_memory():

1) The return value of get_user_pages_fast() is stored in an
unsigned long, although the declared return value is of type int.
This will not cause any symptoms, but it is misleading.
Fix this by changing the type of npinned to "int".

2) The number of pages passed into get_user_pages_fast() is stored
in an unsigned long, even though get_user_pages_fast() accepts an
int. This means that it is possible to silently overflow the number
of pages.

Fix this by adding a WARN_ON_ONCE() and an early error return. The
npages variable is left as an unsigned long for convenience in
checking for overflow.

Fixes: 89c505809052 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_UPDATE_DATA command")
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: John Hubbard <[email protected]>
Message-Id: <20200526062207.1360225 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nSVM: Check that DR6[63:32] and DR7[64:32] are not set on vmrun of nested guests

According to section "Canonicalization and Consistency Checks" in APM vol. 2
the following guest state is illegal:

    "DR6[63:32] are not zero."
    "DR7[63:32] are not zero."
    "Any MBZ bit of EFER is set."

Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <20200522221954 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Move the check for upper 32 reserved bits of DR6 to separate function

Signed-off-by: Krish Sadhukhan <[email protected]>
Message-Id: <20200522221954 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Do the same ignore_msrs check for feature msrs

Logically the ignore_msrs and report_ignored_msrs should also apply to feature
MSRs. Add them in.

Signed-off-by: Peter Xu <[email protected]>
Message-Id: <20200622220442 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Move ignore_msrs handling upper the stack

MSR accesses can be one of:

  (1) KVM internal access,
  (2) userspace access (e.g., via KVM_SET_MSRS ioctl),
  (3) guest access.

The ignore_msrs was previously handled by kvm_get_msr_common() and
kvm_set_msr_common(), which is the bottom of the msr access stack.  It's
working in most cases, however it could dump unwanted warning messages to dmesg
even if kvm get/set the msrs internally when calling __kvm_set_msr() or
__kvm_get_msr() (e.g. kvm_cpuid()).  Ideally we only want to trap cases (2)
or (3), but not (1) above.

To achieve this, move the ignore_msrs handling upper until the callers of
__kvm_get_msr() and __kvm_set_msr().  To identify the "msr missing" event, a
new return value (KVM_MSR_RET_INVALID==2) is used for that.

Signed-off-by: Peter Xu <[email protected]>
Message-Id: <20200622220442 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Make .write_log_dirty a nested operation

Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it
from the non-nested dirty log hooks. And because it's a nested-only
operation.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622215832 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: WARN if PML emulation helper is invoked outside of nested guest

WARN if vmx_write_pml_buffer() is called outside of guest mode instead
of silently ignoring the condition. The only caller is nested EPT's
ept_update_accessed_dirty_bits(), which should only be reachable when
L2 is active.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622215832 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Drop kvm_arch_write_log_dirty() wrapper

Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty()
directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually
used for x86 functions that are invoked from generic KVM, and implies
that there are external callers, neither of which is true.

Remove the check for a non-NULL kvm_x86_ops hook as the call is wrapped
in PTTYPE_EPT and is unconditionally set by VMX.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622215832 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: async_pf: change kvm_setup_async_pf()/kvm_arch_setup_async_pf() return type to bool

Unlike normal 'int' functions returning '0' on success, kvm_setup_async_pf()/
kvm_arch_setup_async_pf() return '1' when a job to handle page fault
asynchronously was scheduled and '0' otherwise. To avoid the confusion
change return type to 'bool'.

No functional change intended.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200615121334 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault()

KVM guest code in Linux enables APF only when KVM_FEATURE_ASYNC_PF_INT
is supported, this means we will never see KVM_PV_REASON_PAGE_READY
when handling page fault vmexit in KVM.

While on it, make sure we only follow genuine page fault path when
APF reason is zero. If we happen to see something else this means
that the underlying hypervisor is misbehaving. Leave WARN_ON_ONCE()
to catch that.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'kvm-master' into HEAD

Merge 5.8-rc bugfixes.

Merge branch 'kvm-async-pf-int' into HEAD

KVM: MIPS: fix spelling mistake "Exteneded" -> "Extended"

There is a spelling mistake in a couple of kvm_err messages. Fix them.

Signed-off-by: Colin Ian King <[email protected]>
Message-Id: <20200615082636 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvmarm-fixes-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm fixes for 5.8, take #3

- Disable preemption on context-switching PMU EL0 state happening
on system register trap
- Don't clobber X0 when tearing down KVM via a soft reset (kexec)

KVM: arm64: Stop clobbering x0 for HVC_SOFT_RESTART

HVC_SOFT_RESTART is given values for x0-2 that it should installed
before exiting to the new address so should not set x0 to stub HVC
success or failure code.

Fixes: af42f20480bf1 ("arm64: hyp-stub: Zero x0 on successful stub handling")
Cc: [email protected]
Signed-off-by: Andrew Scull <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: PMU: Fix per-CPU access in preemptible context

Commit 07da1ffaa137 ("KVM: arm64: Remove host_cpu_context
member from vcpu structure") has, by removing the host CPU
context pointer, exposed that kvm_vcpu_pmu_restore_guest
is called in preemptible contexts:

[  266.932442] BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-aar/779
[  266.939721] caller is debug_smp_processor_id+0x20/0x30
[  266.944157] CPU: 2 PID: 779 Comm: qemu-system-aar Tainted: G            E     5.8.0-rc3-00015-g8d4aa58b2fe3 #1374
[  266.954268] Hardware name: amlogic w400/w400, BIOS 2020.04 05/22/2020
[  266.960640] Call trace:
[  266.963064]  dump_backtrace+0x0/0x1e0
[  266.966679]  show_stack+0x20/0x30
[  266.969959]  dump_stack+0xe4/0x154
[  266.973338]  check_preemption_disabled+0xf8/0x108
[  266.977978]  debug_smp_processor_id+0x20/0x30
[  266.982307]  kvm_vcpu_pmu_restore_guest+0x2c/0x68
[  266.986949]  access_pmcr+0xf8/0x128
[  266.990399]  perform_access+0x8c/0x250
[  266.994108]  kvm_handle_sys_reg+0x10c/0x2f8
[  266.998247]  handle_exit+0x78/0x200
[  267.001697]  kvm_arch_vcpu_ioctl_run+0x2ac/0xab8

Note that the bug was always there, it is only the switch to
using percpu accessors that made it obvious.
The fix is to wrap these accesses in a preempt-disabled section,
so that we sample a coherent context on trap from the guest.

Fixes: 435e53fb5e21 ("arm64: KVM: Enable VHE support for :G/:H perf event modifiers")
Cc:: Andrew Murray <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: VMX: Use KVM_POSSIBLE_CR*_GUEST_BITS to initialize guest/host masks

Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the
CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and
open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks
are effectively VMX specific even if they're not named as such. This
avoids duplicate code, better documents the guest owned CR0 bit, and
eliminates the need for a build-time assertion to keep VMX and x86
synchronized.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200703040422 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Mark CR4.TSD as being possibly owned by the guest

Mark CR4.TSD as being possibly owned by the guest as that is indeed the
case on VMX. Without TSD being tagged as possibly owned by the guest, a
targeted read of CR4 to get TSD could observe a stale value. This bug
is benign in the current code base as the sole consumer of TSD is the
emulator (for RDTSC) and the emulator always "reads" the entirety of CR4
when grabbing bits.

Add a build-time assertion in to ensure VMX doesn't hand over more CR4
bits without also updating x86.

Fixes: 52ce3c21aec3 ("x86,kvm,vmx: Don't trap writes to CR4.TSD")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200703040422 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Inject #GP if guest attempts to toggle CR4.LA57 in 64-bit mode

Inject a #GP on MOV CR4 if CR4.LA57 is toggled in 64-bit mode, which is
illegal per Intel's SDM:

  CR4.LA57
    57-bit linear addresses (bit 12 of CR4) ... blah blah blah ...
    This bit cannot be modified in IA-32e mode.

Note, the pseudocode for MOV CR doesn't call out the fault condition,
which is likely why the check was missed during initial development.
This is arguably an SDM bug and will hopefully be fixed in future
release of the SDM.

Fixes: fd8cb433734ee ("KVM: MMU: Expose the LA57 feature to VM.")
Cc: [email protected]
Reported-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200703021714 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: use more precise cast and do not drop __user

Sparse complains on a call to get_compat_sigset, fix it. The "if"
right above explains that sigmask_arg->sigset is basically a
compat_sigset_t.

Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvmarm-fixes-5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm fixes for 5.8, take #2

- Make sure a vcpu becoming non-resident doesn't race against the doorbell delivery
- Only advertise pvtime if accounting is enabled
- Return the correct error code if reset fails with SVE
- Make sure that pseudo-NMI functions are annotated as __always_inline

KVM: x86: bit 8 of non-leaf PDPEs is not reserved

Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
page table entries. Intel ignores it; AMD ignores it in PDEs and PDPEs, but
reserves it in PML4Es.

Probably, earlier versions of the AMD manual documented it as reserved in PDPEs
as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it.

Cc: [email protected]
Reported-by: Nadav Amit <[email protected]>
Fixes: a0c0feb57992 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03)
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Fix async pf caused null-ptr-deref

Syzbot reported that:

  CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:__apic_accept_irq+0x46/0xb80
  Call Trace:
   kvm_arch_async_page_present+0x7de/0x9e0
   kvm_check_async_pf_completion+0x18d/0x400
   kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0
   kvm_vcpu_ioctl+0x46a/0xe20
   ksys_ioctl+0x11a/0x180
   __x64_sys_ioctl+0x6f/0xb0
   do_syscall_64+0xf6/0x7d0
   entry_SYSCALL_64_after_hwframe+0x49/0xb3

The testcase enables APF mechanism in MSR_KVM_ASYNC_PF_EN with ASYNC_PF_INT
enabled w/o setting MSR_KVM_ASYNC_PF_INT before, what's worse, interrupt
based APF 'page ready' event delivery depends on in kernel lapic, however,
we didn't bail out when lapic is not in kernel during guest setting
MSR_KVM_ASYNC_PF_EN which causes the null-ptr-deref in host later.
This patch fixes it.

Reported-by: [email protected]
Fixes: 2635b5c4a0 (KVM: x86: interrupt based APF 'page ready' event delivery)
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1593426391 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvm-s390-master-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master

The current number of KVM_IRQCHIP_NUM_PINS results in an order 3
allocation (32kb) for each guest start/restart which can result in OOM
killer activity when kernel memory is fragmented enough.

This fix reduces the number of iopins as s390 doesn't use them, hence
reducing the memory footprint.

KVM: arm64: vgic-v4: Plug race between non-residency and v4.1 doorbell

When making a vPE non-resident because it has hit a blocking WFI,
the doorbell can fire at any time after the write to the RD.
Crucially, it can fire right between the write to GICR_VPENDBASER
and the write to the pending_last field in the its_vpe structure.

This means that we would overwrite pending_last with stale data,
and potentially not wakeup until some unrelated event (such as
a timer interrupt) puts the vPE back on the CPU.

GICv4 isn't affected by this as we actively mask the doorbell on
entering the guest, while GICv4.1 automatically manages doorbell
delivery without any hypervisor-driven masking.

Use the vpe_lock to synchronize such update, which solves the
problem altogether.

Fixes: ae699ad348cdc ("irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer")
Reported-by: Zenghui Yu <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru

Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was
moved to common x86 code.

No functional change intended.

Fixes: 37486135d3a7b ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200617034123 [email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling

The Linux TSC calibration procedure is subject to small variations
(its common to see +-1 kHz difference between reboots on a given CPU, for example).

So migrating a guest between two hosts with identical processor can fail, in case
of a small variation in calibrated TSC between them.

Without TSC scaling, the current kernel interface will either return an error
(if user_tsc_khz <= tsc_khz) or enable TSC catchup mode.

This change enables the following TSC tolerance check to
accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default).

        /*
         * Compute the variation in TSC rate which is acceptable
         * within the range of tolerance and decide if the
         * rate being applied is within that bounds of the hardware
         * rate.  If so, no scaling or compensation need be done.
         */
        thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
        thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
        if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
                pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi);
                use_scaling = 1;
        }

NTP daemon in the guest can correct this difference (NTP can correct upto 500ppm).

Signed-off-by: Marcelo Tosatti <[email protected]>
Message-Id: <20200616114741 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Fix MSR range of APIC registers in X2APIC mode

Only MSR address range 0x800 through 0x8ff is architecturally reserved
and dedicated for accessing APIC registers in x2APIC mode.

Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic")
Signed-off-by: Xiaoyao Li <[email protected]>
Message-Id: <20200616073307 [email protected]>
Cc: [email protected]
Reviewed-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL

Remove support for context switching between the guest's and host's
desired UMWAIT_CONTROL.  Propagating the guest's value to hardware isn't
required for correct functionality, e.g. KVM intercepts reads and writes
to the MSR, and the latency effects of the settings controlled by the
MSR are not architecturally visible.

As a general rule, KVM should not allow the guest to control power
management settings unless explicitly enabled by userspace, e.g. see
KVM_CAP_X86_DISABLE_EXITS.  E.g. Intel's SDM explicitly states that C0.2
can improve the performance of SMT siblings.  A devious guest could
disable C0.2 so as to improve the performance of their workloads at the
detriment to workloads running in the host or on other VMs.

Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
condition where updates from the host may cause KVM to enter the guest
with the incorrect value.  Because updates are are propagated to all
CPUs via IPI (SMP function callback), the value in hardware may be
stale with respect to the cached value and KVM could enter the guest
with the wrong value in hardware.  As above, the guest can't observe the
bad value, but it's a weird and confusing wart in the implementation.

Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
lists.  Using the lists is only necessary for MSRs that are required for
correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
old hardware, or for MSRs that need to-the-uop precision, e.g. perf
related MSRs.  For UMWAIT_CONTROL, the effects are only visible in the
kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
vcpu_vmx_run().  Using the atomic lists is undesirable as they are more
expensive than direct RDMSR/WRMSR.

Furthermore, even if giving the guest control of the MSR is legitimate,
e.g. in pass-through scenarios, it's not clear that the benefits would
outweigh the overhead.  E.g. saving and restoring an MSR across a VMX
roundtrip costs ~250 cycles, and if the guest diverged from the host
that cost would be paid on every run of the guest.  In other words, if
there is a legitimate use case then it should be enabled by a new
per-VM capability.

Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
UMWAIT and UMONITOR.

Fixes: 6e3ba4abcea56 ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Cc: [email protected]
Cc: Jingqi Liu <[email protected]>
Cc: Tao Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200623005135 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Plumb L2 GPA through to PML emulation

Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
intents and purposes is vmx_write_pml_buffer(), instead of having the
latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit
update is the result of KVM emulation (rare for L2), then the GPA in the
VMCS may be stale and/or hold a completely unrelated GPA.

Fixes: c5f983f6e8455 ("nVMX: Implement emulated Page Modification Logging")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200622215832 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic()

translate_gpa() returns a GPA, assigning it to 'real_gfn' seems obviously
wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and
we don't use the value in 'real_gfn' as a GFN, we do

real_gfn = gpa_to_gfn(real_gfn);

instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust
your eyes', but let's fix it for good.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200622151435 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: LAPIC: ensure APIC map is up to date on concurrent update requests

The following race can cause lost map update events:

         cpu1                            cpu2

                                apic_map_dirty = true
  ------------------------------------------------------------
                                kvm_recalculate_apic_map:
                                     pass check
                                         mutex_lock(&kvm->arch.apic_map_lock);
                                         if (!kvm->arch.apic_map_dirty)
                                     and in process of updating map
  -------------------------------------------------------------
    other calls to
       apic_map_dirty = true         might be too late for affected cpu
  -------------------------------------------------------------
                                     apic_map_dirty = false
  -------------------------------------------------------------
    kvm_recalculate_apic_map:
    bail out on
      if (!kvm->arch.apic_map_dirty)

To fix it, record the beginning of an update of the APIC map in
apic_map_dirty.  If another APIC map change switches apic_map_dirty
back to DIRTY during the update, kvm_recalculate_apic_map should not
make it CLEAN, and the other caller will go through the slow path.

Reported-by: Igor Mammedov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: lapic: fix broken vcpu hotplug

Guest fails to online hotplugged CPU with error
smpboot: do_boot_cpu failed(-1) to wakeup CPU#4

It's caused by the fact that kvm_apic_set_state(), which used to call
recalculate_apic_map() unconditionally and pulled hotplugged CPU into
apic map, is updating map conditionally on state changes. In this case
the APIC map is not considered dirty and the is not updated.

Fix the issue by forcing unconditional update from kvm_apic_set_state(),
like it used to be.

Fixes: 4abaffce4d25a ("KVM: LAPIC: Recalculate apic map in batch")
Cc: [email protected]
Signed-off-by: Igor Mammedov <[email protected]>
Message-Id: <20200622160830 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: arm64: pvtime: Ensure task delay accounting is enabled

Ensure we're actually accounting run_delay before we claim that we'll
expose it to the guest. If we're not, then we just pretend like steal
time isn't supported in order to avoid any confusion.

Signed-off-by: Andrew Jones <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: Fix kvm_reset_vcpu() return code being incorrect with SVE

If SVE is enabled then 'ret' can be assigned the return value of
kvm_vcpu_enable_sve() which may be 0 causing future "goto out" sites to
erroneously return 0 on failure rather than -EINVAL as expected.

Remove the initialisation of 'ret' and make setting the return value
explicit to avoid this situation in the future.

Fixes: 9a3cdf26e336 ("KVM: arm64/sve: Allow userspace to enable SVE for vcpus")
Cc: [email protected]
Reported-by: James Morse <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: Annotate hyp NMI-related functions as __always_inline

The "inline" keyword is a hint for the compiler to inline a function.  The
functions system_uses_irq_prio_masking() and gic_write_pmr() are used by
the code running at EL2 on a non-VHE system, so mark them as
__always_inline to make sure they'll always be part of the .hyp.text
section.

This fixes the following splat when trying to run a VM:

[   47.625273] Kernel panic - not syncing: HYP panic:
[   47.625273] PS:a00003c9 PC:0000ca0b42049fc4 ESR:86000006
[   47.625273] FAR:0000ca0b42049fc4 HPFAR:0000000010001000 PAR:0000000000000000
[   47.625273] VCPU:0000000000000000
[   47.647261] CPU: 1 PID: 217 Comm: kvm-vcpu-0 Not tainted 5.8.0-rc1-ARCH+ #61
[   47.654508] Hardware name: Globalscale Marvell ESPRESSOBin Board (DT)
[   47.661139] Call trace:
[   47.663659]  dump_backtrace+0x0/0x1cc
[   47.667413]  show_stack+0x18/0x24
[   47.670822]  dump_stack+0xb8/0x108
[   47.674312]  panic+0x124/0x2f4
[   47.677446]  panic+0x0/0x2f4
[   47.680407] SMP: stopping secondary CPUs
[   47.684439] Kernel Offset: disabled
[   47.688018] CPU features: 0x240402,20002008
[   47.692318] Memory Limit: none
[   47.695465] ---[ end Kernel panic - not syncing: HYP panic:
[   47.695465] PS:a00003c9 PC:0000ca0b42049fc4 ESR:86000006
[   47.695465] FAR:0000ca0b42049fc4 HPFAR:0000000010001000 PAR:0000000000000000
[   47.695465] VCPU:0000000000000000 ]---

The instruction abort was caused by the code running at EL2 trying to fetch
an instruction which wasn't mapped in the EL2 translation tables. Using
objdump showed the two functions as separate symbols in the .text section.

Fixes: 85738e05dc38 ("arm64: kvm: Unmask PMR before entering guest")
Cc: [email protected]
Signed-off-by: Alexandru Elisei <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Acked-by: James Morse <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU"

Guest crashes are observed on a Cascade Lake system when 'perf top' is
launched on the host, e.g.

BUG: unable to handle kernel paging request at fffffe0000073038
PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380
...
Call Trace:
  serial8250_console_write+0xfe/0x1f0
  call_console_drivers.constprop.0+0x9d/0x120
  console_unlock+0x1ea/0x460

Call traces are different but the crash is imminent. The problem was
blindly bisected to the commit 041bc42ce2d0 ("KVM: VMX: Micro-optimize
vmexit time when not exposing PMU"). It was also confirmed that the
issue goes away if PMU is exposed to the guest.

With some instrumentation of the guest we can see what is being switched
(when we do atomic_switch_perf_msrs()):

vmx_vcpu_run: switching 2 msrs
vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f
vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2

The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame.
Regardless of whether PMU is exposed to the guest or not, PEBS needs to
be disabled upon switch.

This reverts commit 041bc42ce2d0efac3b85bbb81dea8c74b81f4ef9.

Reported-by: Maxime Coquelin <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200619094046 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: s390: reduce number of IO pins to 1

The current number of KVM_IRQCHIP_NUM_PINS results in an order 3
allocation (32kb) for each guest start/restart. This can result in OOM
killer activity even with free swap when the memory is fragmented
enough:

kernel: qemu-system-s39 invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=3, oom_score_adj=0
kernel: CPU: 1 PID: 357274 Comm: qemu-system-s39 Kdump: loaded Not tainted 5.4.0-29-generic #33-Ubuntu
kernel: Hardware name: IBM 8562 T02 Z06 (LPAR)
kernel: Call Trace:
kernel: ([<00000001f848fe2a>] show_stack+0x7a/0xc0)
kernel:  [<00000001f8d3437a>] dump_stack+0x8a/0xc0
kernel:  [<00000001f8687032>] dump_header+0x62/0x258
kernel:  [<00000001f8686122>] oom_kill_process+0x172/0x180
kernel:  [<00000001f8686abe>] out_of_memory+0xee/0x580
kernel:  [<00000001f86e66b8>] __alloc_pages_slowpath+0xd18/0xe90
kernel:  [<00000001f86e6ad4>] __alloc_pages_nodemask+0x2a4/0x320
kernel:  [<00000001f86b1ab4>] kmalloc_order+0x34/0xb0
kernel:  [<00000001f86b1b62>] kmalloc_order_trace+0x32/0xe0
kernel:  [<00000001f84bb806>] kvm_set_irq_routing+0xa6/0x2e0
kernel:  [<00000001f84c99a4>] kvm_arch_vm_ioctl+0x544/0x9e0
kernel:  [<00000001f84b8936>] kvm_vm_ioctl+0x396/0x760
kernel:  [<00000001f875df66>] do_vfs_ioctl+0x376/0x690
kernel:  [<00000001f875e304>] ksys_ioctl+0x84/0xb0
kernel:  [<00000001f875e39a>] __s390x_sys_ioctl+0x2a/0x40
kernel:  [<00000001f8d55424>] system_call+0xd8/0x2c8

As far as I can tell s390x does not use the iopins as we bail our for
anything other than KVM_IRQ_ROUTING_S390_ADAPTER and the chip/pin is
only used for KVM_IRQ_ROUTING_IRQCHIP. So let us use a small number to
reduce the memory footprint.

Signed-off-by: Christian Borntraeger <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: VMX: Add helpers to identify interrupt type from intr_info

Add is_intr_type() and is_intr_type_n() to consolidate the boilerplate
code for querying a specific type of interrupt given an encoded value
from VMCS.VM_{ENTER,EXIT}_INTR_INFO, with and without an associated
vector respectively.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200609014518 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm/svm: disable KCSAN for svm_vcpu_run()

For some reasons, running a simple qemu-kvm command with KCSAN will
reset AMD hosts. It turns out svm_vcpu_run() could not be instrumented.
Disable it for now.

# /usr/libexec/qemu-kvm -name ubuntu-18.04-server-cloudimg -cpu host
-smp 2 -m 2G -hda ubuntu-18.04-server-cloudimg.qcow2

=== console output ===
Kernel 5.6.0-next-20200408+ on an x86_64

hp-dl385g10-05 login:

<...host reset...>

HPE ProLiant System BIOS A40 v1.20 (03/09/2018)
(C) Copyright 1982-2018 Hewlett Packard Enterprise Development LP
Early system initialization, please wait...

Signed-off-by: Qian Cai <[email protected]>
Message-Id: <20200415153709 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery

KVM now supports using interrupt for 'page ready' APF event delivery and
legacy mechanism was deprecated. Switch KVM guests to the new one.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20200525144125 [email protected]>
[Use HYPERVISOR_CALLBACK_VECTOR instead of a separate vector. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MIPS: Fix a build error for !CPU_LOONGSON64

During the KVM merging progress, a CONFIG_CPU_LOONGSON64 guard in commit
7f2a83f1c2a941ebfee53 ("KVM: MIPS: Add CPUCFG emulation for Loongson-3")
is missing by accident. So add it to avoid building error.

Fixes: 7f2a83f1c2a941ebfee53 ("KVM: MIPS: Add CPUCFG emulation for Loongson-3")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
Message-Id: <1592204215 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Linux 5.8-rc1

Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux

Pull SafeSetID update from Micah Morton:
"Add additional LSM hooks for SafeSetID

  SafeSetID is capable of making allow/deny decisions for set*uid calls
  on a system, and we want to add similar functionality for set*gid
  calls.

  The work to do that is not yet complete, so probably won't make it in
  for v5.8, but we are looking to get this simple patch in for v5.8
  since we have it ready.

  We are planning on the rest of the work for extending the SafeSetID
  LSM being merged during the v5.9 merge window"

* tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
  security: Add LSM hooks to set*gid syscalls

security: Add LSM hooks to set*gid syscalls

The SafeSetID LSM uses the security_task_fix_setuid hook to filter
set*uid() syscalls according to its configured security policy. In
preparation for adding analagous support in the LSM for set*gid()
syscalls, we add the requisite hook here. Tested by putting print
statements in the security_task_fix_setgid hook and seeing them get hit
during kernel boot.

Signed-off-by: Thomas Cedeno <[email protected]>
Signed-off-by: Micah Morton <[email protected]>

Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"This reverts the direct io port to iomap infrastructure of btrfs
  merged in the first pull request. We found problems in invalidate page
  that don't seem to be fixable as regressions or without changing iomap
  code that would not affect other filesystems.

  There are four reverts in total, but three of them are followup
  cleanups needed to revert a43a67a2d715 cleanly. The result is the
  buffer head based implementation of direct io.

  Reverts are not great, but under current circumstances I don't see
  better options"

* tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: switch to iomap_dio_rw() for dio"
  Revert "fs: remove dio_end_io()"
  Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
  Revert "btrfs: split btrfs_direct_IO to read and write part"

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from David Miller:

1) Fix cfg80211 deadlock, from Johannes Berg.

2) RXRPC fails to send norigications, from David Howells.

3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
    Geliang Tang.

4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

6) Fix hashtable memory leak in dccp, from Wang Hai.

7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

10) Fix link speed reporting in iavf driver, from Brett Creeley.

11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

12) Fix memory leak in genetlink, from Cong Wang.

13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: ethernet: ti: ale: fix allmulti for nu type ale
  net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
  net: atm: Remove the error message according to the atomic context
  bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
  libbpf: Support pre-initializing .bss global variables
  tools/bpftool: Fix skeleton codegen
  bpf: Fix memlock accounting for sock_hash
  bpf: sockmap: Don't attach programs to UDP sockets
  bpf: tcp: Recv() should return 0 when the peer socket is closed
  ibmvnic: Flush existing work items before device removal
  genetlink: clean up family attributes allocations
  net: ipa: header pad field only valid for AP->modem endpoint
  net: ipa: program upper nibbles of sequencer type
  net: ipa: fix modem LAN RX endpoint id
  net: ipa: program metadata mask differently
  ionic: add pcie_print_link_status
  rxrpc: Fix race between incoming ACK parser and retransmitter
  net/mlx5: E-Switch, Fix some error pointer dereferences
  net/mlx5: Don't fail driver on failure to create debugfs
  net/mlx5e: CT: Fix ipv6 nat header rewrite actions
  ...

Revert "btrfs: switch to iomap_dio_rw() for dio"

This reverts commit a43a67a2d715540c1368b9501a22b0373b5874c0.

This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.

The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.

Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.

There have been discussions how to fix that, but revert seems to be the
least intrusive option.

Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
Signed-off-by: David Sterba <[email protected]>

net: ethernet: ti: ale: fix allmulti for nu type ale

On AM65xx MCU CPSW2G NUSS and 66AK2E/L NUSS allmulti setting does not allow
unregistered mcast packets to pass.

This happens, because ALE VLAN entries on these SoCs do not contain port
masks for reg/unreg mcast packets, but instead store indexes of
ALE_VLAN_MASK_MUXx_REG registers which intended for store port masks for
reg/unreg mcast packets.
This path was missed by commit 9d1f6447274f ("net: ethernet: ti: ale: fix
seeing unreg mcast packets with promisc and allmulti disabled").

Hence, fix it by taking into account ALE type in cpsw_ale_set_allmulti().

Fixes: 9d1f6447274f ("net: ethernet: ti: ale: fix seeing unreg mcast packets with promisc and allmulti disabled")
Signed-off-by: Grygorii Strashko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init

The ALE parameters structure is created on stack, so it has to be reset
before passing to cpsw_ale_create() to avoid garbage values.

Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
Signed-off-by: Grygorii Strashko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>