Git Repo - linux.git/log

KVM: MIPS/T&E: Restore host asid on return to host

We only need the guest ASID loaded while in guest context, i.e. while
running guest code and while handling guest exits. We load the guest
ASID when entering the guest, however we restore the host ASID later
than necessary, when the VCPU state is saved i.e. vcpu_put() or slightly
earlier if preempted after returning to the host.

This mismatch is both unpleasant and causes redundant host ASID restores
in kvm_trap_emul_vcpu_put(). Lets explicitly restore the host ASID when
returning to the host, and don't bother restoring the host ASID on
context switch in unless we're already in guest context.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

KVM: MIPS: Add vcpu_run() & vcpu_reenter() callbacks

Add implementation callbacks for entering the guest (vcpu_run()) and
reentering the guest (vcpu_reenter()), allowing implementation specific
operations to be performed before entering the guest or after returning
to the host without cluttering kvm_arch_vcpu_ioctl_run().

This allows the T&E specific lazy user GVA flush to be moved into
trap_emul.c, along with disabling of the HTW. We also move
kvm_mips_deliver_interrupts() as VZ will need to restore the guest timer
state prior to delivering interrupts.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

KVM: MIPS: Remove duplicated ASIDs from vcpu

The kvm_vcpu_arch structure contains both mm_structs for allocating MMU
contexts (primarily the ASID) but it also copies the resulting ASIDs
into guest_{user,kernel}_asid[] arrays which are referenced from uasm
generated code.

This duplication doesn't seem to serve any purpose, and it gets in the
way of generalising the ASID handling across guest kernel/user modes, so
lets just extract the ASID straight out of the mm_struct on demand, and
in fact there are convenient cpu_context() and cpu_asid() macros for
doing so.

To reduce the verbosity of this code we do also add kern_mm and user_mm
local variables where the kernel and user mm_structs are used.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

KVM: MIPS/MMU: Move preempt/ASID handling to implementation

The MIPS KVM host and guest GVA ASIDs may need regenerating when
scheduling a process in guest context, which is done from the
kvm_arch_vcpu_load() / kvm_arch_vcpu_put() functions in mmu.c.

However this is a fairly implementation specific detail. VZ for example
may use GuestIDs instead of normal ASIDs to distinguish mappings
belonging to different guests, and even on VZ without GuestID the root
TLB will be used differently to trap & emulate.

Trap & emulate GVA ASIDs only relate to the user part of the full
address space, so can be left active during guest exit handling (guest
context) to allow guest instructions to be easily read and translated.

VZ root ASIDs however are for GPA mappings so can't be left active
during normal kernel code. They also aren't useful for accessing guest
virtual memory, and we should have CP0_BadInstr[P] registers available
to provide encodings of trapping guest instructions anyway.

Therefore move the ASID preemption handling into the implementation
callback.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

KVM: MIPS: Convert get/set_regs -> vcpu_load/put

Convert the get_regs() and set_regs() callbacks to vcpu_load() and
vcpu_put(), which provide a cpu argument and more closely match the
kvm_arch_vcpu_load() / kvm_arch_vcpu_put() that they are called by.

This is in preparation for moving ASID management into the
implementations.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

KVM: MIPS/MMU: Simplify ASID restoration

KVM T&E uses an ASID for guest kernel mode and an ASID for guest user
mode. The current ASID is saved when the guest is scheduled out, and
restored when scheduling back in, with checks for whether the ASID needs
to be regenerated.

This isn't really necessary as the ASID can be easily determined by the
current guest mode, so lets simplify it to just read the required ASID
from guest_kernel_asid or guest_user_asid even if the ASID hasn't been
regenerated.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

KVM: MIPS: Drop partial KVM_NMI implementation

MIPS incompletely implements the KVM_NMI ioctl to supposedly perform a
CPU reset, but all it actually does is invalidate the ASIDs. It doesn't
expose the KVM_CAP_USER_NMI capability which is supposed to indicate the
presence of the KVM_NMI ioctl, and no user software actually uses it on
MIPS.

Since this is dead code that would technically need updating for GVA
page table handling in upcoming patches, remove it now. If we wanted to
implement NMI injection later it can always be done properly along with
the KVM_CAP_USER_NMI capability, and if we wanted to implement a proper
CPU reset it would be better done with a separate ioctl.

Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]

Merge MIPS prerequisites

Merge in MIPS prerequisites from GVA page tables and GPA page tables
series. The same branch can also merge into the MIPS tree.

Signed-off-by: James Hogan <[email protected]>

MIPS: Add return errors to protected cache ops

The protected cache ops contain no out of line fixup code to return an
error code in the event of a fault, with the cache op being skipped in
that case. For KVM however we'd like to detect this case as page
faulting will be disabled so it could happen during normal operation if
the GVA page tables were flushed, and need to be handled by the caller.

Add the out-of-line fixup code to load the error value -EFAULT into the
return variable, and adapt the protected cache line functions to pass
the error back to the caller.

Signed-off-by: James Hogan <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: [email protected]
Cc: [email protected]

MIPS: Export some tlbex internals for KVM to use

Export to TLB exception code generating functions so that KVM can
construct a fast TLB refill handler for guest context without
reinventing the wheel quite so much.

Signed-off-by: James Hogan <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: [email protected]
Cc: [email protected]

MIPS: uasm: Add include guards in asm/uasm.h

Add include guards in asm/uasm.h to allow it to be safely used by a new
header asm/tlbex.h in the next patch to expose TLB exception building
functions for KVM to use.

Signed-off-by: James Hogan <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: [email protected]
Cc: [email protected]

MIPS: Export pgd/pmd symbols for KVM

Export pmd_init(), invalid_pmd_table and tlbmiss_handler_setup_pgd to
GPL kernel modules so that MIPS KVM can use the inline page table
management functions and switch between page tables:

- pmd_init() will be used directly by KVM to initialise newly allocated
  pmd tables with invalid lower level table pointers.

- invalid_pmd_table is used by pud_present(), pud_none(), and
  pud_clear(), which KVM will use to test and clear pud entries.

- tlbmiss_handler_setup_pgd() will be called by KVM entry code to switch
  to the appropriate GVA page tables.

Signed-off-by: James Hogan <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: [email protected]
Cc: [email protected]

MIPS: Move pgd_alloc() out of header

pgd_alloc() references init_mm which is not exported to modules. In
order for KVM to be able to use pgd_alloc() to allocate GVA page tables,
move pgd_alloc() into a new pgtable.c file and export it to modules.

Signed-off-by: James Hogan <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Krčmář" <[email protected]>
Cc: [email protected]
Cc: [email protected]

MIPS: KVM: Return directly after a failed copy_from_user() in kvm_arch_vcpu_ioctl()

* Return directly after a call of the function "copy_from_user" failed
in a case block.

* Delete the jump label "out" which became unnecessary with
this refactoring.

Signed-off-by: Markus Elfring <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: James Hogan <[email protected]>

KVM: arm/arm64: Remove kvm_vgic_inject_mapped_irq

The only benefit of having kvm_vgic_inject_mapped_irq separate from
kvm_vgic_inject_irq is that we pass a boolean that we use for error
checking on the injection path.

While this could potentially help in some aspect of robustness, it's
also a little bit of a defensive move, and arguably callers into the
vgic should have make sure they have marked their virtual IRQs as mapped
if required.

Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andre Przywara <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>

KVM: PPC: Book3S HV: Advertise availablity of HPT resizing on KVM HV

This updates the KVM_CAP_SPAPR_RESIZE_HPT capability to advertise the
presence of in-kernel HPT resizing on KVM HV.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation

This adds the "guts" of the implementation for the HPT resizing PAPR
extension. It has the code to allocate and clear a new HPT, rehash an
existing HPT's entries into it, and accomplish the switchover for a
KVM guest from the old HPT to the new one.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation

This adds a not yet working outline of the HPT resizing PAPR
extension. Specifically it adds the necessary ioctl() functions,
their basic steps, the work function which will handle preparation for
the resize, and synchronization between these, the guest page fault
path and guest HPT update path.

The actual guts of the implementation isn't here yet, so for now the
calls will always fail.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Create kvmppc_unmap_hpte_helper()

The kvm_unmap_rmapp() function, called from certain MMU notifiers, is used
to force all guest mappings of a particular host page to be set ABSENT, and
removed from the reverse mappings.

For HPT resizing, we will have some cases where we want to set just a
single guest HPTE ABSENT and remove its reverse mappings. To prepare with
this, we split out the logic from kvm_unmap_rmapp() to evict a single HPTE,
moving it to a new helper function.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size

The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
table (HPT) that userspace expects a guest VM to have, and is also used to
clear that HPT when necessary (e.g. guest reboot).

At present, once the ioctl() is called for the first time, the HPT size can
never be changed thereafter - it will be cleared but always sized as from
the first call.

With upcoming HPT resize implementation, we're going to need to allow
userspace to resize the HPT at reset (to change it back to the default size
if the guest changed it).

So, we need to allow this ioctl() to change the HPT size.

This patch also updates Documentation/virtual/kvm/api.txt to reflect
the new behaviour. In fact the documentation was already slightly
incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to
smaller HPT size in allocation ioctl"

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Split HPT allocation from activation

Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
and sets it up as the active page table for a VM.  For the upcoming HPT
resize implementation we're going to want to allocate HPTs separately from
activating them.

So, split the allocation itself out into kvmppc_allocate_hpt() and perform
the activation with a new kvmppc_set_hpt() function.  Likewise we split
kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
which unsets it as an active HPT, then frees it.

We also move the logic to fall back to smaller HPT sizes if the first try
fails into the single caller which used that behaviour,
kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
that previously if the initial attempt at CMA allocation failed, we would
fall back to attempting smaller sizes with the page allocator.  Now, we
try first CMA, then the page allocator at each size.  As far as I can tell
this change should be harmless.

To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
call to kvmppc_free_lpid() that was there, we move to the single caller.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Don't store values derivable from HPT order

Currently the kvm_hpt_info structure stores the hashed page table's order,
and also the number of HPTEs it contains and a mask for its size. The
last two can be easily derived from the order, so remove them and just
calculate them as necessary with a couple of helper inlines.

Signed-off-by: David Gibson <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure

Currently, the powerpc kvm_arch structure contains a number of variables
tracking the state of the guest's hashed page table (HPT) in KVM HV. This
patch gathers them all together into a single kvm_hpt_info substructure.
This makes life more convenient for the upcoming HPT resizing
implementation.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarity

The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
all obvious from the name. In practice kvmppc_alloc_hpt() allocates an HPT
by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
it with CMA only.

To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().

Signed-off-by: David Gibson <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: HPT resizing documentation and reserved numbers

This adds a new powerpc-specific KVM_CAP_SPAPR_RESIZE_HPT capability to
advertise whether KVM is capable of handling the PAPR extensions for
resizing the hashed page table during guest runtime.  It also adds
definitions for two new VM ioctl()s to implement this extension, and
documentation of the same.

Note that, HPT resizing is already possible with KVM PR without kernel
modification, since the HPT is managed within userspace (qemu).  The
capability defined here will only be set where an in-kernel implementation
of resizing is necessary, i.e. for KVM HV.  To determine if the userspace
resize implementation can be used, it's necessary to check
KVM_CAP_PPC_ALLOC_HTAB.  Unfortunately older kernels incorrectly set
KVM_CAP_PPC_ALLOC_HTAB even with KVM PR.  If userspace it want to support
resizing with KVM PR on such kernels, it will need a workaround.

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

Documentation: Correct duplicate section number in kvm/api.txt

Both KVM_CREATE_SPAPR_TCE_64 and KVM_REINJECT_CONTROL have section number
4.98 in Documentation/virtual/kvm/api.txt, presumably due to a naive merge.
This corrects the duplication.

[[email protected] - correct section numbers for following sections,
KVM_PPC_CONFIGURE_V3_MMU and KVM_PPC_GET_RMMU_INFO, as well.]

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next

This merges in the POWER9 radix MMU host and guest support, which
was put into a topic branch because it touches both powerpc and
KVM code.

Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Enable radix guest support

This adds a few last pieces of the support for radix guests:

* Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
  KVM_PPC_GET_RMMU_INFO ioctls for radix guests

* On POWER9, allow secondary threads to be on/off-lined while guests
  are running.

* Set up LPCR and the partition table entry for radix guests.

* Don't allocate the rmap array in the kvm_memory_slot structure
  on radix.

* Don't try to initialize the HPT for radix guests, since they don't
  have an HPT.

* Take out the code that prevents the HV KVM module from
  initializing on radix hosts.

At this stage, we only support radix guests if the host is running
in radix mode, and only support HPT guests if the host is running in
HPT mode.  Thus a guest cannot switch from one mode to the other,
which enables some simplifications.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Invalidate ERAT on guest entry/exit for POWER9 DD1

On POWER9 DD1, we need to invalidate the ERAT (effective to real
address translation cache) when changing the PIDR register, which
we do as part of guest entry and exit.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Allow guest exit path to have MMU on

If we allow LPCR[AIL] to be set for radix guests, then interrupts from
the guest to the host can be delivered by the hardware with relocation
on, and thus the code path starting at kvmppc_interrupt_hv can be
executed in virtual mode (MMU on) for radix guests (previously it was
only ever executed in real mode).

Most of the code is indifferent to whether the MMU is on or off, but
the calls to OPAL that use the real-mode OPAL entry code need to
be switched to use the virtual-mode code instead.  The affected
calls are the calls to the OPAL XICS emulation functions in
kvmppc_read_one_intr() and related functions.  We test the MSR[IR]
bit to detect whether we are in real or virtual mode, and call the
opal_rm_* or opal_* function as appropriate.

The other place that depends on the MMU being off is the optimization
where the guest exit code jumps to the external interrupt vector or
hypervisor doorbell interrupt vector, or returns to its caller (which
is __kvmppc_vcore_entry).  If the MMU is on and we are returning to
the caller, then we don't need to use an rfid instruction since the
MMU is already on; a simple blr suffices.  If there is an external
or hypervisor doorbell interrupt to handle, we branch to the
relocation-on version of the interrupt vector.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement

With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions.  Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu.  However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another.  Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly.  The usual symptom of this
is random segfaults in userspace programs in the guest.

To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed.  If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest.  This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush.  The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:

CPU 0 CPU 1 CPU 4
(core 0) (core 0) (core 1)

VCPU 0 runs task X      VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
VCPU 0 runs task X
Unmap pages of task X
tlbiel

(still VCPU 1) task X moves to VCPU 1
task X runs
task X sees stale TLB
entries

That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.

Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core.  To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it.  This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Make HPT-specific hypercalls return error in radix mode

If the guest is in radix mode, then it doesn't have a hashed page
table (HPT), so all of the hypercalls that manipulate the HPT can't
work and should return an error. This adds checks to make them
return H_FUNCTION ("function not supported").

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Implement dirty page logging for radix guests

This adds code to keep track of dirty pages when requested (that is,
when memslot->dirty_bitmap is non-NULL) for radix guests. We use the
dirty bits in the PTEs in the second-level (partition-scoped) page
tables, together with a bitmap of pages that were dirty when their
PTE was invalidated (e.g., when the page was paged out). This bitmap
is stored in the first half of the memslot->dirty_bitmap area, and
kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
bitmap that gets returned to userspace.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests

This adapts our implementations of the MMU notifier callbacks
(unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
to call radix functions when the guest is using radix. These
implementations are much simpler than for HPT guests because we
have only one PTE to deal with, so we don't need to traverse
rmap chains.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Page table construction and page faults for radix guests

This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU.  Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.

As well as hypervisor page faults for missing pages, we also get faults
for reference/change (RC) bits needing to be set, as well as various
other error conditions.  For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.

This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables.  There is no support for 1GB huge pages
yet.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests

This adds code to branch around the parts that radix guests don't
need - clearing and loading the SLB with the guest SLB contents,
saving the guest SLB contents on exit, and restoring the host SLB
contents.

Since the host is now using radix, we need to save and restore the
host value for the PID register.

On hypervisor data/instruction storage interrupts, we don't do the
guest HPT lookup on radix, but just save the guest physical address
for the fault (from the ASDR register) in the vcpu struct.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Add basic infrastructure for radix guests

This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Use ASDR for HPT guests on POWER9

POWER9 adds a register called ASDR (Access Segment Descriptor
Register), which is set by hypervisor data/instruction storage
interrupts to contain the segment descriptor for the address
being accessed, assuming the guest is using HPT translation.
(For radix guests, it contains the guest real address of the
access.)

Thus, for HPT guests on POWER9, we can use this register rather
than looking up the SLB with the slbfee. instruction.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9

This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
for HPT guests on POWER9. With this, we can return 1 for the
KVM_CAP_PPC_MMU_HASH_V3 capability.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU

This adds two capabilities and two ioctls to allow userspace to
find out about and configure the POWER9 MMU in a guest.  The two
capabilities tell userspace whether KVM can support a guest using
the radix MMU, or using the hashed page table (HPT) MMU with a
process table and segment tables.  (Note that the MMUs in the
POWER9 processor cores do not use the process and segment tables
when in HPT mode, but the nest MMU does).

The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
whether a guest will use the radix MMU or the HPT MMU, and to
specify the size and location (in guest space) of the process
table.

The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
the radix MMU.  It returns a list of supported radix tree geometries
(base page size and number of bits indexed at each level of the
radix tree) and the encoding used to specify the various page
sizes for the TLB invalidate entry instruction.

Initially, both capabilities return 0 and the ioctls return -EINVAL,
until the necessary infrastructure for them to operate correctly
is added.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: Allow for relocation-on interrupts from guest to host

With host and guest both using radix translation, it is feasible
for the host to take interrupts that come from the guest with
relocation on, and that is in fact what the POWER9 hardware will
do when LPCR[AIL] = 3. All such interrupts use HSRR0/1 not SRR0/1
except for system call with LEV=1 (hcall).

Therefore this adds the KVM tests to the _HV variants of the
relocation-on interrupt handlers, and adds the KVM test to the
relocation-on system call entry point.

We also instantiate the relocation-on versions of the hypervisor
data storage and instruction interrupt handlers, since these can
occur with relocation on in radix guests.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: Make type of partition table flush depend on partition type

When changing a partition table entry on POWER9, we do a particular
form of the tlbie instruction which flushes all TLBs and caches of
the partition table for a given logical partition ID (LPID).
This instruction has a field in the instruction word, labelled R
(radix), which should be 1 if the partition was previously a radix
partition and 0 if it was a HPT partition. This implements that
logic.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: Export pgtable_cache and pgtable_cache_add for KVM

This exports the pgtable_cache array and the pgtable_cache_add
function so that HV KVM can use them for allocating radix page
tables for guests.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: More definitions for POWER9

This adds definitions for bits in the DSISR register which are used
by POWER9 for various translation-related exception conditions, and
for some more bits in the partition table entry that will be needed
by KVM.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: Enable use of radix MMU under hypervisor on POWER9

To use radix as a guest, we first need to tell the hypervisor via
the ibm,client-architecture call first that we support POWER9 and
architecture v3.00, and that we can do either radix or hash and
that we would like to choose later using an hcall (the
H_REGISTER_PROC_TBL hcall).

Then we need to check whether the hypervisor agreed to us using
radix. We need to do this very early on in the kernel boot process
before any of the MMU initialization is done. If the hypervisor
doesn't agree, we can't use radix and therefore clear the radix
MMU feature bit.

Later, when we have set up our process table, which points to the
radix tree for each process, we need to install that using the
H_REGISTER_PROC_TBL hcall.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/pseries: Fixes for the "ibm,architecture-vec-5" options

This fixes the byte index values for some of the option bits in
the "ibm,architectur-vec-5" property. The "platform facilities options"
bits are in byte 17 not byte 14, so the upper 8 bits of their
definitions need to be 0x11 not 0x0E. The "sub processor support" option
is in byte 21 not byte 15.

Note none of these options are actually looked up in
"ibm,architecture-vec-5" at this time, so there is no bug.

When checking whether option bits are set, we should check that
the offset of the byte being checked is less than the vector
length that we got from the hypervisor.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: Don't try to use radix MMU under a hypervisor

Currently, if the kernel is running on a POWER9 processor under a
hypervisor, it will try to use the radix MMU even though it doesn't have
the necessary code to use radix under a hypervisor (it doesn't negotiate
use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). The
result is that the guest kernel will crash when it tries to turn on the
MMU.

This fixes it by looking for the /chosen/ibm,architecture-vec-5
property, and if it exists, clears the radix MMU feature bit, before we
decide whether to initialize for radix or HPT. This property is created
by the hypervisor as a result of the guest calling the
ibm,client-architecture-support method to indicate its capabilities, so
it will indicate whether the hypervisor agreed to us using radix.

Systems without a hypervisor may have this property also (for example,
skiboot creates it), so we check the HV bit in the MSR to see whether we
are running as a guest or not. If we are in hypervisor mode, then we can
do whatever we like including using the radix MMU.

The reason for using this property is that in future, when we have
support for using radix under a hypervisor, we will need to check this
property to see whether the hypervisor agreed to us using radix.

Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: [email protected] # v4.7+
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S: 64-bit CONFIG_RELOCATABLE support for interrupts

64-bit Book3S exception handlers must find the dynamic kernel base
to add to the target address when branching beyond __end_interrupts,
in order to support kernel running at non-0 physical address.

Support this in KVM by branching with CTR, similarly to regular
interrupt handlers. The guest CTR saved in HSTATE_SCRATCH1 and
restored after the branch.

Without this, the host kernel hangs and crashes randomly when it is
running at a non-0 address and a KVM guest is started.

Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

arm/arm64: KVM: Get rid of KVM_MEMSLOT_INCOHERENT

KVM_MEMSLOT_INCOHERENT is not used anymore, as we've killed its
only use in the arm/arm64 MMU code. Let's remove the last artifacts.

Reviewed-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

arm/arm64: KVM: Stop propagating cacheability status of a faulted page

Now that we unconditionally flush newly mapped pages to the PoC,
there is no need to care about the "uncached" status of individual
pages - they must all be visible all the way down.

Reviewed-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

arm/arm64: KVM: Enforce unconditional flush to PoC when mapping to stage-2

When we fault in a page, we flush it to the PoC (Point of Coherency)
if the faulting vcpu has its own caches off, so that it can observe
the page we just brought it.

But if the vcpu has its caches on, we skip that step. Bad things
happen when *another* vcpu tries to access that page with its own
caches disabled. At that point, there is no garantee that the
data has made it to the PoC, and we access stale data.

The obvious fix is to always flush to PoC when a page is faulted
in, no matter what the state of the vcpu is.

Cc: [email protected]
Fixes: 2d58b733c876 ("arm64: KVM: force cache clean on page fault when caches are off")
Reviewed-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: Documentation: Update arm-vgic-v3.txt

Update error code returned for Invalid CPU interface register
value and access in AArch32 mode.

Acked-by: Christoffer Dall <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Vijaya Kumar K <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: vgic: Implement KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO ioctl

Userspace requires to store and restore of line_level for
level triggered interrupts using ioctl KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO.

Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Vijaya Kumar K <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: vgic: Implement VGICv3 CPU interface access

VGICv3 CPU interface registers are accessed using
KVM_DEV_ARM_VGIC_CPU_SYSREGS ioctl. These registers are accessed
as 64-bit. The cpu MPIDR value is passed along with register id.
It is used to identify the cpu for registers access.

The VM that supports SEIs expect it on destination machine to handle
guest aborts and hence checked for ICC_CTLR_EL1.SEIS compatibility.
Similarly, VM that supports Affinity Level 3 that is required for AArch64
mode, is required to be supported on destination machine. Hence checked
for ICC_CTLR_EL1.A3V compatibility.

The arch/arm64/kvm/vgic-sys-reg-v3.c handles read and write of VGIC
CPU registers for AArch64.

For AArch32 mode, arch/arm/kvm/vgic-v3-coproc.c file is created but
APIs are not implemented.

Updated arch/arm/include/uapi/asm/kvm.h with new definitions
required to compile for AArch32.

The version of VGIC v3 specification is defined here
Documentation/virtual/kvm/devices/arm-vgic-v3.txt

Acked-by: Christoffer Dall <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Pavel Fedin <[email protected]>
Signed-off-by: Vijaya Kumar K <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: vgic: Introduce VENG0 and VENG1 fields to vmcr struct

ICC_VMCR_EL2 supports virtual access to ICC_IGRPEN1_EL1.Enable
and ICC_IGRPEN0_EL1.Enable fields. Add grpen0 and grpen1 member
variables to struct vmcr to support read and write of these fields.

Also refactor vgic_set_vmcr and vgic_get_vmcr() code.
Drop ICH_VMCR_CTLR_SHIFT and ICH_VMCR_CTLR_MASK macros and instead
use ICH_VMCR_EOI* and ICH_VMCR_CBPR* macros.

Signed-off-by: Vijaya Kumar K <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

irqchip/gic-v3: Add missing system register definitions

Define register definitions for ICH_VMCR_EL2, ICC_CTLR_EL1 and
ICH_VTR_EL2, ICC_BPR0_EL1, ICC_BPR1_EL1 registers.

Signed-off-by: Vijaya Kumar K <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: vgic: Introduce find_reg_by_id()

In order to implement vGICv3 CPU interface access, we will need to perform
table lookup of system registers. We would need both index_to_params() and
find_reg() exported for that purpose, but instead we export a single
function which combines them both.

Signed-off-by: Pavel Fedin <[email protected]>
Signed-off-by: Vijaya Kumar K <[email protected]>
Reviewed-by: Andre Przywara <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: vgic: Add distributor and redistributor access

VGICv3 Distributor and Redistributor registers are accessed using
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
with KVM_SET_DEVICE_ATTR and KVM_GET_DEVICE_ATTR ioctls.
These registers are accessed as 32-bit and cpu mpidr
value passed along with register offset is used to identify the
cpu for redistributor registers access.

The version of VGIC v3 specification is defined here
Documentation/virtual/kvm/devices/arm-vgic-v3.txt

Also update arch/arm/include/uapi/asm/kvm.h to compile for
AArch32 mode.

Signed-off-by: Vijaya Kumar K <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: arm/arm64: vgic: Implement support for userspace access

Read and write of some registers like ISPENDR and ICPENDR
from userspace requires special handling when compared to
guest access for these registers.

Refer to Documentation/virtual/kvm/devices/arm-vgic-v3.txt
for handling of ISPENDR, ICPENDR registers handling.

Add infrastructure to support guest and userspace read
and write for the required registers
Also moved vgic_uaccess from vgic-mmio-v2.c to vgic-mmio.c

Signed-off-by: Vijaya Kumar K <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

KVM: s390: Add debug logging to basic cpu model interface

Let's log something for changes in facilities, cpuid and ibc now that we
have a cpu model in QEMU. All of these calls are pretty seldom, so we
will not spill the log, the they will help to understand pontential
guest issues, for example if some instructions are fenced off.

As the s390 debug feature has a limited amount of parameters and
strings must not go away we limit the facility printing to 3 double
words, instead of building that list dynamically. This should be enough
for several years. If we ever exceed 3 double words then the logging
will be incomplete but no functional impact will happen.

Signed-off-by: Christian Borntraeger <[email protected]>

Merge tag 'kvm-s390-master-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kernelorgnext

avoid merge conflicts, pull update for master
also into next.

KVM: s390: Fix RRBE return code not being CC

reset_guest_reference_bit needs to return the CC, so we can set it in
the guest PSW when emulating RRBE. Right now it only returns 0.

Let's fix that.

Signed-off-by: Janosch Frank <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: guestdbg: filter PER i-fetch on EXECUTE properly

When we get a PER i-fetch event on an EXECUTE or EXECUTE RELATIVE LONG
instruction, because the executed instruction generated a PER i-fetch
event, then the PER address points at the EXECUTE function, not the
fetched one.

Therefore, when filtering PER events, we have to take care of the
really fetched instruction, which we can only get by reading in guest
virtual memory.

For icpt code 4 and 56, we directly have additional information about an
EXECUTE instruction at hand. For icpt code 8, we always have to read
in guest virtual memory.

Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
[small fixes]

KVM: s390: prepare to read random guest instructions

We will have to read instructions not residing at the current PSW
address.

Reviewed-by: Eric Farman <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: guestdbg: filter i-fetch events on icpts

We already filter PER events reported via icpt code 8. For icpt code
4 and 56, this is still missing.

So let's properly detect if we have a debugging event and if we have to
inject a PER i-fetch event into the guest at all.

Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: Introduce BCD Vector Instructions to the guest

We can directly forward the vector BCD instructions to the guest
if available and VX is requested by user space.

Please note that user space will have to take care of the final state
of the facility bit when migrating to older machines.

Signed-off-by: Guenther Hutzl <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: Introduce Vector Enhancements facility 1 to the guest

We can directly forward the vector enhancement facility 1 to the guest
if available and VX is requested by user space.

Please note that user space will have to take care of the final state
of the facility bit when migrating to older machines.

Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Maxim Samoylov <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: Get rid of ar_t

sparse with __CHECK_ENDIAN__ shows that ar_t was never properly
used across KVM on s390. We can now:
- fix all places
- do not make ar_t special
Since ar_t is just used as a register number (no endianness issues
for u8), and all other register numbers are also just plain int
variables, let's just use u8, which matches the __u8 in the userspace
ABI for the memop ioctl.

Signed-off-by: Christian Borntraeger <[email protected]>
Acked-by: Janosch Frank <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>

KVM: s390: get rid of bogus cc initialization

The plo inline assembly has a cc output operand that is always written
to and is also as such an operand declared. Therefore the compiler is
free to omit the rather pointless and misleading initialization.

Get rid of this.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: instruction-execution-protection support

The new Instruction Execution Protection needs to be enabled before
the guest can use it. Therefore we pass the IEP facility bit to the
guest and enable IEP interpretation.

Signed-off-by: Janosch Frank <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

KVM: s390: gaccess: add ESOP2 handling

When we access guest memory and run into a protection exception, we
need to pass the exception data to the guest. ESOP2 provides detailed
information about all protection exceptions which ESOP1 only partially
provided.

The gaccess changes make sure, that the guest always gets all
available information.

Signed-off-by: Janosch Frank <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>

kvm: x86: mmu: Verify that restored PTE has needed perms in fast page fault

Before fast page fault restores an access track PTE back to a regular PTE,
it now also verifies that the restored PTE would grant the necessary
permissions for the faulting access to succeed. If not, it falls back
to the slow page fault path.

Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: mmu: Move pgtbl walk inside retry loop in fast_page_fault

Redo the page table walk in fast_page_fault when retrying so that we are
working on the latest PTE even if the hierarchy changes.

Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: mmu: Update comment in mark_spte_for_access_track

Reword the comment to hopefully make it more clear.

Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: mmu: Set SPTE_SPECIAL_MASK within mmu.c

Instead of the caller including the SPTE_SPECIAL_MASK in the masks being
supplied to kvm_mmu_set_mmio_spte_mask() and kvm_mmu_set_mask_ptes(),
those functions now themselves include the SPTE_SPECIAL_MASK.

Note that bit 63 is now reset in the default MMIO mask.

Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: mmu: Rename EPT_VIOLATION_READ/WRITE/INSTR constants

Rename the EPT_VIOLATION_READ/WRITE/INSTR constants to
EPT_VIOLATION_ACC_READ/WRITE/INSTR to more clearly indicate that these
signify the type of the memory access as opposed to the permissions
granted by the PTE.

Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: PPC: Book3S PR: Refactor program interrupt related code into separate function

The function kvmppc_handle_exit_pr() is quite huge and thus hard to read,
and even contains a "spaghetti-code"-like goto between the different case
labels of the big switch statement. This can be made much more readable
by moving the code related to injecting program interrupts / instruction
emulation into a separate function instead.

Signed-off-by: Thomas Huth <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpu

The H_PROD hypercall is supposed to wake up an idle vcpu.  We have
an implementation, but because Linux doesn't use it except when
doing cpu hotplug, it was never tested properly.  AIX does use it,
and reported it broken.  It turns out we were waking the wrong
vcpu (the one doing H_PROD, not the target of the prod) and we
weren't handling the case where the target needs an IPI to wake
it.  Fix it by using the existing kvmppc_fast_vcpu_kick_hv()
function, which is intended for this kind of thing, and by using
the target vcpu not the current vcpu.

We were also not looking at the prodded flag when checking whether a
ceded vcpu should wake up, so this adds checks for the prodded flag
alongside the checks for pending exceptions.

Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S: Move 64-bit KVM interrupt handler out from alt section

A subsequent patch to make KVM handlers relocation-safe makes them
unusable from within alt section "else" cases (due to the way fixed
addresses are taken from within fixed section head code).

Stop open-coding the KVM handlers, and add them both as normal. A more
optimal fix may be to allow some level of alternate feature patching in
the exception macros themselves, but for now this will do.

The TRAMP_KVM handlers must be moved to the "virt" fixed section area
(name is arbitrary) in order to be closer to .text and avoid the dreaded
"relocation truncated to fit" error.

Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book3S: Change interrupt call to reduce scratch space use on HV

Change the calling convention to put the trap number together with
CR in two halves of r12, which frees up HSTATE_SCRATCH2 in the HV
handler.

The 64-bit PR handler entry translates the calling convention back
to match the previous call convention (i.e., shared with 32-bit), for
simplicity.

Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend

This patch improves the code that takes lock twice to check the resend flag
and do the actual resending, by checking the resend flag locklessly, and
add a boolean parameter check_resend to icp_[rm_]deliver_irq(), so the
resend flag can be checked in the lock when doing the delivery.

We need make sure when we clear the ics's bit in the icp's resend_map, we
don't miss the resend flag of the irqs that set the bit. It could be
ordered through the barrier in test_and_clear_bit(), and a newly added
wmb between setting irq's resend flag, and icp's resend_map.

Signed-off-by: Li Zhong <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book 3S: XICS: Implement ICS P/Q states

This patch implements P(Presented)/Q(Queued) states for ICS irqs.

When the interrupt is presented, set P. Present if P was not set.
If P is already set, don't present again, set Q.
When the interrupt is EOI'ed, move Q into P (and clear Q). If it is
set, re-present.

The asserted flag used by LSI is also incorporated into the P bit.

When the irq state is saved, P/Q bits are also saved, they need some
qemu modifications to be recognized and passed around to be restored.
KVM_XICS_PENDING bit set and saved should also indicate
KVM_XICS_PRESENTED bit set and saved. But it is possible some old
code doesn't have/recognize the P bit, so when we restore, we set P
for PENDING bit, too.

The idea and much of the code come from Ben.

Signed-off-by: Li Zhong <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends

It is possible that in the following order, one irq is resent twice:

        CPU 1                                   CPU 2

ics_check_resend()
  lock ics_lock
    see resend set
  unlock ics_lock
                                       /* change affinity of the irq */
                                       kvmppc_xics_set_xive()
                                         write_xive()
                                           lock ics_lock
                                             see resend set
                                           unlock ics_lock

                                         icp_deliver_irq() /* resend */

  icp_deliver_irq() /* resend again */

It doesn't have any user-visible effect at present, but needs to be avoided
when the following patch implementing the P/Q stuff is applied.

This patch clears the resend flag before releasing the ics lock, when we
know we will do a re-delivery after checking the flag, or setting the flag.

Signed-off-by: Li Zhong <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter

Some counters are added in Commit 6e0365b78273 ("KVM: PPC: Book3S HV:
Add ICP real mode counters"), to provide some performance statistics to
determine whether further optimizing is needed for real mode functions.

The n_reject counter counts how many times ICP rejects an irq because of
priority in real mode. The redelivery of an lsi that is still asserted
after eoi doesn't fall into this category, so the increasement there is
removed.

Also, it needs to be increased in icp_rm_deliver_irq() if it rejects
another one.

Signed-off-by: Li Zhong <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT

Commit b0221556dbd3 ("KVM: PPC: Book3S HV: Move virtual mode ICP functions
to real-mode") removed the setting of the XICS_RM_REJECT flag. And
since that commit, nothing else sets the flag any more, so we can remove
the flag and the remaining code that handles it, including the counter
that counts how many times it get set.

Signed-off-by: Li Zhong <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: Don't try to signal cpu -1

If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on
any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it
happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with
-1 as the cpu argument.  Although this is not meaningful, in the past,
before commit 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs
to other cores on POWER9", 2016-11-18), it was harmless because CPU
-1 is not in the same core as any real CPU thread.  On a POWER9,
however, we don't do the "same core" check, so we were trying to
do a msgsnd to thread -1, which is invalid.  To avoid this, we add
a check to see that vcpu->arch.thread_cpu is >= 0 before calling
kvmppc_ipi_thread() with it.  Since vcpu->arch.thread_vcpu can change
asynchronously, we use READ_ONCE to ensure that the value we check is
the same value that we use as the argument to kvmppc_ipi_thread().

Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Signed-off-by: Paul Mackerras <[email protected]>

KVM: arm/arm64: vgic: Add debugfs vgic-state file

Add a file to debugfs to read the in-kernel state of the vgic. We don't
do any locking of the entire VGIC state while traversing all the IRQs,
so if the VM is running the user/developer may not see a quiesced state,
but should take care to pause the VM using facilities in user space for
that purpose.

We also don't support LPIs yet, but they can be added easily if needed.

Reviewed-by: Eric Auger <[email protected]>
Tested-by: Eric Auger <[email protected]>
Tested-by: Andre Przywara <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>

KVM: arm/arm64: Remove struct vgic_irq pending field

One of the goals behind the VGIC redesign was to get rid of cached or
intermediate state in the data structures, but we decided to allow
ourselves to precompute the pending value of an IRQ based on the line
level and pending latch state. However, this has now become difficult
to base proper GICv3 save/restore on, because there is a potential to
modify the pending state without knowing if an interrupt is edge or
level configured.

See the following post and related message for more background:
https://lists.cs.columbia.edu/pipermail/kvmarm/2017-January/023195.html

This commit gets rid of the precomputed pending field in favor of a
function that calculates the value when needed, irq_is_pending().

The soft_pending field is renamed to pending_latch to represent that
this latch is the equivalent hardware latch which gets manipulated by
the input signal for edge-triggered interrupts and when writing to the
SPENDR/CPENDR registers.

After this commit save/restore code should be able to simply restore the
pending_latch state, line_level state, and config state in any order and
get the desired result.

Reviewed-by: Andre Przywara <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Tested-by: Andre Przywara <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>

Linux 4.10-rc5

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
"Restore the retrigger callbacks in the IO APIC irq chips. That
  addresses a long standing regression which got introduced with the
  rewrite of the x86 irq subsystem two years ago and went unnoticed so
  far"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioapic: Restore IO-APIC irq_chip retrigger callback

Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull smp/hotplug fix from Thomas Gleixner:
"Remove an unused variable which is a leftover from the notifier
removal"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove unused but set variable in _cpu_down()

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio/vhost fixes from Michael Tsirkin:
"Random fixes and cleanups that accumulated over the time"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio/s390: virtio: constify virtio_config_ops structures
  virtio/s390: add missing \n to end of dev_err message
  virtio/s390: support READ_STATUS command for virtio-ccw
  tools/virtio/ringtest: tweaks for s390
  tools/virtio/ringtest: fix run-on-all.sh for offline cpus
  virtio_console: fix a crash in config_work_handler
  vhost/scsi: silence uninitialized variable warning
  vhost: scsi: constify target_core_fabric_ops structures

Merge branch 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux

Pull thermal management fixes from Zhang Rui:

- fix a regression that thermal zone dynamically allocated sysfs
   attributes are freed before they're removed, which is introduced in
   4.10-rc1 (Jacob von Chorus)

- fix a boot warning because deprecated hwmon API is used (Fabio
   Estevam)

- a couple of fixes for rockchip thermal driver (Brian Norris, Caesar
   Wang)

* 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
  thermal: rockchip: fixes the conversion table
  thermal: core: move tz->device.groups cleanup to thermal_release
  thermal: thermal_hwmon: Convert to hwmon_device_register_with_info()
  thermal: rockchip: handle set_trips without the trip points
  thermal: rockchip: optimize the conversion table
  thermal: rockchip: fixes invalid temperature case
  thermal: rockchip: don't pass table structs by value
  thermal: rockchip: improve conversion error messages

Merge tag 'usb-4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are a few small USB fixes for 4.10-rc5.

  Most of these are gadget/dwc2 fixes for reported issues, all of these
  have been in linux-next for a while. The last one is a single xhci
  WARN_ON removal to handle an issue that the dwc3 driver is hitting in
  the 4.10-rc tree. The warning is harmless and needs to be removed, and
  a "real" fix that is more complex will show up in 4.11-rc1 for this
  device.

  That last patch hasn't been in linux-next yet due to the weekend
  timing, but it's a "simple" WARN_ON() removal so what could go wrong?
  :)"

Famous last words.

* tag 'usb-4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  xhci: remove WARN_ON if dma mask is not set for platform devices
  usb: dwc2: host: fix Wmaybe-uninitialized warning
  usb: dwc2: gadget: Fix GUSBCFG.USBTRDTIM value
  usb: gadget: udc: atmel: remove memory leak
  usb: dwc3: exynos fix axius clock error path to do cleanup
  usb: dwc2: Avoid suspending if we're in gadget mode
  usb: dwc2: use u32 for DT binding parameters
  usb: gadget: f_fs: Fix iterations on endpoints.
  usb: dwc2: gadget: Fix DMA memory freeing
  usb: gadget: composite: Fix function used to free memory

Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
"Two fixes:

   - a regression fix for the multiple-pmem-namespace-per-region support
     added in 4.9. Even if an existing environment is not using that
     feature the act of creating and a destroying a single namespace
     with the ndctl utility will lead to the proliferation of extra
     unwanted namespace devices.

   - a fix for the error code returned from the pmem driver when the
     memcpy_mcsafe() routine returns -EFAULT. Btrfs seems to be the only
     block I/O consumer that tries to parse the meaning of the error
     code when it is non-zero.

  Neither of these fixes are critical, the namespace leak is awkward in
  that it can cause device naming to change and complicates debugging
  namespace initialization issues. The error code fix is included out of
  caution for what other consumers might be expecting -EIO for block I/O
  errors"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, namespace: fix pmem namespace leak, delete when size set to zero
  pmem: return EIO on read_pmem() failure

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fix from Stephen Boyd:
"One fix for Samsung Exynos524x SoCs where recent IOMMU patches have
  caused some of these clocks to turn off when they were always left on
  before"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk/samsung: exynos542x: mark some clocks as critical

Merge tag 'arc-4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:

- more intc updates [Yuriv]

- fix module build when unwinder is turned off

- IO Coherency Programming model updates

- other miscellaneous

* tag 'arc-4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: Revert "ARC: mm: IOC: Don't enable IOC by default"
  ARC: mm: split arc_cache_init to allow __init reaping of bulk
  ARCv2: IOC: Use actual memory size to setup aperture size
  ARCv2: IOC: Adhere to progamming model guidelines to avoid DMA corruption
  ARCv2: IOC: refactor the IOC and SLC operations into own functions
  ARC: module: Fix !CONFIG_ARC_DW2_UNWIND builds
  ARCv2: save r30 on kernel entry as gcc uses it for code-gen
  ARCv2: IRQ: Call entry/exit functions for chained handlers in MCIP
  ARC: IRQ: Use hwirq instead of virq in mask/unmask
  ARC: mmu: clarify the MMUv3 programming model

Merge tag 'powerpc-4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
"Two fixes for fallout from the hugetlb changes we merged this cycle.

  Ten other fixes, four only affect Power9, and the rest are a bit of a
  mixture though nothing terrible.

  Thanks to: Aneesh Kumar K.V, Anton Blanchard, Benjamin Herrenschmidt,
  Dave Martin, Gavin Shan, Madhavan Srinivasan, Nicholas Piggin, Reza
  Arbab"

* tag 'powerpc-4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc: Ignore reserved field in DCSR and PVR reads and writes
  powerpc/ptrace: Preserve previous TM fprs/vsrs on short regset write
  powerpc/ptrace: Preserve previous fprs/vsrs on short regset write
  powerpc/perf: Use MSR to report privilege level on P9 DD1
  selftest/powerpc: Wrong PMC initialized in pmc56_overflow test
  powerpc/eeh: Enable IO path on permanent error
  powerpc/perf: Fix PM_BRU_CMPL event code for power9
  powerpc/mm: Fix little-endian 4K hugetlb
  powerpc/mm/hugetlb: Don't panic when we don't find the default huge page size
  powerpc: Fix pgtable pmd cache init
  powerpc/icp-opal: Fix missing KVM case and harden replay
  powerpc/mm: Fix memory hotplug BUG() on radix

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
"ARM:
   - Fix for timer setup on VHE machines
   - Drop spurious warning when the timer races against the vcpu running
     again
   - Prevent a vgic deadlock when the initialization fails (for stable)

  s390:
   - Fix a kernel memory exposure (for stable)

  x86:
   - Fix exception injection when hypercall instruction cannot be
     patched"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: do not expose random data via facility bitmap
  KVM: x86: fix fixing of hypercalls
  KVM: arm/arm64: vgic: Fix deadlock on error handling
  KVM: arm64: Access CNTHCTL_EL2 bit fields correctly on VHE systems
  KVM: arm/arm64: Fix occasional warning from the timer work function

Merge branch 'scsi-target-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/bvanassche/linux

Pull SCSI target fixes from Bart Van Assche:

- two small fixes for the ibmvscsis driver

- ten patches with bug fixes for the target mode of the qla2xxx driver

- four patches that avoid that the "sparse" and "smatch" static
   analyzer tools report false positives for the qla2xxx code base

* 'scsi-target-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/bvanassche/linux:
  qla2xxx: Disable out-of-order processing by default in firmware
  qla2xxx: Fix erroneous invalid handle message
  qla2xxx: Reduce exess wait during chip reset
  qla2xxx: Terminate exchange if corrupted
  qla2xxx: Fix crash due to null pointer access
  qla2xxx: Collect additional information to debug fw dump
  qla2xxx: Reset reserved field in firmware options to 0
  qla2xxx: Set tcm_qla2xxx version to automatically track qla2xxx version
  qla2xxx: Include ATIO queue in firmware dump when in target mode
  qla2xxx: Fix wrong IOCB type assumption
  qla2xxx: Avoid that building with W=1 triggers complaints about set-but-not-used variables
  qla2xxx: Move two arrays from header files to .c files
  qla2xxx: Declare an array with file scope static
  qla2xxx: Fix indentation
  ibmvscsis: Fix sleeping in interrupt context
  ibmvscsis: Fix max transfer length