Git Repo - linux.git/log

KVM: x86/mmu: Refactor TDP MMU iter need resched check

Refactor the TDP MMU iterator "need resched" checks into a helper
function so they can be called from a different code path in a
subsequent commit.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rebase on a swapped order of checks]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Demote the WARN on yielded in xxx_cond_resched() to KVM_MMU_WARN_ON

Convert the WARN in tdp_mmu_iter_cond_resched() that the iterator hasn't
already yielded to a KVM_MMU_WARN_ON() so the code is compiled out for
production kernels (assuming production kernels disable KVM_PROVE_MMU).

Checking for a needed reschedule is a hot path, and KVM sanity checks
iter->yielded in several other less-hot paths, i.e. the odds of KVM not
flagging that something went sideways are quite low. Furthermore, the
odds of KVM not noticing *and* the WARN detecting something worth
investigating are even lower.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Check yielded_gfn for forward progress iff resched is needed

Swap the order of the checks in tdp_mmu_iter_cond_resched() so that KVM
checks to see if a resched is needed _before_ checking to see if yielding
must be disallowed to guarantee forward progress.  Iterating over TDP MMU
SPTEs is a hot path, e.g. tearing down a root can touch millions of SPTEs,
and not needing to reschedule is by far the common case.  On the other
hand, disallowing yielding because forward progress has not been made is a
very rare case.

Returning early for the common case (no resched), effectively reduces the
number of checks from 2 to 1 for the common case, and should make the code
slightly more predictable for the CPU.

To resolve a weird conundrum where the forward progress check currently
returns false, but the need resched check subtly returns iter->yielded,
which _should_ be false (enforced by a WARN), return false unconditionally
(which might also help make the sequence more predictable).  If KVM has a
bug where iter->yielded is left danging, continuing to yield is neither
right nor wrong, it was simply an artifact of how the original code was
written.

Unconditionally returning false when yielding is unnecessary or unwanted
will also allow extracting the "should resched" logic to a separate helper
in a future patch.

Cc: David Matlack <[email protected]>
Reviewed-by: James Houghton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs

Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes
when zapping collapsible SPTEs, rather than freezing them first.

Freezing the SPTE first is not required. It is fine for another thread
holding mmu_lock for read to immediately install a present entry before
TLBs are flushed because the underlying mapping is not changing. vCPUs
that translate through the stale 4K mappings or a new huge page mapping
will still observe the same GPA->HPA translations.

KVM must only flush TLBs before dropping RCU (to avoid use-after-free of
the zapped page tables) and before dropping mmu_lock (to synchronize
with mmu_notifiers invalidating mappings).

In VMs backed with 2MiB pages, batching TLB flushes improves the time it
takes to zap collapsible SPTEs to disable dirty logging:

$ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

Before: Disabling dirty logging time: 14.334453428s (131072 flushes)
After: Disabling dirty logging time: 4.794969689s (76 flushes)

Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen
SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting
on the zapped mapping can now immediately install a new huge mapping and
proceed with guest execution.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level()

Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All
callers pass in PG_LEVEL_NUM, so @max_level can be replaced with
PG_LEVEL_NUM in the function body.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86: Don't emit TLB flushes when aging SPTEs for mmu_notifiers

Follow x86's primary MMU, which hasn't flushed TLBs when clearing Accessed
bits for 10+ years, and skip all TLB flushes when aging SPTEs in response
to a clear_flush_young() mmu_notifier event. As documented in x86's
ptep_clear_flush_young(), the probability and impact of "bad" reclaim due
to stale A-bit information is relatively low, whereas the performance cost
of TLB flushes is relatively high. I.e. the cost of flushing TLBs
outweighs the benefits.

On KVM x86, the cost of TLB flushes is even higher, as KVM doesn't batch
TLB flushes for mmu_notifier events (KVM's mmu_notifier contract with MM
makes it all but impossible), and sending IPIs forces all running vCPUs to
go through a VM-Exit => VM-Enter roundtrip.

Furthermore, MGLRU aging of secondary MMUs is expected to use flush-less
mmu_notifiers, i.e. flushing for the !MGLRU will make even less sense, and
will be actively confusing as it wouldn't be clear why KVM "needs" to
flush TLBs for legacy LRU aging, but not for MGLRU aging.

Cc: James Houghton <[email protected]>
Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: Allow arch code to elide TLB flushes when aging a young page

Add a Kconfig to allow architectures to opt-out of a TLB flush when a
young page is aged, as invalidating TLB entries is not functionally
required on most KVM-supported architectures. Stale TLB entries can
result in false negatives and theoretically lead to suboptimal reclaim,
but in practice all observations have been that the performance gained by
skipping TLB flushes outweighs any performance lost by reclaiming hot
pages.

E.g. the primary MMUs for x86 RISC-V, s390, and PPC Book3S elide the TLB
flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB
that's required for ordering (presumably because there are optimizations
related to eliding other TLB flushes when doing make-before-break).

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Set Dirty bit for new SPTEs, even if _hardware_ A/D bits are disabled

When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if
hardware A/D bits are disabled. Only EPT allows A/D bits to be disabled,
and for EPT, the bits are software-available (ignored by hardware) when
A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty
to track dirty pages in software.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Dedup logic for detecting TLB flushes on leaf SPTE changes

Now that the shadow MMU and TDP MMU have identical logic for detecting
required TLB flushes when updating SPTEs, move said logic to a helper so
that the TDP MMU code can benefit from the comments that are currently
exclusive to the shadow MMU.

No functional change intended.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found

Return immediately if a young SPTE is found when testing, but not updating,
SPTEs. The return value is a boolean, i.e. whether there is one young SPTE
or fifty is irrelevant (ignoring the fact that it's impossible for there to
be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU
roots).

Link: https://lore.kernel.org/r/[email protected]
[sean: use guard(rcu)(), as suggested by Paolo]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range

Skip invalid TDP MMU roots when aging a gfn range. There is zero reason
to process invalid roots, as they by definition hold stale information.
E.g. if a root is invalid because its from a previous memslot generation,
in the unlikely event the root has a SPTE for the gfn, then odds are good
that the gfn=>hva mapping is different, i.e. doesn't map to the hva that
is being aged by the primary MMU.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled

Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware,
i.e. propagate accessed information to SPTE.Accessed even when KVM is
doing manual tracking by making SPTEs not-present. In addition to
eliminating a small amount of code in is_accessed_spte(), this also paves
the way for preserving Accessed information when a SPTE is zapped in
response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped
because NUMA balancing kicks in.

Note, EPT is the only flavor of paging in which A/D bits are conditionally
enabled, and the Accessed (and Dirty) bit is software-available when A/D
bits are disabled.

Note #2, there are currently no concrete plans to preserve Accessed
information. Explorations on that front were the initial catalyst, but
the cleanup is the motivation for the actual commit.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled

Set shadow_dirty_mask to the architectural EPT Dirty bit value even if
A/D bits are disabled at the module level, i.e. even if KVM will never
enable A/D bits in hardware. Doing so provides consistent behavior for
Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets
shadow_accessed_mask but not shadow_dirty_mask.

Functionally, this should be one big nop, as consumption of
shadow_dirty_mask is always guarded by a check that hardware A/D bits are
enabled.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled

Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D
bits are enabled, set shadow_accessed_mask for EPT even when A/D bits
are disabled in hardware. This will allow using shadow_accessed_mask for
software purposes, e.g. to preserve accessed status in a non-present SPTE
acros NUMA balancing, if something like that is ever desirable.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled

Add a dedicated flag to track if KVM has enabled A/D bits at the module
level, instead of inferring the state based on whether or not the MMU's
shadow_accessed_mask is non-zero. This will allow defining and using
shadow_accessed_mask even when A/D bits aren't used by hardware.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable

Do a remote TLB flush if installing a leaf SPTE overwrites an existing
leaf SPTE (with the same target pfn, which is enforced by a BUG() in
handle_changed_spte()) and clears the MMU-Writable bit. Since the TDP MMU
passes ACC_ALL to make_spte(), i.e. always requests a Writable SPTE, the
only scenario in which make_spte() should create a !MMU-Writable SPTE is
if the gfn is write-tracked or if KVM is prefetching a SPTE.

When write-protecting for write-tracking, KVM must hold mmu_lock for write,
i.e. can't race with a vCPU faulting in the SPTE. And when prefetching a
SPTE, the TDP MMU takes care to avoid clobbering a shadow-present SPTE,
i.e. it should be impossible to replace a MMU-writable SPTE with a
!MMU-writable SPTE when handling a TDP MMU fault.

Cc: David Matlack <[email protected]>
Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Fold mmu_spte_update_no_track() into mmu_spte_update()

Fold the guts of mmu_spte_update_no_track() into mmu_spte_update() now
that the latter doesn't flush when clearing A/D bits, i.e. now that there
is no need to explicitly avoid TLB flushes when aging SPTEs.

Opportunistically WARN if mmu_spte_update() requests a TLB flush when
aging SPTEs, as aging should never modify a SPTE in such a way that KVM
thinks a TLB flush is needed.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Drop ignored return value from kvm_tdp_mmu_clear_dirty_slot()

Drop the return value from kvm_tdp_mmu_clear_dirty_slot() as its sole
caller ignores the result (KVM flushes after clearing dirty logs based on
the logs themselves, not based on SPTEs).

Cc: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Don't flush TLBs when clearing Dirty bit in shadow MMU

Don't force a TLB flush when an SPTE update in the shadow MMU happens to
clear the Dirty bit, as KVM unconditionally flushes TLBs when enabling
dirty logging, and when clearing dirty logs, KVM flushes based on its
software structures, not the SPTEs. I.e. the flows that care about
accurate Dirty bit information already ensure there are no stale TLB
entries.

Opportunistically drop is_dirty_spte() as mmu_spte_update() was the sole
caller.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit

Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as
access tracking tolerates false negatives, as evidenced by the
mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB
flush.

In practice, this is very nearly a nop.  spte_write_protect() and
spte_clear_dirty() never clear the Accessed bit.  make_spte() always
sets the Accessed bit for !prefetch scenarios.  FNAME(sync_spte) only sets
SPTE if the protection bits are changing, i.e. if a flush will be needed
regardless of the Accessed bits.  And FNAME(pte_prefetch) sets SPTE if and
only if the old SPTE is !PRESENT.

That leaves kvm_arch_async_page_ready() as the one path that will generate
a !ACCESSED SPTE *and* overwrite a PRESENT SPTE.  And that's very arguably
a bug, as clobbering a valid SPTE in that case is nonsensical.

Tested-by: Alex Bennée <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Fold all of make_spte()'s writable handling into one if-else

Now that make_spte() no longer uses a funky goto to bail out for a special
case of its unsync handling, combine all of the unsync vs. writable logic
into a single if-else statement.

No functional change intended.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Always set SPTE's dirty bit if it's created as writable

When creating a SPTE, always set the Dirty bit if the Writable bit is set,
i.e. if KVM is creating a writable mapping. If two (or more) vCPUs are
racing to install a writable SPTE on a !PRESENT fault, only the "winning"
vCPU will create a SPTE with W=1 and D=1, all "losers" will generate a
SPTE with W=1 && D=0.

As a result, tdp_mmu_map_handle_target_level() will fail to detect that
the losing faults are effectively spurious, and will overwrite the D=1
SPTE with a D=0 SPTE. For normal VMs, overwriting a present SPTE is a
small performance blip; KVM blasts a remote TLB flush, but otherwise life
goes on.

For upcoming TDX VMs, overwriting a present SPTE is much more costly, and
can even lead to the VM being terminated if KVM isn't careful, e.g. if KVM
attempts TDH.MEM.PAGE.AUG because the TDX code doesn't detect that the
new SPTE is actually the same as the old SPTE (which would be a bug in its
own right).

Suggested-by: Sagi Shahar <[email protected]>
Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Flush remote TLBs iff MMU-writable flag is cleared from RO SPTE

Don't force a remote TLB flush if KVM happens to effectively "refresh" a
read-only SPTE that is still MMU-Writable, as KVM allows MMU-Writable SPTEs
to have Writable TLB entries, even if the SPTE is !Writable. Remote TLBs
need to be flushed only when creating a read-only SPTE for write-tracking,
i.e. when installing a !MMU-Writable SPTE.

In practice, especially now that KVM doesn't overwrite existing SPTEs when
prefetching, KVM will rarely "refresh" a read-only, MMU-Writable SPTE,
i.e. this is unlikely to eliminate many, if any, TLB flushes. But, more
precisely flushing makes it easier to understand exactly when KVM does and
doesn't need to flush.

Note, x86 architecturally requires relevant TLB entries to be invalidated
on a page fault, i.e. there is no risk of putting a vCPU into an infinite
loop of read-only page faults.

Cc: Yan Zhao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

Merge branch 'kvm-no-struct-page' into HEAD

TL;DR: Eliminate KVM's long-standing (and heinous) behavior of essentially
guessing which pfns are refcounted pages (see kvm_pfn_to_refcounted_page()).

Getting there requires "fixing" arch code that isn't obviously broken.
Specifically, to get rid of kvm_pfn_to_refcounted_page(), KVM needs to
stop marking pages/folios dirty/accessed based solely on the pfn that's
stored in KVM's stage-2 page tables.

Instead of tracking which SPTEs correspond to refcounted pages, simply
remove all of the code that operates on "struct page" based ona the pfn
in stage-2 PTEs. This is the back ~40-50% of the series.

For x86 in particular, which sets accessed/dirty status when that info
would be "lost", e.g. when SPTEs are zapped or KVM clears the dirty flag
in a SPTE, foregoing the updates provides very measurable performance
improvements for related operations. E.g. when clearing dirty bits as
part of dirty logging, and zapping SPTEs to reconstitue huge pages when
disabling dirty logging.

The front ~40% of the series is cleanups and prep work, and most of it is
x86 focused (purely because x86 added the most special cases, *sigh*).
E.g. several of the inputs to hva_to_pfn() (and it's myriad wrappers),
can be removed by cleaning up and deduplicating x86 code.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page"

Now that KVM no longer relies on an ugly heuristic to find its struct page
references, i.e. now that KVM can't get false positives on VM_MIXEDMAP
pfns, remove KVM's hack to elevate the refcount for pfns that happen to
have a valid struct page. In addition to removing a long-standing wart
in KVM, this allows KVM to map non-refcounted struct page memory into the
guest, e.g. for exposing GPU TTM buffers to KVM guests.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Drop APIs that manipulate "struct page" via pfns

Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone.

No functional change intended.

Reviewed-by: Alex Bennée <[email protected]>
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: arm64: Don't mark "struct page" accessed when making SPTE young

Don't mark pages/folios as accessed in the primary MMU when making a SPTE
young in KVM's secondary MMU, as doing so relies on
kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and
wasteful. KVM participates in page aging via mmu_notifiers, so there's no
need to push "accessed" updates to the primary MMU.

Dropping use of kvm_set_pfn_accessed() also paves the way for removing
kvm_pfn_to_refcounted_page() and all its users.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs

Don't mark pages/folios as accessed in the primary MMU when zapping SPTEs,
as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking
is unnecessary and wasteful. KVM participates in page aging via
mmu_notifiers, so there's no need to push "accessed" updates to the
primary MMU.

And if KVM zaps a SPTe in response to an mmu_notifier, marking it accessed
_after_ the primary MMU has decided to zap the page is likely to go
unnoticed, i.e. odds are good that, if the page is being zapped for
reclaim, the page will be swapped out regardless of whether or not KVM
marks the page accessed.

Dropping x86's use of kvm_set_pfn_accessed() also paves the way for
removing kvm_pfn_to_refcounted_page() and all its users.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Make kvm_follow_pfn.refcounted_page a required field

Now that the legacy gfn_to_pfn() APIs are gone, and all callers of
hva_to_pfn() pass in a refcounted_page pointer, make it a required field
to ensure all future usage in KVM plays nice.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: s390: Use kvm_release_page_dirty() to unpin "struct page" memory

Use kvm_release_page_dirty() when unpinning guest pages, as the pfn was
retrieved via pin_guest_page(), i.e. is guaranteed to be backed by struct
page memory. This will allow dropping kvm_release_pfn_dirty() and
friends.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Drop gfn_to_pfn() APIs now that all users are gone

Drop gfn_to_pfn() and all its variants now that all users are gone.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Explicitly require struct page memory for Ultravisor sharing

Explicitly require "struct page" memory when sharing memory between
guest and host via an Ultravisor. Given the number of pfn_to_page()
calls in the code, it's safe to assume that KVM already requires that the
pfn returned by gfn_to_pfn() is backed by struct page, i.e. this is
likely a bug fix, not a reduction in KVM capabilities.

Switching to gfn_to_page() will eventually allow removing gfn_to_pfn()
and kvm_pfn_to_refcounted_page().

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: arm64: Use __gfn_to_page() when copying MTE tags to/from userspace

Use __gfn_to_page() instead when copying MTE tags between guest and
userspace. This will eventually allow removing gfn_to_pfn_prot(),
gfn_to_pfn(), kvm_pfn_to_refcounted_page(), and related APIs.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Add support for read-only usage of gfn_to_page()

Rework gfn_to_page() to support read-only accesses so that it can be used
by arm64 to get MTE tags out of guest memory.

Opportunistically rewrite the comment to be even more stern about using
gfn_to_page(), as there are very few scenarios where requiring a struct
page is actually the right thing to do (though there are such scenarios).
Add a FIXME to call out that KVM probably should be pinning pages, not
just getting pages.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Convert gfn_to_page() to use kvm_follow_pfn()

Convert gfn_to_page() to the new kvm_follow_pfn() internal API, which will
eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page().

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Use kvm_vcpu_map() to map guest memory to patch dcbz instructions

Use kvm_vcpu_map() when patching dcbz in guest memory, as a regular GUP
isn't technically sufficient when writing to data in the target pages.
As per Documentation/core-api/pin_user_pages.rst:

      Correct (uses FOLL_PIN calls):
          pin_user_pages()
          write to the data within the pages
          unpin_user_pages()

      INCORRECT (uses FOLL_GET calls):
          get_user_pages()
          write to the data within the pages
          put_page()

As a happy bonus, using kvm_vcpu_{,un}map() takes care of creating a
mapping and marking the page dirty.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Remove extra get_page() to fix page refcount leak

Don't manually do get_page() when patching dcbz, as gfn_to_page() gifts
the caller a reference. I.e. doing get_page() will leak the page due to
not putting all references.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: MIPS: Use kvm_faultin_pfn() to map pfns into the guest

Convert MIPS to kvm_faultin_pfn()+kvm_release_faultin_page(), which
are new APIs to consolidate arch code and provide consistent behavior
across all KVM architectures.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: MIPS: Mark "struct page" pfns accessed prior to dropping mmu_lock

Mark pages accessed before dropping mmu_lock when faulting in guest memory
so that MIPS can convert to kvm_release_faultin_page() without tripping
its lockdep assertion on mmu_lock being held.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: MIPS: Mark "struct page" pfns accessed only in "slow" page fault path

Mark pages accessed only in the slow page fault path in order to remove
an unnecessary user of kvm_pfn_to_refcounted_page(). Marking pages
accessed in the primary MMU during KVM page fault handling isn't harmful,
but it's largely pointless and likely a waste of a cycles since the
primary MMU will call into KVM via mmu_notifiers when aging pages. I.e.
KVM participates in a "pull" model, so there's no need to also "push"
updates.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: MIPS: Mark "struct page" pfns dirty only in "slow" page fault path

Mark pages/folios dirty only the slow page fault path, i.e. only when
mmu_lock is held and the operation is mmu_notifier-protected, as marking a
page/folio dirty after it has been written back can make some filesystems
unhappy (backing KVM guests will such filesystem files is uncommon, and
the race is minuscule, hence the lack of complaints).

See the link below for details.

Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: LoongArch: Use kvm_faultin_pfn() to map pfns into the guest

Convert LoongArch to kvm_faultin_pfn()+kvm_release_faultin_page(), which
are new APIs to consolidate arch code and provide consistent behavior
across all KVM architectures.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: LoongArch: Mark "struct page" pfn accessed before dropping mmu_lock

Mark pages accessed before dropping mmu_lock when faulting in guest memory
so that LoongArch can convert to kvm_release_faultin_page() without
tripping its lockdep assertion on mmu_lock being held.

Reviewed-by: Bibo Mao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: LoongArch: Mark "struct page" pfns accessed only in "slow" page fault path

Mark pages accessed only in the slow path, before dropping mmu_lock when
faulting in guest memory so that LoongArch can convert to
kvm_release_faultin_page() without tripping its lockdep assertion on
mmu_lock being held.

Reviewed-by: Bibo Mao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: LoongArch: Mark "struct page" pfns dirty only in "slow" page fault path

Mark pages/folios dirty only the slow page fault path, i.e. only when
mmu_lock is held and the operation is mmu_notifier-protected, as marking a
page/folio dirty after it has been written back can make some filesystems
unhappy (backing KVM guests will such filesystem files is uncommon, and
the race is minuscule, hence the lack of complaints).

See the link below for details.

Link: https://lore.kernel.org/all/[email protected]
Reviewed-by: Bibo Mao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Use kvm_faultin_pfn() to handle page faults on Book3s PR

Convert Book3S PR to __kvm_faultin_pfn()+kvm_release_faultin_page(), which
are new APIs to consolidate arch code and provide consistent behavior
across all KVM architectures.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Book3S: Mark "struct page" pfns dirty/accessed after installing PTE

Mark pages/folios dirty/accessed after installing a PTE, and more
specifically after acquiring mmu_lock and checking for an mmu_notifier
invalidation. Marking a page/folio dirty after it has been written back
can make some filesystems unhappy (backing KVM guests will such filesystem
files is uncommon, and the race is minuscule, hence the lack of complaints).
See the link below for details.

This will also allow converting Book3S to kvm_release_faultin_page(),
which requires that mmu_lock be held (for the aforementioned reason).

Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Drop unused @kvm_ro param from kvmppc_book3s_instantiate_page()

Drop @kvm_ro from kvmppc_book3s_instantiate_page() as it is now only
written, and never read.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Use __kvm_faultin_pfn() to handle page faults on Book3s Radix

Replace Book3s Radix's homebrewed (read: copy+pasted) fault-in logic with
__kvm_faultin_pfn(), which functionally does pretty much the exact same
thing.

Note, when the code was written, KVM indeed didn't do fast GUP without
"!atomic && !async", but that has long since changed (KVM tries fast GUP
for all writable mappings).

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: Use __kvm_faultin_pfn() to handle page faults on Book3s HV

Replace Book3s HV's homebrewed fault-in logic with __kvm_faultin_pfn(),
which functionally does pretty much the exact same thing.

Note, when the code was written, KVM indeed didn't do fast GUP without
"!atomic && !async", but that has long since changed (KVM tries fast GUP
for all writable mappings).

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: RISC-V: Use kvm_faultin_pfn() when mapping pfns into the guest

Convert RISC-V to __kvm_faultin_pfn()+kvm_release_faultin_page(), which
are new APIs to consolidate arch code and provide consistent behavior
across all KVM architectures.

Opportunisticaly fix a s/priort/prior typo in the related comment.

Reviewed-by: Andrew Jones <[email protected]>
Acked-by: Anup Patel <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: RISC-V: Mark "struct page" pfns accessed before dropping mmu_lock

Mark pages accessed before dropping mmu_lock when faulting in guest memory
so that RISC-V can convert to kvm_release_faultin_page() without tripping
its lockdep assertion on mmu_lock being held. Marking pages accessed
outside of mmu_lock is ok (not great, but safe), but marking pages _dirty_
outside of mmu_lock can make filesystems unhappy (see the link below).
Do both under mmu_lock to minimize the chances of doing the wrong thing in
the future.

Link: https://lore.kernel.org/all/[email protected]
Reviewed-by: Andrew Jones <[email protected]>
Acked-by: Anup Patel <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: RISC-V: Mark "struct page" pfns dirty iff a stage-2 PTE is installed

Don't mark pages dirty if KVM bails from the page fault handler without
installing a stage-2 mapping, i.e. if the page is guaranteed to not be
written by the guest.

In addition to being a (very) minor fix, this paves the way for converting
RISC-V to use kvm_release_faultin_page().

Reviewed-by: Andrew Jones <[email protected]>
Acked-by: Anup Patel <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: arm64: Use __kvm_faultin_pfn() to handle memory aborts

Convert arm64 to use __kvm_faultin_pfn()+kvm_release_faultin_page().
Three down, six to go.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: arm64: Mark "struct page" pfns accessed/dirty before dropping mmu_lock

Mark pages/folios accessed+dirty prior to dropping mmu_lock, as marking a
page/folio dirty after it has been written back can make some filesystems
unhappy (backing KVM guests will such filesystem files is uncommon, and
the race is minuscule, hence the lack of complaints).

While scary sounding, practically speaking the worst case scenario is that
KVM would trigger this WARN in filemap_unaccount_folio():

        /*
         * At this point folio must be either written or cleaned by
         * truncate.  Dirty folio here signals a bug and loss of
         * unwritten data - on ordinary filesystems.
         *
         * But it's harmless on in-memory filesystems like tmpfs; and can
         * occur when a driver which did get_user_pages() sets page dirty
         * before putting it, while the inode is being finally evicted.
         *
         * Below fixes dirty accounting after removing the folio entirely
         * but leaves the dirty flag set: it has no effect for truncated
         * folio and anyway will be cleared before returning folio to
         * buddy allocator.
         */
        if (WARN_ON_ONCE(folio_test_dirty(folio) &&
                         mapping_can_writeback(mapping)))
                folio_account_cleaned(folio, inode_to_wb(mapping->host));

KVM won't actually write memory because the stage-2 mappings are protected
by the mmu_notifier, i.e. there is no risk of loss of data, even if the
VM were backed by memory that needs writeback.

See the link below for additional details.

This will also allow converting arm64 to kvm_release_faultin_page(), which
requires that mmu_lock be held (for the aforementioned reason).

Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: e500: Use __kvm_faultin_pfn() to handle page faults

Convert PPC e500 to use __kvm_faultin_pfn()+kvm_release_faultin_page(),
and continue the inexorable march towards the demise of
kvm_pfn_to_refcounted_page().

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: e500: Mark "struct page" pfn accessed before dropping mmu_lock

Mark pages accessed before dropping mmu_lock when faulting in guest memory
so that shadow_map() can convert to kvm_release_faultin_page() without
tripping its lockdep assertion on mmu_lock being held. Marking pages
accessed outside of mmu_lock is ok (not great, but safe), but marking
pages _dirty_ outside of mmu_lock can make filesystems unhappy.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: PPC: e500: Mark "struct page" dirty in kvmppc_e500_shadow_map()

Mark the underlying page as dirty in kvmppc_e500_ref_setup()'s sole
caller, kvmppc_e500_shadow_map(), which will allow converting e500 to
__kvm_faultin_pfn() + kvm_release_faultin_page() without having to do
a weird dance between ref_setup() and shadow_map().

Opportunistically drop the redundant kvm_set_pfn_accessed(), as
shadow_map() puts the page via kvm_release_pfn_clean().

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: VMX: Use __kvm_faultin_page() to get APIC access page/pfn

Use __kvm_faultin_page() get the APIC access page so that KVM can
precisely release the refcounted page, i.e. to remove yet another user
of kvm_pfn_to_refcounted_page(). While the path isn't handling a guest
page fault, the semantics are effectively the same; KVM just happens to
be mapping the pfn into a VMCS field instead of a secondary MMU.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: VMX: Hold mmu_lock until page is released when updating APIC access page

Hold mmu_lock across kvm_release_pfn_clean() when refreshing the APIC
access page address to ensure that KVM doesn't mark a page/folio as
accessed after it has been unmapped. Practically speaking marking a folio
accesses is benign in this scenario, as KVM does hold a reference (it's
really just marking folios dirty that is problematic), but there's no
reason not to be paranoid (moving the APIC access page isn't a hot path),
and no reason to be different from other mmu_notifier-protected flows in
KVM.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Move x86's API to release a faultin page to common KVM

Move KVM x86's helper that "finishes" the faultin process to common KVM
so that the logic can be shared across all architectures. Note, not all
architectures implement a fast page fault path, but the gist of the
comment applies to all architectures.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Don't mark unused faultin pages as accessed

When finishing guest page faults, don't mark pages as accessed if KVM
is resuming the guest _without_ installing a mapping, i.e. if the page
isn't being used. While it's possible that marking the page accessed
could avoid minor thrashing due to reclaiming a page that the guest is
about to access, it's far more likely that the gfn=>pfn mapping was
was invalidated, e.g. due a memslot change, or because the corresponding
VMA is being modified.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns

Now that all x86 page fault paths precisely track refcounted pages, use
Use kvm_page_fault.refcounted_page to put references to struct page memory
when finishing page faults. This is a baby step towards eliminating
kvm_pfn_to_refcounted_page().

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn()

Provide the "struct page" associated with a guest_memfd pfn as an output
from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can
directly put the page instead of having to rely on
kvm_pfn_to_refcounted_page().

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: guest_memfd: Pass index, not gfn, to __kvm_gmem_get_pfn()

Refactor guest_memfd usage of __kvm_gmem_get_pfn() to pass the index into
the guest_memfd file instead of the gfn, i.e. resolve the index based on
the slot+gfn in the caller instead of in __kvm_gmem_get_pfn(). This will
allow kvm_gmem_get_pfn() to retrieve and return the specific "struct page",
which requires the index into the folio, without a redoing the index
calculation multiple times (which isn't costly, just hard to follow).

Opportunistically add a kvm_gmem_get_index() helper to make the copy+pasted
code easier to understand.

Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Convert page fault paths to kvm_faultin_pfn()

Convert KVM x86 to use the recently introduced __kvm_faultin_pfn().
Opportunstically capture the refcounted_page grabbed by KVM for use in
future changes.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Add kvm_faultin_pfn() to specifically service guest page faults

Add a new dedicated API, kvm_faultin_pfn(), for servicing guest page
faults, i.e. for getting pages/pfns that will be mapped into the guest via
an mmu_notifier-protected KVM MMU. Keep struct kvm_follow_pfn buried in
internal code, as having __kvm_faultin_pfn() take "out" params is actually
cleaner for several architectures, e.g. it allows the caller to have its
own "page fault" structure without having to marshal data to/from
kvm_follow_pfn.

Long term, common KVM would ideally provide a kvm_page_fault structure, a
la x86's struct of the same name. But all architectures need to be
converted to a common API before that can happen.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Move declarations of memslot accessors up in kvm_host.h

Move the memslot lookup helpers further up in kvm_host.h so that they can
be used by inlined "to pfn" wrappers.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Mark pages/folios dirty at the origin of make_spte()

Move the marking of folios dirty from make_spte() out to its callers,
which have access to the _struct page_, not just the underlying pfn.
Once all architectures follow suit, this will allow removing KVM's ugly
hack where KVM elevates the refcount of VM_MIXEDMAP pfns that happen to
be struct page memory.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Add helper to "finish" handling a guest page fault

Add a helper to finish/complete the handling of a guest page, e.g. to
mark the pages accessed and put any held references. In the near
future, this will allow improving the logic without having to copy+paste
changes into all page fault paths. And in the less near future, will
allow sharing the "finish" API across all architectures.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Add common helper to handle prefetching SPTEs

Deduplicate the prefetching code for indirect and direct MMUs. The core
logic is the same, the only difference is that indirect MMUs need to
prefetch SPTEs one-at-a-time, as contiguous guest virtual addresses aren't
guaranteed to yield contiguous guest physical addresses.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Put direct prefetched pages via kvm_release_page_clean()

Use kvm_release_page_clean() to put prefeteched pages instead of calling
put_page() directly. This will allow de-duplicating the prefetch code
between indirect and direct MMUs.

Note, there's a small functional change as kvm_release_page_clean() marks
the page/folio as accessed. While it's not strictly guaranteed that the
guest will access the page, KVM won't intercept guest accesses, i.e. won't
mark the page accessed if it _is_ accessed by the guest (unless A/D bits
are disabled, but running without A/D bits is effectively limited to
pre-HSW Intel CPUs).

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Add "mmu" prefix fault-in helpers to free up generic names

Prefix x86's faultin_pfn helpers with "mmu" so that the mmu-less names can
be used by common KVM for similar APIs.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86: Don't fault-in APIC access page during initial allocation

Drop the gfn_to_page() lookup when installing KVM's internal memslot for
the APIC access page, as KVM doesn't need to immediately fault-in the page
now that the page isn't pinned. In the extremely unlikely event the
kernel can't allocate a 4KiB page, KVM can just as easily return -EFAULT
on the future page fault.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Disallow direct access (w/o mmu_notifier) to unpinned pfn by default

Add an off-by-default module param to control whether or not KVM is allowed
to map memory that isn't pinned, i.e. that KVM can't guarantee won't be
freed while it is mapped into KVM and/or the guest. Don't remove the
functionality entirely, as there are use cases where mapping unpinned
memory is safe (as defined by the platform owner), e.g. when memory is
hidden from the kernel and managed by userspace, in which case userspace
is already fully trusted to not muck with guest memory mappings.

But for more typical setups, mapping unpinned memory is wildly unsafe, and
unnecessary. The APIs are used exclusively by x86's nested virtualization
support, and there is no known (or sane) use case for mapping PFN-mapped
memory a KVM guest _and_ letting the guest use it for virtualization
structures.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Get writable mapping for __kvm_vcpu_map() only when necessary

When creating a memory map for read, don't request a writable pfn from the
primary MMU. While creating read-only mappings can be theoretically slower,
as they don't play nice with fast GUP due to the need to break CoW before
mapping the underlying PFN, practically speaking, creating a mapping isn't
a super hot path, and getting a writable mapping for reading is weird and
confusing.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Pass in write/dirty to kvm_vcpu_map(), not kvm_vcpu_unmap()

Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them
pass "true" as a @writable param to kvm_vcpu_map(), and thus create a
read-only mapping when possible.

Note, creating read-only mappings can be theoretically slower, as they
don't play nice with fast GUP due to the need to break CoW before mapping
the underlying PFN. But practically speaking, creating a mapping isn't
a super hot path, and getting a writable mapping for reading is weird and
confusing.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: nVMX: Mark vmcs12's APIC access page dirty when unmapping

Mark the APIC access page as dirty when unmapping it from KVM. The fact
that the page _shouldn't_ be written doesn't guarantee the page _won't_ be
written. And while the contents are likely irrelevant, the values _are_
visible to the guest, i.e. dropping writes would be visible to the guest
(though obviously highly unlikely to be problematic in practice).

Marking the map dirty will allow specifying the write vs. read-only when
*mapping* the memory, which in turn will allow creating read-only maps.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Pin (as in FOLL_PIN) pages during kvm_vcpu_map()

Pin, as in FOLL_PIN, pages when mapping them for direct access by KVM.
As per Documentation/core-api/pin_user_pages.rst, writing to a page that
was gotten via FOLL_GET is explicitly disallowed.

  Correct (uses FOLL_PIN calls):
      pin_user_pages()
      write to the data within the pages
      unpin_user_pages()

  INCORRECT (uses FOLL_GET calls):
      get_user_pages()
      write to the data within the pages
      put_page()

Unfortunately, FOLL_PIN is a "private" flag, and so kvm_follow_pfn must
use a one-off bool instead of being able to piggyback the "flags" field.

Link: https://lwn.net/Articles/930667
Link: https://lore.kernel.org/all/[email protected]
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn()

Migrate kvm_vcpu_map() to kvm_follow_pfn(), and have it track whether or
not the map holds a refcounted struct page. Precisely tracking struct
page references will eventually allow removing kvm_pfn_to_refcounted_page()
and its various wrappers.

Signed-off-by: David Stevens <[email protected]>
[sean: use a pointer instead of a boolean]
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: pfncache: Precisely track refcounted pages

Track refcounted struct page memory using kvm_follow_pfn.refcounted_page
instead of relying on kvm_release_pfn_clean() to correctly detect that the
pfn is associated with a struct page.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Move kvm_{set,release}_page_{clean,dirty}() helpers up in kvm_main.c

Hoist the kvm_{set,release}_page_{clean,dirty}() APIs further up in
kvm_main.c so that they can be used by the kvm_follow_pfn family of APIs.

No functional change intended.

Reviewed-by: Alex Bennée <[email protected]>
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Provide refcounted page as output field in struct kvm_follow_pfn

Add kvm_follow_pfn.refcounted_page as an output for the "to pfn" APIs to
"return" the struct page that is associated with the returned pfn (if KVM
acquired a reference to the page). This will eventually allow removing
KVM's hacky kvm_pfn_to_refcounted_page() code, which is error prone and
can't detect pfns that are valid, but aren't (currently) refcounted.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Use plain "struct page" pointer instead of single-entry array

Use a single pointer instead of a single-entry array for the struct page
pointer in hva_to_pfn_fast(). Using an array makes the code unnecessarily
annoying to read and update.

No functional change intended.

Reviewed-by: Alex Bennée <[email protected]>
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: nVMX: Add helper to put (unmap) vmcs12 pages

Add a helper to dedup unmapping the vmcs12 pages. This will reduce the
amount of churn when a future patch refactors the kvm_vcpu_unmap() API.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: nVMX: Drop pointless msr_bitmap_map field from struct nested_vmx

Remove vcpu_vmx.msr_bitmap_map and instead use an on-stack structure in
the one function that uses the map, nested_vmx_prepare_msr_bitmap().

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: nVMX: Rely on kvm_vcpu_unmap() to track validity of eVMCS mapping

Remove the explicit evmptr12 validity check when deciding whether or not
to unmap the eVMCS pointer, and instead rely on kvm_vcpu_unmap() to play
nice with a NULL map->hva, i.e. to do nothing if the map is invalid.

Note, vmx->nested.hv_evmcs_map is zero-allocated along with the rest of
vcpu_vmx, i.e. the map starts out invalid/NULL.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Use NULL for struct page pointer to indicate mremapped memory

Drop yet another unnecessary magic page value from KVM, as there's zero
reason to use a poisoned pointer to indicate "no page". If KVM uses a
NULL page pointer, the kernel will explode just as quickly as if KVM uses
a poisoned pointer. Never mind the fact that such usage would be a
blatant and egregious KVM bug.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Explicitly initialize all fields at the start of kvm_vcpu_map()

Explicitly initialize the entire kvm_host_map structure when mapping a
pfn, as some callers declare their struct on the stack, i.e. don't
zero-initialize the struct, which makes the map->hva in kvm_vcpu_unmap()
*very* suspect.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Remove pointless sanity check on @map param to kvm_vcpu_(un)map()

Drop kvm_vcpu_{,un}map()'s useless checks on @map being non-NULL. The map
is 100% kernel controlled, any caller that passes a NULL pointer is broken
and needs to be fixed, i.e. a crash due to a NULL pointer dereference is
desirable (though obviously not as desirable as not having a bug in the
first place).

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Introduce kvm_follow_pfn() to eventually replace "gfn_to_pfn" APIs

Introduce kvm_follow_pfn() to eventually supplant the various "gfn_to_pfn"
APIs, albeit by adding more wrappers.  The primary motivation of the new
helper is to pass a structure instead of an ever changing set of parameters,
e.g. so that tweaking the behavior, inputs, and/or outputs of the "to pfn"
helpers doesn't require churning half of KVM.

In the more distant future, the APIs exposed to arch code could also
follow suit, e.g. by adding something akin to x86's "struct kvm_page_fault"
when faulting in guest memory.  But for now, the goal is purely to clean
up KVM's "internal" MMU code.

As part of the conversion, replace the write_fault, interruptible, and
no-wait boolean flags with FOLL_WRITE, FOLL_INTERRUPTIBLE, and FOLL_NOWAIT
respectively.  Collecting the various FOLL_* flags into a single field
will again ease the pain of passing new flags.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: David Stevens <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()

Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL.

No functional change intended.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Drop kvm_page_fault.hva, i.e. don't track intermediate hva

Remove kvm_page_fault.hva as it is never read, only written. This will
allow removing the @hva param from __gfn_to_pfn_memslot().

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Replace "async" pointer in gfn=>pfn with "no_wait" and error code

Add a pfn error code to communicate that hva_to_pfn() failed because I/O
was needed and disallowed, and convert @async to a constant @no_wait
boolean. This will allow eliminating the @no_wait param by having callers
pass in FOLL_NOWAIT along with other FOLL_* flags.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: David Stevens <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Drop extra GUP (via check_user_page_hwpoison()) to detect poisoned page

Remove check_user_page_hwpoison() as it's effectively dead code.  Prior to
commit 234b239bea39 ("kvm: Faults which trigger IO release the mmap_sem"),
hva_to_pfn_slow() wasn't actually a slow path in all cases, i.e. would do
get_user_pages_fast() without ever doing slow GUP with FOLL_HWPOISON.

Now that hva_to_pfn_slow() is a straight shot to get_user_pages_unlocked(),
and unconditionally passes FOLL_HWPOISON, it is impossible for hva_to_pfn()
to get an -errno that needs to be morphed to -EHWPOISON.

There are essentially four cases in KVM:

  - npages == 0, then FOLL_NOWAIT, a.k.a. @async, must be true, and thus
    check_user_page_hwpoison() will not be called
  - npages == 1 || npages == -EHWPOISON, all good
  - npages == -EINTR || npages == -EAGAIN, bail early, all good
  - everything else, including -EFAULT, can go down the vma_lookup() path,
    as npages < 0 means KVM went through hva_to_pfn_slow() which passes
    FOLL_HWPOISON

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Return ERR_SIGPENDING from hva_to_pfn() if GUP returns -EGAIN

Treat an -EAGAIN return from GUP the same as -EINTR and immediately report
to the caller that a signal is pending. GUP only returns -EAGAIN if
the _initial_ mmap_read_lock_killable() fails, which in turn onnly fails
if a signal is pending

Note, rwsem_down_read_slowpath() actually returns -EINTR, so GUP is really
just making life harder than it needs to be. And the call to
mmap_read_lock_killable() in the retry path returns its -errno verbatim,
i.e. GUP (and thus KVM) is already handling locking failure this way, but
only some of the time.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Annotate that all paths in hva_to_pfn() might sleep

Now that hva_to_pfn() no longer supports being called in atomic context,
move the might_sleep() annotation from hva_to_pfn_slow() to hva_to_pfn().

Reviewed-by: Alex Bennée <[email protected]>
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIs

Drop @atomic from the myriad "to_pfn" APIs now that all callers pass
"false", and remove a comment blurb about KVM running only the "GUP fast"
part in atomic context.

No functional change intended.

Reviewed-by: Alex Bennée <[email protected]>
Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: Rename gfn_to_page_many_atomic() to kvm_prefetch_pages()

Rename gfn_to_page_many_atomic() to kvm_prefetch_pages() to try and
communicate its true purpose, as the "atomic" aspect is essentially a
side effect of the fact that x86 uses the API while holding mmu_lock.
E.g. even if mmu_lock weren't held, KVM wouldn't want to fault-in pages,
as the goal is to opportunistically grab surrounding pages that have
already been accessed and/or dirtied by the host, and to do so quickly.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>

KVM: x86/mmu: Use gfn_to_page_many_atomic() when prefetching indirect PTEs

Use gfn_to_page_many_atomic() instead of gfn_to_pfn_memslot_atomic() when
prefetching indirect PTEs (direct_pte_prefetch_many() already uses the
"to page" APIS). Functionally, the two are subtly equivalent, as the "to
pfn" API short-circuits hva_to_pfn() if hva_to_pfn_fast() fails, i.e. is
just a wrapper for get_user_page_fast_only()/get_user_pages_fast_only().

Switching to the "to page" API will allow dropping the @atomic parameter
from the entire hva_to_pfn() callchain.

Tested-by: Alex Bennée <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <20241010182427.1434605 [email protected]>