Linus Torvalds [Tue, 28 Jan 2020 22:53:31 +0000 (14:53 -0800)]
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features, highlights:
- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative
- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)
- device stats reset is logged with process name and PID that called
the ioctl
Fixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM
- two more corner cases for metadata_uuid change after power loss
during the change
- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing
- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents
- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items
- new self test for physical -> logical mapping code, used for super
block range exclusion
- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
Linus Torvalds [Tue, 28 Jan 2020 21:06:05 +0000 (13:06 -0800)]
Merge branch 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mtrr updates from Ingo Molnar:
"Two changes: restrict /proc/mtrr to CAP_SYS_ADMIN, plus a cleanup"
* 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mtrr: Require CAP_SYS_ADMIN for all access
x86/mtrr: Get rid of mtrr_seq_show() forward declaration
Linus Torvalds [Tue, 28 Jan 2020 21:04:38 +0000 (13:04 -0800)]
Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
"Three changes: fix a race that can result in FPU corruption, plus two
cleanups"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Deactivate FPU state after failure during state load
x86/fpu/xstate: Make xfeature_is_supervisor()/xfeature_is_user() return bool
x86/fpu/xstate: Fix small issues
Linus Torvalds [Tue, 28 Jan 2020 20:46:42 +0000 (12:46 -0800)]
Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu-features updates from Ingo Molnar:
"The biggest change in this cycle was a large series from Sean
Christopherson to clean up the handling of VMX features. This both
fixes bugs/inconsistencies and makes the code more coherent and
future-proof.
There are also two cleanups and a minor TSX syslog messages
enhancement"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/cpu: Remove redundant cpu_detect_cache_sizes() call
x86/cpu: Print "VMX disabled" error message iff KVM is enabled
KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin CPUs
perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
KVM: VMX: Check for full VMX support when verifying CPU compatibility
KVM: VMX: Use VMX feature flag to query BIOS enabling
KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR
x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured
x86/cpu: Set synthetic VMX cpufeatures during init_ia32_feat_ctl()
x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
x86/vmx: Introduce VMX_FEATURES_*
x86/cpu: Clear VMX feature flag if VMX is not fully enabled
x86/zhaoxin: Use common IA32_FEAT_CTL MSR initialization
x86/centaur: Use common IA32_FEAT_CTL MSR initialization
x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked
x86/intel: Initialize IA32_FEAT_CTL MSR at boot
tools/x86: Sync msr-index.h from kernel sources
selftests, kvm: Replace manual MSR defs with common msr-index.h
...
Sven Schnelle [Tue, 28 Jan 2020 08:30:29 +0000 (09:30 +0100)]
selftests/ftrace: fix glob selftest
test.d/ftrace/func-filter-glob.tc is failing on s390 because it has
ARCH_INLINE_SPIN_LOCK and friends set to 'y'. So the usual
__raw_spin_lock symbol isn't in the ftrace function list. Change
'*aw*lock' to '*spin*lock' which would hopefully match some of the
locking functions on all platforms.
Linus Torvalds [Tue, 28 Jan 2020 20:28:06 +0000 (12:28 -0800)]
Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Ingo Molnar:
"Misc changes:
- Enhance #GP fault printouts by distinguishing between canonical and
non-canonical address faults, and also add KASAN fault decoding.
- Fix/enhance the x86 NMI handler by putting the duration check into
a direct function call instead of an irq_work which we know to be
broken in some cases.
- Clean up do_general_protection() a bit"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Remove irq_work from the long duration NMI handler
x86/traps: Cleanup do_general_protection()
x86/kasan: Print original address on #GP
x86/dumpstack: Introduce die_addr() for die() with #GP fault address
x86/traps: Print address on #GP
x86/insn-eval: Add support for 64-bit kernel mode
Linus Torvalds [Tue, 28 Jan 2020 20:11:23 +0000 (12:11 -0800)]
Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups all around the map"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Remove amd_get_topology_early()
x86/tsc: Remove redundant assignment
x86/crash: Use resource_size()
x86/cpu: Add a missing prototype for arch_smt_update()
x86/nospec: Remove unused RSB_FILL_LOOPS
x86/vdso: Provide missing include file
x86/Kconfig: Correct spelling and punctuation
Documentation/x86/boot: Fix typo
x86/boot: Fix a comment's incorrect file reference
x86/process: Remove set but not used variables prev and next
x86/Kconfig: Fix Kconfig indentation
Linus Torvalds [Tue, 28 Jan 2020 20:00:29 +0000 (12:00 -0800)]
Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Ingo Molnar:
"The main change in this tree is the extension of the resctrl procfs
ABI with a new file that helps tooling to navigate from tasks back to
resctrl groups: /proc/{pid}/cpu_resctrl_groups.
Also fix static key usage for certain feature combinations and
simplify the task exit resctrl case"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Add task resctrl information display
x86/resctrl: Check monitoring static key in the MBM overflow handler
x86/resctrl: Do not reconfigure exiting tasks
Linus Torvalds [Tue, 28 Jan 2020 19:54:05 +0000 (11:54 -0800)]
Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot update from Ingo Molnar:
"Two minor changes: fix an atypical binutils combination build bug, and
also fix a VRAM size check for simplefb"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sysfb: Fix check for bad VRAM size
x86/boot: Discard .eh_frame sections
Linus Torvalds [Tue, 28 Jan 2020 19:08:13 +0000 (11:08 -0800)]
Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"Misc updates:
- Remove last remaining calls to exception_enter/exception_exit() and
simplify the entry code some more.
- Remove force_iret()
- Add support for "Fast Short Rep Mov", which is available starting
with Ice Lake Intel CPUs - and make the x86 assembly version of
memmove() use REP MOV for all sizes when FSRM is available.
- Micro-optimize/simplify the 32-bit boot code a bit.
- Use a more future-proof SYSRET instruction mnemonic"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Simplify calculation of output address
x86/entry/64: Add instruction suffix to SYSRET
x86: Remove force_iret()
x86/cpufeatures: Add support for fast short REP; MOVSB
x86/context-tracking: Remove exception_enter/exit() from KVM_PV_REASON_PAGE_NOT_PRESENT async page fault
x86/context-tracking: Remove exception_enter/exit() from do_page_fault()
Linus Torvalds [Tue, 28 Jan 2020 18:07:09 +0000 (10:07 -0800)]
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
Linus Torvalds [Tue, 28 Jan 2020 17:44:15 +0000 (09:44 -0800)]
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Ftrace is one of the last W^X violators (after this only KLP is
left). These patches move it over to the generic text_poke()
interface and thereby get rid of this oddity. This requires a
surprising amount of surgery, by Peter Zijlstra.
- x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
count certain types of events that have a special, quirky hw ABI
(by Kim Phillips)
- kprobes fixes by Masami Hiramatsu
Lots of tooling updates as well, the following subcommands were
updated: annotate/report/top, c2c, clang, record, report/top TUI,
sched timehist, tests; plus updates were done to the gtk ui, libperf,
headers and the parser"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
perf/x86/amd: Add support for Large Increment per Cycle Events
perf/x86/amd: Constrain Large Increment per Cycle events
perf/x86/intel/rapl: Add Comet Lake support
tracing: Initialize ret in syscall_enter_define_fields()
perf header: Use last modification time for timestamp
perf c2c: Fix return type for histogram sorting comparision functions
perf beauty sockaddr: Fix augmented syscall format warning
perf/ui/gtk: Fix gtk2 build
perf ui gtk: Add missing zalloc object
perf tools: Use %define api.pure full instead of %pure-parser
libperf: Setup initial evlist::all_cpus value
perf report: Fix no libunwind compiled warning break s390 issue
perf tools: Support --prefix/--prefix-strip
perf report: Clarify in help that --children is default
tools build: Fix test-clang.cpp with Clang 8+
perf clang: Fix build with Clang 9
kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
tools lib: Fix builds when glibc contains strlcpy()
perf report/top: Make 'e' visible in the help and make it toggle showing callchains
perf report/top: Do not offer annotation for symbols without samples
...
Linus Torvalds [Tue, 28 Jan 2020 17:33:25 +0000 (09:33 -0800)]
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Just a handful of changes in this cycle: an ARM64 performance
optimization, a comment fix and a debug output fix"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/osq: Use optimized spinning loop for arm64
locking/qspinlock: Fix inaccessible URL of MCS lock paper
locking/lockdep: Fix lockdep_stats indentation problem
Linus Torvalds [Tue, 28 Jan 2020 17:03:40 +0000 (09:03 -0800)]
Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Cleanup of the GOP [graphics output] handling code in the EFI stub
- Complete refactoring of the mixed mode handling in the x86 EFI stub
- Overhaul of the x86 EFI boot/runtime code
- Increase robustness for mixed mode code
- Add the ability to disable DMA at the root port level in the EFI
stub
- Get rid of RWX mappings in the EFI memory map and page tables,
where possible
- Move the support code for the old EFI memory mapping style into its
only user, the SGI UV1+ support code.
- plus misc fixes, updates, smaller cleanups.
... and due to interactions with the RWX changes, another round of PAT
cleanups make a guest appearance via the EFI tree - with no side
effects intended"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
efi/x86: Disable instrumentation in the EFI runtime handling code
efi/libstub/x86: Fix EFI server boot failure
efi/x86: Disallow efi=old_map in mixed mode
x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
efi/x86: avoid KASAN false positives when accessing the 1: 1 mapping
efi: Fix handling of multiple efi_fake_mem= entries
efi: Fix efi_memmap_alloc() leaks
efi: Add tracking for dynamically allocated memmaps
efi: Add a flags parameter to efi_memory_map
efi: Fix comment for efi_mem_type() wrt absent physical addresses
efi/arm: Defer probe of PCIe backed efifb on DT systems
efi/x86: Limit EFI old memory map to SGI UV machines
efi/x86: Avoid RWX mappings for all of DRAM
efi/x86: Don't map the entire kernel text RW for mixed mode
x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
efi/libstub/x86: Fix unused-variable warning
efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
efi/libstub/x86: Use const attribute for efi_is_64bit()
efi: Allow disabling PCI busmastering on bridges during boot
efi/x86: Allow translating 64-bit arguments for mixed mode calls
...
Linus Torvalds [Tue, 28 Jan 2020 16:46:13 +0000 (08:46 -0800)]
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"The RCU changes in this cycle were:
- Expedited grace-period updates
- kfree_rcu() updates
- RCU list updates
- Preemptible RCU updates
- Torture-test updates
- Miscellaneous fixes
- Documentation updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
rcu: Remove unused stop-machine #include
powerpc: Remove comment about read_barrier_depends()
.mailmap: Add entries for old [email protected] addresses
srcu: Apply *_ONCE() to ->srcu_last_gp_end
rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()
rcu: Move rcu_{expedited,normal} definitions into rcupdate.h
rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.h
rcu: Remove the declaration of call_rcu() in tree.h
rcu: Fix tracepoint tracking RCU CPU kthread utilization
rcu: Fix harmless omission of "CONFIG_" from #if condition
rcu: Avoid tick_dep_set_cpu() misordering
rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
rcu: Clear ->rcu_read_unlock_special only once
rcu: Clear .exp_hint only when deferred quiescent state has been reported
rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU
rcu: Remove kfree_call_rcu_nobatch()
rcu: Remove kfree_rcu() special casing and lazy-callback handling
rcu: Add support for debug_objects debugging for kfree_rcu()
rcu: Add multiple in-flight batches of kfree_rcu() work
...
Linus Torvalds [Tue, 28 Jan 2020 16:38:25 +0000 (08:38 -0800)]
Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"The main changes are to move the ORC unwind table sorting from early
init to build-time - this speeds up booting.
No change in functionality intended"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix !CONFIG_MODULES build warning
x86/unwind/orc: Remove boot-time ORC unwind tables sorting
scripts/sorttable: Implement build-time ORC unwind table sorting
scripts/sorttable: Rename 'sortextable' to 'sorttable'
scripts/sortextable: Refactor the do_func() function
scripts/sortextable: Remove dead code
scripts/sortextable: Clean up the code to meet the kernel coding style better
scripts/sortextable: Rewrite error/success handling
This commit didn't work for properties such as 'interrupt-map' that have
phandle in the middle of an entry. It would also not work for a 0 or -1
phandle value that acts as a NULL.
Linus Torvalds [Tue, 28 Jan 2020 16:20:54 +0000 (08:20 -0800)]
Merge branch 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull header cleanup from Ingo Molnar:
"This is a treewide cleanup, mostly (but not exclusively) with x86
impact, which breaks implicit dependencies on the asm/realtime.h
header and finally removes it from asm/acpi.h"
* 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ACPI/sleep: Move acpi_get_wakeup_address() into sleep.c, remove <asm/realmode.h> from <asm/acpi.h>
ACPI/sleep: Convert acpi_wakeup_address into a function
x86/ACPI/sleep: Remove an unnecessary include of asm/realmode.h
ASoC: Intel: Skylake: Explicitly include linux/io.h for virt_to_phys()
vmw_balloon: Explicitly include linux/io.h for virt_to_phys()
virt: vbox: Explicitly include linux/io.h to pick up various defs
efi/capsule-loader: Explicitly include linux/io.h for page_to_phys()
perf/x86/intel: Explicitly include asm/io.h to use virt_to_phys()
x86/kprobes: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
x86/ftrace: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
x86/boot: Explicitly include realmode.h to handle RM reservations
x86/efi: Explicitly include realmode.h to handle RM trampoline quirk
x86/platform/intel/quark: Explicitly include linux/io.h for virt_to_phys()
x86/setup: Enhance the comments
x86/setup: Clean up the header portion of setup.c
Michael Ellerman [Sun, 26 Jan 2020 11:52:47 +0000 (22:52 +1100)]
of: Add OF_DMA_DEFAULT_COHERENT & select it on powerpc
There's an OF helper called of_dma_is_coherent(), which checks if a
device has a "dma-coherent" property to see if the device is coherent
for DMA.
But on some platforms devices are coherent by default, and on some
platforms it's not possible to update existing device trees to add the
"dma-coherent" property.
So add a Kconfig symbol to allow arch code to tell
of_dma_is_coherent() that devices are coherent by default, regardless
of the presence of the property.
Select that symbol on powerpc when NOT_COHERENT_CACHE is not set, ie.
when the system has a coherent cache.
Alexandru Elisei [Mon, 27 Jan 2020 10:36:52 +0000 (10:36 +0000)]
KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer
According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
RES0 [1]. When reading the register, the value is truncated to the least
significant 32 bits [2], and on writes, TimerValue is treated as a signed
32-bit integer [1, 2].
When the guest behaves correctly and writes 32-bit values, treating TVAL
as an unsigned 64 bit register works as expected. However, things start
to break down when the guest writes larger values, because
(u64)0x1_ffff_ffff = 8589934591. but (s32)0x1_ffff_ffff = -1, and the
former will cause the timer interrupt to be asserted in the future, but
the latter will cause it to be asserted now. Let's treat TVAL as a
signed 32-bit register on writes, to match the behaviour described in
the architecture, and the behaviour experimentally exhibited by the
virtual timer on a non-vhe host.
[1] Arm DDI 0487E.a, section D13.8.18
[2] Arm DDI 0487E.a, section D11.2.4
Eric Auger [Fri, 24 Jan 2020 14:25:35 +0000 (15:25 +0100)]
KVM: arm64: pmu: Only handle supported event counters
Let the code never use unsupported event counters. Change
kvm_pmu_handle_pmcr() to only reset supported counters and
kvm_pmu_vcpu_reset() to only stop supported counters.
Other actions are filtered on the supported counters in
kvm/sysregs.c
Willem de Bruijn [Mon, 27 Jan 2020 20:40:31 +0000 (15:40 -0500)]
udp: segment looped gso packets correctly
Multicast and broadcast packets can be looped from egress to ingress
pre segmentation with dev_loopback_xmit. That function unconditionally
sets ip_summed to CHECKSUM_UNNECESSARY.
udp_rcv_segment segments gso packets in the udp rx path. Segmentation
usually executes on egress, and does not expect packets of this type.
__udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
the offsets are not correct for gso_make_checksum.
UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
to the correct pseudo header checksum. Reset ip_summed to this type.
(CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)
Reported-by: syzbot <[email protected]> Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The old netem mailing list was inactive and recently was targeted by
spammers. Switch to just using netdev mailing list which is where all
the real change happens.
Mike Christie [Tue, 12 Nov 2019 00:19:00 +0000 (18:19 -0600)]
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
amd nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace deamons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.
In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want execute IO to the device.
Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory. Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.
This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to prevent from being throttled while
writing out data to free up memory.
Linus Torvalds [Tue, 28 Jan 2020 01:28:52 +0000 (17:28 -0800)]
Merge tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
"The performance deterioration departement provides a few non-scary
fixes and improvements:
- Update the cached HLE state when the TSX state is changed via the
new control register. This ensures feature bit consistency.
- Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS
vulnerabilities"
* tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/swapgs: Exclude Zhaoxin CPUs from SWAPGS vulnerability
x86/speculation/spectre_v2: Exclude Zhaoxin CPUs from SPECTRE_V2
x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
Linus Torvalds [Tue, 28 Jan 2020 01:22:21 +0000 (17:22 -0800)]
Merge tag 'irq-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"The interrupt departement provides:
- A mechanism to shield isolated tasks from managed interrupts:
The affinity of managed interrupts is completely controlled by the
kernel and user space has no influence on them. The reason is that
the automatically assigned affinity correlates to the multi-queue
CPU handling of block devices.
If the generated affinity mask spaws both housekeeping and isolated
CPUs the interrupt could be routed to an isolated CPU which would
then be disturbed by I/O submitted by a housekeeping CPU.
The new mechamism ensures that as long as one housekeeping CPU is
online in the assigned affinity mask the interrupt is routed to a
housekeeping CPU.
If there is no online housekeeping CPU in the affinity mask, then
the interrupt is routed to an isolated CPU to keep the device queue
intact, but unless the isolated CPU submits I/O by itself these
interrupts are not raised.
- A small addon to the device tree irqdomain core code to avoid
duplication in irq chip drivers
- Conversion of the SiFive PLIC to hierarchical domains
- The usual pile of new irq chip drivers: SiFive GPIO, Aspeed SCI,
NXP INTMUX, Meson A1 GPIO
- The first cut of support for the new ARM GICv4.1
- The usual pile of fixes and improvements in core and driver code"
* tag 'irq-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
genirq, sched/isolation: Isolate from handling managed interrupts
irqchip/gic-v4.1: Allow direct invalidation of VLPIs
irqchip/gic-v4.1: Suppress per-VLPI doorbell
irqchip/gic-v4.1: Add VPE INVALL callback
irqchip/gic-v4.1: Add VPE eviction callback
irqchip/gic-v4.1: Add VPE residency callback
irqchip/gic-v4.1: Add mask/unmask doorbell callbacks
irqchip/gic-v4.1: Plumb skeletal VPE irqchip
irqchip/gic-v4.1: Implement the v4.1 flavour of VMOVP
irqchip/gic-v4.1: Don't use the VPE proxy if RVPEID is set
irqchip/gic-v4.1: Implement the v4.1 flavour of VMAPP
irqchip/gic-v4.1: VPE table (aka GICR_VPROPBASER) allocation
irqchip/gic-v3: Add GICv4.1 VPEID size discovery
irqchip/gic-v3: Detect GICv4.1 supporting RVPEID
irqchip/gic-v3-its: Fix get_vlpi_map() breakage with doorbells
irqdomain: Fix a memory leak in irq_domain_push_irq()
irqchip: Add NXP INTMUX interrupt multiplexer support
dt-bindings: interrupt-controller: Add binding for NXP INTMUX interrupt multiplexer
irqchip: Define EXYNOS_IRQ_COMBINER
irqchip/meson-gpio: Add support for meson a1 SoCs
...
Linus Torvalds [Tue, 28 Jan 2020 01:04:51 +0000 (17:04 -0800)]
Merge tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core SMP updates from Thomas Gleixner:
"A small set of SMP core code changes:
- Rework the smp function call core code to avoid the allocation of
an additional cpumask
- Remove the not longer required GFP argument from on_each_cpu_cond()
and on_each_cpu_cond_mask() and fixup the callers"
* tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Remove allocation mask from on_each_cpu_cond.*()
smp: Add a smp_cond_func_t argument to smp_call_function_many()
smp: Use smp_cond_func_t as type for the conditional function
Linus Torvalds [Tue, 28 Jan 2020 00:47:05 +0000 (16:47 -0800)]
Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The timekeeping and timers departement provides:
- Time namespace support:
If a container migrates from one host to another then it expects
that clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime
these clocks can differ significantly on two hosts, in the worst
case time goes backwards which is a violation of the POSIX
requirements.
The time namespace addresses this problem. It allows to set offsets
for clock MONOTONIC and BOOTTIME once after creation and before
tasks are associated with the namespace. These offsets are taken
into account by timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided
by this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric
potential use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (ie where host time
offsets = 0) is in the noise and great effort was made to ensure
that especially in the VDSO. If time namespace is disabled in the
kernel configuration the code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
feature and kept on for more than a year addressing review
comments, finding better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure
that the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code"
* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
alarmtimer: Use wakeup source from alarmtimer platform device
alarmtimer: Make alarmtimer platform device child of RTC device
alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
hrtimer: Add missing sparse annotation for __run_timer()
lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
clocksource/drivers/exynos_mct: Rename Exynos to lowercase
clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
clocksource/drivers/bcm2835_timer: Fix memory leak of timer
clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
...
Linus Torvalds [Tue, 28 Jan 2020 00:42:11 +0000 (16:42 -0800)]
Merge tag 'core-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchdog updates from Thomas Gleixner:
"A set of watchdog/softlockup related improvements:
- Enforce that the watchdog timestamp is always valid on boot. The
original implementation caused a watchdog disabled gap of one
second in the boot process due to truncation of the underlying
sched clock.
The sched clock is divided by 1e9 to convert nanoseconds to
seconds. So for the first second of the boot process the result is
0 which is at the same time the indicator to disable the watchdog.
The trivial fix is to change the disabled indicator to ULONG_MAX.
- Two cleanup patches removing unused and redundant code which got
forgotten to be cleaned up in previous changes"
* tag 'core-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
watchdog/softlockup: Enforce that timestamp is valid on boot
watchdog/softlockup: Remove obsolete check of last reported task
watchdog: Remove soft_lockup_hrtimer_cnt and related code
Linus Torvalds [Tue, 28 Jan 2020 00:37:40 +0000 (16:37 -0800)]
Merge tag 'timers-urgent-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for the generic VDSO code which missed 5.5:
- Make the update to the coarse timekeeper unconditional.
This is required because the coarse timekeeper interfaces in the
VDSO do not depend on a VDSO capable clocksource. If the system
does not have a VDSO capable clocksource and the update is
depending on the VDSO capable clocksource, the coarse VDSO
interfaces would operate on stale data forever.
- Invert the logic of __arch_update_vdso_data() to avoid further head
scratching.
Tripped over this several times while analyzing the update problem
above"
* tag 'timers-urgent-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/vdso: Update coarse timekeeper unconditionally
lib/vdso: Make __arch_update_vdso_data() logic understandable
Linus Torvalds [Mon, 27 Jan 2020 23:38:15 +0000 (15:38 -0800)]
Merge tag 'selinux-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux update from Paul Moore:
"This is one of the bigger SELinux pull requests in recent years with
28 patches. Everything is passing our test suite and the highlights
are below:
- Mark CONFIG_SECURITY_SELINUX_DISABLE as deprecated. We're some time
away from actually attempting to remove this in the kernel, but the
only distro we know that still uses it (Fedora) is working on
moving away from this so we want to at least let people know we are
planning to remove it.
- Reorder the SELinux hooks to help prevent bad things when SELinux
is disabled at runtime. The proper fix is to remove the
CONFIG_SECURITY_SELINUX_DISABLE functionality (see above) and just
take care of it at boot time (e.g. "selinux=0").
- Add SELinux controls for the kernel lockdown functionality,
introducing a new SELinux class/permissions: "lockdown { integrity
confidentiality }".
- Add a SELinux control for move_mount(2) that reuses the "file {
mounton }" permission.
- Improvements to the SELinux security label data store lookup
functions to speed up translations between our internal label
representations and the visible string labels (both directions).
- Revisit a previous fix related to SELinux inode auditing and
permission caching and do it correctly this time.
- Fix the SELinux access decision cache to cleanup properly on error.
In some extreme cases this could limit the cache size and result in
a decrease in performance.
- Enable SELinux per-file labeling for binderfs.
- The SELinux initialized and disabled flags were wrapped with
accessors to ensure they are accessed correctly.
- Mark several key SELinux structures with __randomize_layout.
- Changes to the LSM build configuration to only build
security/lsm_audit.c when needed.
- Changes to the SELinux build configuration to only build the IB
object cache when CONFIG_SECURITY_INFINIBAND is enabled.
- Move a number of single-caller functions into their callers.
- A handful of cleanup patches that aren't worth mentioning on their
own, the individual descriptions have plenty of detail"
* tag 'selinux-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: (28 commits)
selinux: fix regression introduced by move_mount(2) syscall
selinux: do not allocate ancillary buffer on first load
selinux: remove redundant allocation and helper functions
selinux: remove redundant selinux_nlmsg_perm
selinux: fix wrong buffer types in policydb.c
selinux: reorder hooks to make runtime disable less broken
selinux: treat atomic flags more carefully
selinux: make default_noexec read-only after init
selinux: move ibpkeys code under CONFIG_SECURITY_INFINIBAND.
selinux: remove redundant msg_msg_alloc_security
Documentation,selinux: fix references to old selinuxfs mount point
selinux: deprecate disabling SELinux and runtime
selinux: allow per-file labelling for binderfs
selinuxfs: use scnprintf to get real length for inode
selinux: remove set but not used variable 'sidtab'
selinux: ensure the policy has been loaded before reading the sidtab stats
selinux: ensure we cleanup the internal AVC counters on error in avc_update()
selinux: randomize layout of key structures
selinux: clean up selinux_enabled/disabled/enforcing_boot
selinux: remove unnecessary selinux cred request
...
Linus Torvalds [Mon, 27 Jan 2020 23:35:50 +0000 (15:35 -0800)]
Merge tag 'audit-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit update from Paul Moore:
"One small audit patch for the Linux v5.6 merge window, and
unsurprisingly it passes our test suite with flying colors"
* tag 'audit-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Add __rcu annotation to RCU pointer
Linus Torvalds [Mon, 27 Jan 2020 23:18:25 +0000 (15:18 -0800)]
Merge branch 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup2 interface for hugetlb controller. I think this was the last
remaining bit which was missing from cgroup2
- fixes for race and a spurious warning in threaded cgroup handling
- other minor changes
* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
iocost: Fix iocost_monitor.py due to helper type mismatch
cgroup: Prevent double killing of css when enabling threaded cgroup
cgroup: fix function name in comment
mm: hugetlb controller for cgroups v2
Linus Torvalds [Mon, 27 Jan 2020 23:16:52 +0000 (15:16 -0800)]
Merge branch 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
"Just a couple tracepoint patches"
* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove workqueue_work event class
workqueue: add worker function to workqueue_execute_end tracepoint
Jens Axboe [Sat, 25 Jan 2020 06:08:54 +0000 (23:08 -0700)]
io_uring: don't attempt to copy iovec for READ/WRITE
For the non-vectored variant of READV/WRITEV, we don't need to setup an
async io context, and we flag that appropriately in the io_op_defs
array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c
we didn't have these opcodes, so the check there was added just for the
READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a
single check for needing async context, that covers all four of these
read/write variants that don't use an iovec.
scripts/find-unused-docs.sh invokes scripts/kernel-doc to find out if a
source file contains kerneldoc or not.
However, as it passes the no longer supported "-text" option to
scripts/kernel-doc, the latter prints out its help text, causing all
files to be considered containing kerneldoc.
Get rid of these false positives by removing the no longer supported
"-text" option from the scripts/kernel-doc invocation.
Linus Torvalds [Mon, 27 Jan 2020 21:03:00 +0000 (13:03 -0800)]
Merge tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap
Pull ioremap updates from Christoph Hellwig:
"Remove the ioremap_nocache API (plus wrappers) that are always
identical to ioremap"
* tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap:
remove ioremap_nocache and devm_ioremap_nocache
MIPS: define ioremap_nocache to ioremap
Linus Torvalds [Mon, 27 Jan 2020 20:55:48 +0000 (12:55 -0800)]
Merge tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Like the core side, not a lot of changes here, just two main items:
- Series of patches (via Coly) with fixes for bcache (Coly,
Christoph)
- MD pull request from Song"
* tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block: (31 commits)
bcache: reap from tail of c->btree_cache in bch_mca_scan()
bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
bcache: remove member accessed from struct btree
bcache: print written and keys in trace_bcache_btree_write
bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
bcache: add code comments for state->pool in __btree_sort()
lib: crc64: include <linux/crc64.h> for 'crc64_be'
bcache: use read_cache_page_gfp to read the superblock
bcache: store a pointer to the on-disk sb in the cache and cached_dev structures
bcache: return a pointer to the on-disk sb from read_super
bcache: transfer the sb_page reference to register_{bdev,cache}
bcache: fix use-after-free in register_bcache()
bcache: properly initialize 'path' and 'err' in register_bcache()
bcache: rework error unwinding in register_bcache
bcache: use a separate data structure for the on-disk super block
bcache: cached_dev_free needs to put the sb page
md/raid1: introduce wait_for_serialization
md/raid1: use bucket based mechanism for IO serialization
md: introduce a new struct for IO serialization
md: don't destroy serial_info_pool if serialize_policy is true
...
Linus Torvalds [Mon, 27 Jan 2020 20:38:25 +0000 (12:38 -0800)]
Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"This may be the most quiet round we've had in years. I'm not
complaining. Really not a lot to detail here, outside of spelling and
documentation improvements/fixes, we have:
- Allow t10-pi to be modular (Herbert)
- Remove dead code in bfq (Alex)
- Mark zone management requests with REQ_SYNC (Chaitanya)
- BFQ division improvement (Wen)
- Small series improving plugging (Pavel)"
* tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block:
partitions/ldm: fix spelling mistake "to" -> "too"
block, bfq: improve arithmetic division in bfq_delta()
block/bfq: remove unused bfq_class_rt which never used
block: mark zone-mgmt bios with REQ_SYNC
blk-mq: Document functions for sending request
block: Allow t10-pi to be modular
blk-mq: optimise blk_mq_flush_plug_list()
list: introduce list_for_each_continue()
blk-mq: optimise rq sort function
Linus Torvalds [Mon, 27 Jan 2020 20:04:48 +0000 (12:04 -0800)]
Merge tag 'pnp-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull PNP updates from Rafael Wysocki:
"Get rid of unused variable and function (yu kuai)"
* tag 'pnp-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PNP: isapnp: remove defined but not used function 'isapnp_checksum'
PNP: isapnp: remove set but not used variable 'checksum'
Linus Torvalds [Mon, 27 Jan 2020 19:57:40 +0000 (11:57 -0800)]
Merge tag 'devprop-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull device properties framework updates from Rafael Wysocki:
"Add support for reference properties in sofrware nodes (Dmitry
Torokhov) and a basic test for property entries along with fixes on
top of it (Dmitry Torokhov, Qian Cai, Alan Maguire)"
* tag 'devprop-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
software node: introduce CONFIG_KUNIT_DRIVER_PE_TEST
usb: dwc3: use proper initializers for property entries
drivers/base/test: fix global-out-of-bounds error
software node: add basic tests for property entries
software node: remove separate handling of references
platform/x86: intel_cht_int33fe: use inline reference properties
software node: implement reference properties
software node: allow embedding of small arrays into property_entry
software node: replace is_array with is_inline
Mike Snitzer [Mon, 27 Jan 2020 19:07:23 +0000 (14:07 -0500)]
dm: fix potential for q->make_request_fn NULL pointer
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).
Otherwise generic_make_request() will crash simply by doing:
dmsetup create -n test
mount /dev/dm-N /mnt
While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.
Linus Torvalds [Mon, 27 Jan 2020 19:48:47 +0000 (11:48 -0800)]
Merge tag 'acpi-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to the most recent upstream
revision (20200110), add new hardware support to a handful of ACPI
drivers, make the ACPI fan driver expose power states information for
fans, add some more quirks, fix bugs and clean up assorted things.
Specifics:
- Update the ACPICA code in the kernel to upstream revision 20200110
including:
- Update of copyright notices to 2020 (Bob Moore).
- Dispatcher fix to always generate buffer objects for the ASL
create_field() operator (Maximilian Luz).
- Debugger cleanup (Colin Ian King).
- Disassembler change to create buffer fields in
ACPI_PARSE_LOAD_PASS1 (Erik Kaneda).
- UNIX line ending support for non-windows builds in acpisrc (Erik
Kaneda).
- Update the list of ACPICA maintainers (Rafael Wysocki).
- Add Intel Tiger Lake ACPI device IDs to the ACPI DPTF, ACPI fan,
int340x_thermal and intel-hid drivers (Gayatri Kammela).
- Make the ACPI fan driver create additional sysfs attributes to
expose power states information for fans (Srinivas Pandruvada).
- Fix up the ACPI battery driver to deal with unexpected battery
capacity information in a better way (Hans de Goede).
- Add ACPI backlight quirks for Lenovo E41-25/45 and MSI MS-7721
boards (Aaron Ma, Hans de Goede).
- Add DMI quirk for Razer Blade Stealth 13 late 2019 lid switch to
the ACPI button driver (Jason Ekstrand).
- Drop TIMER_DEFERRABLE from the GHES polling mode timer function
flags to make it run precisely at the configured time (Bhaskar
Upadhaya).
- Fix race condition related to the reference counting of query
handlers in the ACPI EC driver (Rafael Wysocki).
- Fix ACPI tools build issue (Zhengyuan Liu).
- Replace dma_request_slave_channel() with dma_request_chan() in the
firmware guide documentation for ACPI (Peter Ujfalusi).
- Fix typo in a comment and clean up function parameter data type
inconsistencies (Kacper Piwiński, Tian Tao)"
* tag 'acpi-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
ACPICA: Update version to 20200110
ACPICA: All acpica: Update copyrights to 2020 Including tool signons.
apei/ghes: Do not delay GHES polling
ACPI: button: Add DMI quirk for Razer Blade Stealth 13 late 2019 lid switch
ACPI: PPTT: Consistently use unsigned int as parameter type
ACPI: EC: Reference count query handlers under lock
ACPICA: Update the list of maintainers
ACPICA: Update version to 20191213
ACPICA: Dispatcher: always generate buffer objects for ASL create_field() operator
ACPICA: acpisrc: add unix line ending support for non-windows build
ACPICA: Disassembler: create buffer fields in ACPI_PARSE_LOAD_PASS1
ACPICA: debugger: fix spelling mistake "adress" -> "address"
ACPI: video: Do not export a non working backlight interface on MSI MS-7721 boards
docs: firmware-guide: ACPI: Replace dma_request_slave_channel() with dma_request_chan()
thermal: int340x_thermal: Add Tiger Lake ACPI device IDs
platform/x86: intel-hid: Add Tiger Lake ACPI device ID
ACPI: fan: Add Tiger Lake ACPI device ID
ACPI: DPTF: Add Tiger Lake ACPI device IDs
ACPI: fan: Expose fan performance state information
tools/power/acpi: fix compilation error
...
Linus Torvalds [Mon, 27 Jan 2020 19:23:54 +0000 (11:23 -0800)]
Merge tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add ACPI support to the intel_idle driver along with an admin
guide document for it, add support for CPR (Core Power Reduction) to
the AVS (Adaptive Voltage Scaling) subsystem, add new hardware support
in a few places, add some new sysfs attributes, debugfs files and
tracepoints, fix bugs and clean up a bunch of things all over.
Specifics:
- Update the ACPI processor driver in order to export
acpi_processor_evaluate_cst() to the code outside of it, add ACPI
support to the intel_idle driver based on that and clean up that
driver somewhat (Rafael Wysocki).
- Add an admin guide document for the intel_idle driver (Rafael
Wysocki).
- Clean up cpuidle core and drivers, enable compilation testing for
some of them (Benjamin Gaignard, Krzysztof Kozlowski, Rafael
Wysocki, Yangtao Li).
- Fix reference counting of OPP (operating performance points) table
structures (Viresh Kumar).
- Add support for CPR (Core Power Reduction) to the AVS (Adaptive
Voltage Scaling) subsystem (Niklas Cassel, Colin Ian King,
YueHaibing).
- Add support for TigerLake Mobile and JasperLake to the Intel RAPL
power capping driver (Zhang Rui).
- Update cpufreq drivers:
- Add i.MX8MP support to imx-cpufreq-dt (Anson Huang).
- Fix usage of a macro in loongson2_cpufreq (Alexandre Oliva).
- Fix cpufreq policy reference counting issues in s3c and
brcmstb-avs (chenqiwu).
- Fix ACPI table reference counting issue and HiSilicon quirk
handling in the CPPC driver (Hanjun Guo).
- Clean up spelling mistake in intel_pstate (Harry Pan).
- Convert the kirkwood and tegra186 drivers to using
devm_platform_ioremap_resource() (Yangtao Li).
- Update devfreq core:
- Add 'name' sysfs attribute for devfreq devices (Chanwoo Choi).
- Clean up the handing of transition statistics and allow them to
be reset by writing 0 to the 'trans_stat' devfreq device
attribute in sysfs (Kamil Konieczny).
- Add 'devfreq_summary' to debugfs (Chanwoo Choi).
- Clean up kerneldoc comments and Kconfig indentation (Krzysztof
Kozlowski, Randy Dunlap).
- Update devfreq drivers:
- Add dynamic scaling for the imx8m DDR controller and clean up
imx8m-ddrc (Leonard Crestez, YueHaibing).
- Fix DT node reference counting and nitialization error code path
in rk3399_dmc and add COMPILE_TEST and HAVE_ARM_SMCCC dependency
for it (Chanwoo Choi, Yangtao Li).
- Fix DT node reference counting in rockchip-dfi and make it use
devm_platform_ioremap_resource() (Yangtao Li).
- Fix excessive stack usage in exynos-ppmu (Arnd Bergmann).
- Fix initialization error code paths in exynos-bus (Yangtao Li).
- Clean up exynos-bus and exynos somewhat (Artur Świgoń, Krzysztof
Kozlowski).
- Add tracepoints for tracking usage_count updates unrelated to
status changes in PM-runtime (Michał Mirosław).
- Add sysfs attribute to control the "sync on suspend" behavior
during system-wide suspend (Jonas Meurer).
- Switch system-wide suspend tests over to 64-bit time (Alexandre
Belloni).
- Make wakeup sources statistics in debugfs cover deleted ones which
used to be the case some time ago (zhuguangqing).
- Clean up computations carried out during hibernation, update
messages related to hibernation and fix a spelling mistake in one
of them (Wen Yang, Luigi Semenzato, Colin Ian King).
- Add mailmap entry for maintainer e-mail address that has not been
functional for several years (Rafael Wysocki)"
* tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits)
cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
intel_idle: Clean up irtl_2_usec()
intel_idle: Move 3 functions closer to their callers
intel_idle: Annotate initialization code and data structures
intel_idle: Move and clean up intel_idle_cpuidle_devices_uninit()
intel_idle: Rearrange intel_idle_cpuidle_driver_init()
intel_idle: Clean up NULL pointer check in intel_idle_init()
intel_idle: Fold intel_idle_probe() into intel_idle_init()
intel_idle: Eliminate __setup_broadcast_timer()
cpuidle: fix cpuidle_find_deepest_state() kerneldoc warnings
cpuidle: sysfs: fix warnings when compiling with W=1
cpuidle: coupled: fix warnings when compiling with W=1
cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
PM / devfreq: Add debugfs support with devfreq_summary file
Documentation: admin-guide: PM: Add intel_idle document
cpuidle: arm: Enable compile testing for some of drivers
PM-runtime: add tracepoints for usage_count changes
cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
...
Linus Torvalds [Mon, 27 Jan 2020 19:18:55 +0000 (11:18 -0800)]
Merge tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"Hardly anything going on in the core this time around with the
regulator API and pretty quiet on the driver front:
- An API for comparing regulators, useful for devices that need to
check if supply voltages exactly match rather than just nominally
match.
- Conversion of several DT bindings to YAML format.
- Conversion of I2C drivers to probe_new().
- New drivers for Monolithic MPQ7920 and MP8859, and Rohm BD71828"
* tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (34 commits)
dt-bindings: regulator: add document bindings for mpq7920
regulator: core: Fix exported symbols to the exported GPL version
regulator: mpq7920: Fix incorrect defines
regulator: vqmmc-ipq4019: Fix platform_no_drv_owner.cocci warnings
regulator: vctrl-regulator: Avoid deadlock getting and setting the voltage
regulator fix for "regulator: core: Add regulator_is_equal() helper"
regulator: core: Add regulator_is_equal() helper
regulator: mpq7920: Convert to use .probe_new
regulator: mpq7920: Remove unneeded fields from struct mpq7920_regulator_info
regulator: vqmmc-ipq4019: Trivial clean up
regulator: vqmmc-ipq4019: Remove ipq4019_regulator_remove
regulator: bindings: Drop document bindings for mpq7920
dt-bindings: Drop entry for Monolithic Power System, MPS
regulator: bd718x7: Simplify the code by removing struct bd718xx_pmic_inits
regulator: add IPQ4019 SDHCI VQMMC LDO driver
regulator: Convert i2c drivers to use .probe_new
regulator: mpq7920: Check the correct variable in mpq7920_regulator_register()
regulator: mpq7920: Fix Woverflow warning on conversion
regulator: mp8859: tidy up white space in probe
regulator: mpq7920: add mpq7920 regulator driver
...
Linus Torvalds [Mon, 27 Jan 2020 19:15:34 +0000 (11:15 -0800)]
Merge tag 'spi-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"Not much going on in the core for SPI this time but a reasonable
amount of change in the drivers:
- Removal of dmal_request_slave_channel() from Peter Ujfalusi.
- More conversions of drivers to GPIO descriptors from Linus Walleij.
- A big rework of the sh-msiof driver from Geert Uytterhoeven moving
it over to the generic native chipselect support.
- DMA support for the uniphier driver from Kunihiko Hayashi.
- New driver support for HiSilcon v3xx SPI NOR controllers from John
Garry"
* tag 'spi-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (52 commits)
dt-binding: spi: add NPCM PSPI reset binding
spi: pxa2xx: Avoid touching SSCR0_SSE on MMP2
spi: spi-fsl-qspi: Ensure width is respected in spi-mem operations
spi: npcm-pspi: modify reset support
spi: npcm-pspi: improve spi transfer performance
spi: spi-ti-qspi: fix warning
spi: npcm-pspi: fix 16 bit send and receive support
spi: pxa2xx: Add support for Intel Comet Lake PCH-V
spi: fsl: simplify error path in of_fsl_spi_probe()
spi: fsl-lpspi: fix only one cs-gpio working
spi: spi-ti-qspi: optimize byte-transfers
spi: spi-ti-qspi: support large flash devices
spi: spi-qcom-qspi: Use device managed memory for clk_bulk_data
MAINTAINERS: Add a maintainer for the HiSilicon v3xx SFC driver
spi: Add HiSilicon v3xx SPI NOR flash controller driver
dt-bindings: spi_atmel: add microchip,sam9x60-spi
spi: bcm2835: Raise maximum number of slaves to 4
spi: sh-msiof: Do not redefine STR while compile testing
spi: rspi: Add support for GPIO chip selects
spi: rspi: Add support for multiple native chip selects
...
Linus Torvalds [Mon, 27 Jan 2020 19:13:02 +0000 (11:13 -0800)]
Merge tag 'regmap-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
"This is quite a busy release for a subsystem that's usually very
quiet, though still a small set of updates in the grand scheme of
things:
- A fix for writes to non-incrementing registers.
- An iopoll() style helper for use with atomic safe regmaps, making
it easier to transition from raw memory mapped I/O.
- Some constification"
* tag 'regmap-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: fix writes to non incrementing registers
regmap: add iopoll-like atomic polling macro
regmap-i2c: constify regmap_bus structures
Walk the host page tables to identify hugepage mappings for ZONE_DEVICE
pfns, i.e. DAX pages. Explicitly query kvm_is_zone_device_pfn() when
deciding whether or not to bother walking the host page tables, as DAX
pages do not set up the head/tail infrastructure, i.e. will return false
for PageCompound() even when using huge pages.
Zap ZONE_DEVICE sptes when disabling dirty logging, e.g. if live
migration fails, to allow KVM to rebuild large pages for DAX-based
mappings. Presumably DAX favors large pages, and worst case scenario is
a minor performance hit as KVM will need to re-fault all DAX-based
pages.
KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
Remove the late "lpage is disallowed" check from set_spte() now that the
initial check is performed after acquiring mmu_lock. Fold the guts of
the remaining helper, __mmu_gfn_lpage_is_disallowed(), into
kvm_mmu_hugepage_adjust() to eliminate the unnecessary slot !NULL check.
KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB
mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a
need to pre-calculate the max mapping level. Co-locating all hugepage
checks eliminates a memslot lookup, at the cost of performing the
__mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock.
The latency of lpage_is_disallowed() is likely negligible relative to
the rest of the code run while holding mmu_lock, and can be offset to
some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in
set_spte() in a future patch. Eliminating the check in set_spte() is
made possible by performing the initial lpage_is_disallowed() checks
while holding mmu_lock.
KVM: x86/mmu: Zap any compound page when collapsing sptes
Zap any compound page, e.g. THP or HugeTLB pages, when zapping sptes
that can potentially be converted to huge sptes after disabling dirty
logging on the associated memslot. Note, this approach could result in
false positives, e.g. if a random compound page is mapped into the
guest, but mapping non-huge compound pages into the guest is far from
the norm, and toggling dirty logging is not a frequent operation.
KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
Remove logic to retrieve the original gfn now that HugeTLB mappings are
are identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts
the level or gfn.
KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings
Remove KVM's HugeTLB specific logic and instead rely on walking the host
page tables (already done for THP) to identify HugeTLB mappings.
Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling
find_vma() for all hugepage compatible page faults, and simplifies KVM's
page fault code by consolidating all hugepage adjustments into a common
helper.
KVM: x86/mmu: Drop level optimization from fast_page_fault()
Remove fast_page_fault()'s optimization to stop the shadow walk if the
iterator level drops below the intended map level. The intended map
level is only acccurate for HugeTLB mappings (THP mappings are detected
after fast_page_fault()), i.e. it's not required for correctness, and
a future patch will also move HugeTLB mapping detection to after
fast_page_fault().
KVM: x86/mmu: Walk host page tables to find THP mappings
Explicitly walk the host page tables to identify THP mappings instead
of relying solely on the metadata in struct page. This sets the stage
for using a common method of identifying huge mappings regardless of the
underlying implementation (HugeTLB vs THB vs DAX), and hopefully avoids
the pitfalls of relying on metadata to identify THP mappings, e.g. see
commit 169226f7e0d2 ("mm: thp: handle page cache THP correctly in
PageTransCompoundMap") and the need for KVM to explicitly check for a
THP compound page. KVM will also naturally work with 1gb THP pages, if
they are ever supported.
Walking the tables for THP mappings is likely marginally slower than
querying metadata, but a future patch will reuse the walk to identify
HugeTLB mappings, at which point eliminating the existing VMA lookup for
HugeTLB will make this a net positive.
KVM: x86/mmu: Refactor THP adjust to prep for changing query
Refactor transparent_hugepage_adjust() in preparation for walking the
host page tables to identify hugepage mappings, initially for THP pages,
and eventualy for HugeTLB and DAX-backed pages as well. The latter
cases support 1gb pages, i.e. the adjustment logic needs access to the
max allowed level.
Add a helper, lookup_address_in_mm(), to traverse the page tables of a
given mm struct. KVM will use the helper to retrieve the host mapping
level, e.g. 4k vs. 2mb vs. 1gb, of a compound (or DAX-backed) page
without having to resort to implementation specific metadata. E.g. KVM
currently uses different logic for HugeTLB vs. THP, and would add a
third variant for DAX-backed files.
KVM: Play nice with read-only memslots when querying host page size
Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
this allows x86 to create large mappings for read-only memslots that
are backed by HugeTLB mappings.
Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
support") states "If the largepage contains write-protected pages, a
large pte is not used.", but "write-protected" refers to pages that are
temporarily read-only, e.g. read-only memslots didn't even exist at the
time.
Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]>
[Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
mm: thp: KVM: Explicitly check for THP when populating secondary MMU
Add a helper, is_transparent_hugepage(), to explicitly check whether a
compound page is a THP and use it when populating KVM's secondary MMU.
The explicit check fixes a bug where a remapped compound page, e.g. for
an XDP Rx socket, is mapped into a KVM guest and is mistaken for a THP,
which results in KVM incorrectly creating a huge page in its secondary
MMU.
KVM: x86/mmu: Enforce max_level on HugeTLB mappings
Limit KVM's mapping level for HugeTLB based on its calculated max_level.
The max_level check prior to invoking host_mapping_level() only filters
out the case where KVM cannot create a 2mb mapping, it doesn't handle
the scenario where KVM can create a 2mb but not 1gb mapping, and the
host is using a 1gb HugeTLB mapping.
Fixes: 2f57b7051fe8 ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level") Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
KVM: Return immediately if __kvm_gfn_to_hva_cache_init() fails
Check the result of __kvm_gfn_to_hva_cache_init() and return immediately
instead of relying on the kvm_is_error_hva() check to detect errors so
that it's abundantly clear KVM intends to immediately bail on an error.
Note, the hva check is still mandatory to handle errors on subqeuesnt
calls with the same generation. Similarly, always return -EFAULT on
error so that multiple (bad) calls for a given generation will get the
same result, e.g. on an illegal gfn wrap, propagating the return from
__kvm_gfn_to_hva_cache_init() would cause the initial call to return
-EINVAL and subsequent calls to return -EFAULT.
KVM: Clean up __kvm_gfn_to_hva_cache_init() and its callers
Barret reported a (technically benign) bug where nr_pages_avail can be
accessed without being initialized if gfn_to_hva_many() fails.
virt/kvm/kvm_main.c:2193:13: warning: 'nr_pages_avail' may be
used uninitialized in this function [-Wmaybe-uninitialized]
Rather than simply squashing the warning by initializing nr_pages_avail,
fix the underlying issues by reworking __kvm_gfn_to_hva_cache_init() to
return immediately instead of continuing on. Now that all callers check
the result and/or bail immediately on a bad hva, there's no need to
explicitly nullify the memslot on error.
KVM: Check for a bad hva before dropping into the ghc slow path
When reading/writing using the guest/host cache, check for a bad hva
before checking for a NULL memslot, which triggers the slow path for
handing cross-page accesses. Because the memslot is nullified on error
by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
crossing into a new page, then the kvm_{read,write}_guest() slow path
could potentially write/access the first chunk prior to detecting the
bad hva.
Arguably, performing a partial access is semantically correct from an
architectural perspective, but that behavior is certainly not intended.
In the original implementation, memslot was not explicitly nullified
and therefore the partial access behavior varied based on whether the
memslot itself was null, or if the hva was simply bad. The current
behavior was introduced as a seemingly unintentional side effect in
commit f1b9dd5eb86c ("kvm: Disallow wraparound in
kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it sit seems
prudent to clear ghc->memslot in the event of an error".
Regardless of intent, the partial access is dependent on _not_ checking
the result of the cache initialization, which is arguably a bug in its
own right, at best simply weird.
Krish Sadhukhan [Thu, 16 Jan 2020 00:54:32 +0000 (19:54 -0500)]
KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests
According to section "Checks on Guest Control Registers, Debug Registers, and
and MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
of nested guests:
If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
field must be 0.
In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the
latter synthesizes a #GP if any bit in the high dword in the former is set.
Hence this field needs to be checked in software.
Alex Shi [Thu, 16 Jan 2020 03:32:39 +0000 (11:32 +0800)]
KVM: remove unused guest_enter
After commit 61bd0f66ff92 ("KVM: PPC: Book3S HV: Fix guest time accounting
with VIRT_CPU_ACCOUNTING_GEN"), no one use this function anymore, So better
to remove it.
Paolo Bonzini [Thu, 9 Jan 2020 14:57:19 +0000 (09:57 -0500)]
KVM: Move running VCPU from ARM to common code
For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).
Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.
Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code. This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.
Peter Xu [Thu, 9 Jan 2020 14:57:16 +0000 (09:57 -0500)]
KVM: X86: Drop x86_set_memory_region()
The helper x86_set_memory_region() is only used in vmx_set_tss_addr()
and kvm_arch_destroy_vm(). Push the lock upper in both cases. With
that, drop x86_set_memory_region().
This prepares to allow __x86_set_memory_region() to return a HVA
mapped, because the HVA will need to be protected by the lock too even
after __x86_set_memory_region() returns.
Peter Xu [Thu, 9 Jan 2020 14:57:12 +0000 (09:57 -0500)]
KVM: Add build-time error check on kvm_run size
It's already going to reach 2400 Bytes (which is over half of page
size on 4K page archs), so maybe it's good to have this build-time
check in case it overflows when adding new fields.
Vitaly Kuznetsov [Wed, 15 Jan 2020 17:10:12 +0000 (18:10 +0100)]
x86/kvm/hyper-v: remove stale evmcs_already_enabled check from nested_enable_evmcs()
In nested_enable_evmcs() evmcs_already_enabled check doesn't really do
anything: controls are already sanitized and we return '0' regardless.
Just drop the check.
KVM: x86: Perform non-canonical checks in 32-bit KVM
Remove the CONFIG_X86_64 condition from the low level non-canonical
helpers to effectively enable non-canonical checks on 32-bit KVM.
Non-canonical checks are performed by hardware if the CPU *supports*
64-bit mode, whether or not the CPU is actually in 64-bit mode is
irrelevant.
For the most part, skipping non-canonical checks on 32-bit KVM is ok-ish
because 32-bit KVM always (hopefully) drops bits 63:32 of whatever value
it's checking before propagating it to hardware, and architecturally,
the expected behavior for the guest is a bit of a grey area since the
vCPU itself doesn't support 64-bit mode. I.e. a 32-bit KVM guest can
observe the missed checks in several paths, e.g. INVVPID and VM-Enter,
but it's debatable whether or not the missed checks constitute a bug
because technically the vCPU doesn't support 64-bit mode.
The primary motivation for enabling the non-canonical checks is defense
in depth. As mentioned above, a guest can trigger a missed check via
INVVPID or VM-Enter. INVVPID is straightforward as it takes a 64-bit
virtual address as part of its 128-bit INVVPID descriptor and fails if
the address is non-canonical, even if INVVPID is executed in 32-bit PM.
Nested VM-Enter is a bit more convoluted as it requires the guest to
write natural width VMCS fields via memory accesses and then VMPTRLD the
VMCS, but it's still possible. In both cases, KVM is saved from a true
bug only because its flows that propagate values to hardware (correctly)
take "unsigned long" parameters and so drop bits 63:32 of the bad value.
Explicitly performing the non-canonical checks makes it less likely that
a bad value will be propagated to hardware, e.g. in the INVVPID case,
if __invvpid() didn't implicitly drop bits 63:32 then KVM would BUG() on
the resulting unexpected INVVPID failure due to hardware rejecting the
non-canonical address.
The only downside to enabling the non-canonical checks is that it adds a
relatively small amount of overhead, but the affected flows are not hot
paths, i.e. the overhead is negligible.
Note, KVM technically could gate the non-canonical checks on 32-bit KVM
with static_cpu_has(X86_FEATURE_LM), but on bare metal that's an even
bigger waste of code for everyone except the 0.00000000000001% of the
population running on Yonah, and nested 32-bit on 64-bit already fudges
things with respect to 64-bit CPU behavior.
Signed-off-by: Sean Christopherson <[email protected]>
[Also do so in nested_vmx_check_host_state as reported by Krish. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
Oliver Upton [Sat, 14 Dec 2019 00:33:58 +0000 (16:33 -0800)]
KVM: nVMX: WARN on failure to set IA32_PERF_GLOBAL_CTRL
Writes to MSR_CORE_PERF_GLOBAL_CONTROL should never fail if the VM-exit
and VM-entry controls are exposed to L1. Promote the checks to perform a
full WARN if kvm_set_msr() fails and remove the now unused macro
SET_MSR_OR_WARN().
KVM: x86: Remove unused ctxt param from emulator's FPU accessors
Remove an unused struct x86_emulate_ctxt * param from low level helpers
used to access guest FPU state. The unused param was left behind by
commit 6ab0b9feb82a ("x86,kvm: remove KVM emulator get_fpu / put_fpu").
KVM: x86: Revert "KVM: X86: Fix fpu state crash in kvm guest"
Reload the current thread's FPU state, which contains the guest's FPU
state, to the CPU registers if necessary during vcpu_enter_guest().
TIF_NEED_FPU_LOAD can be set any time control is transferred out of KVM,
e.g. if I/O is triggered during a KVM call to get_user_pages() or if a
softirq occurs while KVM is scheduled in.
Moving the handling of TIF_NEED_FPU_LOAD from vcpu_enter_guest() to
kvm_arch_vcpu_load(), effectively kvm_sched_in(), papered over a bug
where kvm_put_guest_fpu() failed to account for TIF_NEED_FPU_LOAD. The
easiest way to the kvm_put_guest_fpu() bug was to run with involuntary
preemption enable, thus handling TIF_NEED_FPU_LOAD during kvm_sched_in()
made the bug go away. But, removing the handling in vcpu_enter_guest()
exposed KVM to the rare case of a softirq triggering kernel_fpu_begin()
between vcpu_load() and vcpu_enter_guest().
Now that kvm_{load,put}_guest_fpu() correctly handle TIF_NEED_FPU_LOAD,
revert the commit to both restore the vcpu_enter_guest() behavior and
eliminate the superfluous switch_fpu_return() in kvm_arch_vcpu_load().
Note, leaving the handling in kvm_arch_vcpu_load() isn't wrong per se,
but it is unnecessary, and most critically, makes it extremely difficult
to find bugs such as the kvm_put_guest_fpu() issue due to shrinking the
window where a softirq can corrupt state.
A sample trace triggered by warning if TIF_NEED_FPU_LOAD is set while
vcpu state is loaded:
KVM: x86: Ensure guest's FPU state is loaded when accessing for emulation
Lock the FPU regs and reload the current thread's FPU state, which holds
the guest's FPU state, to the CPU registers if necessary prior to
accessing guest FPU state as part of emulation. kernel_fpu_begin() can
be called from softirq context, therefore KVM must ensure softirqs are
disabled (locking the FPU regs disables softirqs) when touching CPU FPU
state.
Note, for all intents and purposes this reverts commit 6ab0b9feb82a7
("x86,kvm: remove KVM emulator get_fpu / put_fpu"), but at the time it
was applied, removing get/put_fpu() was correct. The re-introduction
of {get,put}_fpu() is necessitated by the deferring of FPU state load.
Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
KVM: x86: Handle TIF_NEED_FPU_LOAD in kvm_{load,put}_guest_fpu()
Handle TIF_NEED_FPU_LOAD similar to how fpu__copy() handles the flag
when duplicating FPU state to a new task struct. TIF_NEED_FPU_LOAD can
be set any time control is transferred out of KVM, be it voluntarily,
e.g. if I/O is triggered during a KVM call to get_user_pages, or
involuntarily, e.g. if softirq runs after an IRQ occurs. Therefore,
KVM must account for TIF_NEED_FPU_LOAD whenever it is (potentially)
accessing CPU FPU state.
Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>