Git Repo - linux.git/log

Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"Features, highlights:

   - async discard
       - "mount -o discard=async" to enable it
       - freed extents are not discarded immediately, but grouped
         together and trimmed later, with IO rate limiting
       - the "sync" mode submits short extents that could have been
         ignored completely by the device, for SATA prior to 3.1 the
         requests are unqueued and have a big impact on performance
       - the actual discard IO requests have been moved out of
         transaction commit to a worker thread, improving commit latency
       - IO rate and request size can be tuned by sysfs files, for now
         enabled only with CONFIG_BTRFS_DEBUG as we might need to
         add/delete the files and don't have a stable-ish ABI for
         general use, defaults are conservative

   - export device state info in sysfs, eg. missing, writeable

   - no discard of extents known to be untouched on disk (eg. after
     reservation)

   - device stats reset is logged with process name and PID that called
     the ioctl

  Fixes:

   - fix missing hole after hole punching and fsync when using NO_HOLES

   - writeback: range cyclic mode could miss some dirty pages and lead
     to OOM

   - two more corner cases for metadata_uuid change after power loss
     during the change

   - fix infinite loop during fsync after mix of rename operations

  Core changes:

   - qgroup assign returns ENOTCONN when quotas not enabled, used to
     return EINVAL that was confusing

   - device closing does not need to allocate memory anymore

   - snapshot aware code got removed, disabled for years due to
     performance problems, reimplementation will allow selecting whether
     defrag breaks or does not break COW on shared extents

   - tree-checker:
       - check leaf chunk item size, cross check against number of
         stripes
       - verify location keys for DIR_ITEM, DIR_INDEX and XATTR items

   - new self test for physical -> logical mapping code, used for super
     block range exclusion

   - assertion helpers/macros updated to avoid objtool "unreachable
     code" reports on older compilers or config option combinations"

* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
  btrfs: free block groups after free'ing fs trees
  btrfs: Fix split-brain handling when changing FSID to metadata uuid
  btrfs: Handle another split brain scenario with metadata uuid feature
  btrfs: Factor out metadata_uuid code from find_fsid.
  btrfs: Call find_fsid from find_fsid_inprogress
  Btrfs: fix infinite loop during fsync after rename operations
  btrfs: set trans->drity in btrfs_commit_transaction
  btrfs: drop log root for dropped roots
  btrfs: sysfs, add devid/dev_state kobject and device attributes
  btrfs: Refactor btrfs_rmap_block to improve readability
  btrfs: Add self-tests for btrfs_rmap_block
  btrfs: selftests: Add support for dummy devices
  btrfs: Move and unexport btrfs_rmap_block
  btrfs: separate definition of assertion failure handlers
  btrfs: device stats, log when stats are zeroed
  btrfs: fix improper setting of scanned for range cyclic write cache pages
  btrfs: safely advance counter when looking up bio csums
  btrfs: remove unused member btrfs_device::work
  btrfs: remove unnecessary wrapper get_alloc_profile
  btrfs: add correction to handle -1 edge case in async discard
  ...

Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"

This reverts commit 245595e83fbedda9e107eb0b37cec0ad07733778.

Guido Günther reported issues with this patch that broke existing
user space. Let's revert it for now and fix it properly later on.

Link: https://patchwork.kernel.org/patch/11291089/
https://lore.kernel.org/lkml/20200121114553.2667556 [email protected]/
Cc: Guido Günther <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>

Merge branch 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mtrr updates from Ingo Molnar:
"Two changes: restrict /proc/mtrr to CAP_SYS_ADMIN, plus a cleanup"

* 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mtrr: Require CAP_SYS_ADMIN for all access
  x86/mtrr: Get rid of mtrr_seq_show() forward declaration

Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 FPU updates from Ingo Molnar:
"Three changes: fix a race that can result in FPU corruption, plus two
  cleanups"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Deactivate FPU state after failure during state load
  x86/fpu/xstate: Make xfeature_is_supervisor()/xfeature_is_user() return bool
  x86/fpu/xstate: Fix small issues

Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu-features updates from Ingo Molnar:
"The biggest change in this cycle was a large series from Sean
  Christopherson to clean up the handling of VMX features. This both
  fixes bugs/inconsistencies and makes the code more coherent and
  future-proof.

  There are also two cleanups and a minor TSX syslog messages
  enhancement"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/cpu: Remove redundant cpu_detect_cache_sizes() call
  x86/cpu: Print "VMX disabled" error message iff KVM is enabled
  KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin CPUs
  perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
  KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
  KVM: VMX: Check for full VMX support when verifying CPU compatibility
  KVM: VMX: Use VMX feature flag to query BIOS enabling
  KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR
  x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured
  x86/cpu: Set synthetic VMX cpufeatures during init_ia32_feat_ctl()
  x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
  x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
  x86/vmx: Introduce VMX_FEATURES_*
  x86/cpu: Clear VMX feature flag if VMX is not fully enabled
  x86/zhaoxin: Use common IA32_FEAT_CTL MSR initialization
  x86/centaur: Use common IA32_FEAT_CTL MSR initialization
  x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked
  x86/intel: Initialize IA32_FEAT_CTL MSR at boot
  tools/x86: Sync msr-index.h from kernel sources
  selftests, kvm: Replace manual MSR defs with common msr-index.h
  ...

docs: filesystems: add overlayfs to index.rst

While the document is there, it is currently missing from the
index file.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Link: https://lore.kernel.org/r/3b8e7783b1fcc71e4f94af5ea8e5fa264392f8c4.1580193653.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <[email protected]>

docs: usb: remove some broken references

It seems that some files were removed from USB documentation.

Update the links accordingly.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Link: https://lore.kernel.org/r/00008303fde6b4e06d027d3b76ae7032614a7030.1580193653.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <[email protected]>

selftests/ftrace: fix glob selftest

test.d/ftrace/func-filter-glob.tc is failing on s390 because it has
ARCH_INLINE_SPIN_LOCK and friends set to 'y'. So the usual
__raw_spin_lock symbol isn't in the ftrace function list. Change
'*aw*lock' to '*spin*lock' which would hopefully match some of the
locking functions on all platforms.

Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Sven Schnelle <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>
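
The effect of the pattern change can be illustrated in userspace. The
following is only a rough sketch: it uses POSIX fnmatch(3) as a stand-in
for the ftrace glob matcher (an assumption, the kernel has its own
matcher), and both symbol lists are hypothetical examples rather than
real ftrace function lists.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if at least one symbol in the list matches the glob. */
static bool glob_matches_any(const char *pattern,
			     const char *const *symbols, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (fnmatch(pattern, symbols[i], 0) == 0)
			return true;
	}
	return false;
}

/* Hypothetical function list where __raw_spin_lock is a real symbol. */
static const char *const regular_build_symbols[] = {
	"__raw_spin_lock",
	"example_spin_lock",
	"example_spin_trylock",
};

/* Hypothetical function list where the raw lock helpers are inlined. */
static const char *const inlined_build_symbols[] = {
	"example_spin_lock",
	"example_spin_trylock",
};
```

With these made-up lists, '*aw*lock' only finds a match in the first
list, while '*spin*lock' finds one in both, which is the property the
selftest needs.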

Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Ingo Molnar:
"Misc changes:

   - Enhance #GP fault printouts by distinguishing between canonical and
     non-canonical address faults, and also add KASAN fault decoding.

   - Fix/enhance the x86 NMI handler by putting the duration check into
     a direct function call instead of an irq_work which we know to be
     broken in some cases.

   - Clean up do_general_protection() a bit"

* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Remove irq_work from the long duration NMI handler
  x86/traps: Cleanup do_general_protection()
  x86/kasan: Print original address on #GP
  x86/dumpstack: Introduce die_addr() for die() with #GP fault address
  x86/traps: Print address on #GP
  x86/insn-eval: Add support for 64-bit kernel mode
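
For readers unfamiliar with the canonical/non-canonical distinction
mentioned above, here is a minimal sketch of the check. It assumes 48
implemented virtual address bits (4-level paging); the macro and function
names are hypothetical and are not taken from the kernel sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits (4-level paging). */
#define VADDR_BITS 48

/*
 * An address is canonical when bits 63..47 are all copies of bit 47,
 * i.e. the upper 17 bits are either all zero or all one.
 */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> (VADDR_BITS - 1);
	uint64_t all_ones = (UINT64_C(1) << (64 - VADDR_BITS + 1)) - 1;

	return top == 0 || top == all_ones;
}
```

Under that assumption, 0x00007fffffffffff and 0xffff800000000000 are
canonical, while 0x0000800000000000 is not.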

Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
"Misc cleanups all around the map"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: Remove amd_get_topology_early()
  x86/tsc: Remove redundant assignment
  x86/crash: Use resource_size()
  x86/cpu: Add a missing prototype for arch_smt_update()
  x86/nospec: Remove unused RSB_FILL_LOOPS
  x86/vdso: Provide missing include file
  x86/Kconfig: Correct spelling and punctuation
  Documentation/x86/boot: Fix typo
  x86/boot: Fix a comment's incorrect file reference
  x86/process: Remove set but not used variables prev and next
  x86/Kconfig: Fix Kconfig indentation

Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Ingo Molnar:
"The main change in this tree is the extension of the resctrl procfs
  ABI with a new file that helps tooling to navigate from tasks back to
  resctrl groups: /proc/{pid}/cpu_resctrl_groups.

  Also fix static key usage for certain feature combinations and
  simplify the task exit resctrl case"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Add task resctrl information display
  x86/resctrl: Check monitoring static key in the MBM overflow handler
  x86/resctrl: Do not reconfigure exiting tasks

Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot update from Ingo Molnar:
"Two minor changes: fix an atypical binutils combination build bug, and
  also fix a VRAM size check for simplefb"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sysfb: Fix check for bad VRAM size
  x86/boot: Discard .eh_frame sections

Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm updates from Ingo Molnar:
"Misc updates:

   - Remove last remaining calls to exception_enter/exception_exit() and
     simplify the entry code some more.

   - Remove force_iret()

   - Add support for "Fast Short Rep Mov", which is available starting
     with Ice Lake Intel CPUs - and make the x86 assembly version of
     memmove() use REP MOV for all sizes when FSRM is available.

   - Micro-optimize/simplify the 32-bit boot code a bit.

   - Use a more future-proof SYSRET instruction mnemonic"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Simplify calculation of output address
  x86/entry/64: Add instruction suffix to SYSRET
  x86: Remove force_iret()
  x86/cpufeatures: Add support for fast short REP; MOVSB
  x86/context-tracking: Remove exception_enter/exit() from KVM_PV_REASON_PAGE_NOT_PRESENT async page fault
  x86/context-tracking: Remove exception_enter/exit() from do_page_fault()

Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic fix from Ingo Molnar:
"A single commit that simplifies the code and gets rid of a compiler
  warning"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/uv: Avoid unused variable warning

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:

   - More -rt motivated separation of CONFIG_PREEMPT and
     CONFIG_PREEMPTION.

   - Add more low level scheduling topology sanity checks and warnings
     to filter out nonsensical topologies that break scheduling.

   - Extend uclamp constraints to influence wakeup CPU placement

   - Make the RT scheduler more aware of asymmetric topologies and CPU
     capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

   - Make idle CPU selection more consistent

   - Various fixes, smaller cleanups, updates and enhancements - please
     see the git log for details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  sched/fair: Define sched_idle_cpu() only for SMP configurations
  sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  idle: fix spelling mistake "iterrupts" -> "interrupts"
  sched/fair: Remove redundant call to cpufreq_update_util()
  sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
  sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
  sched/fair: calculate delta runnable load only when it's needed
  sched/cputime: move rq parameter in irqtime_account_process_tick
  stop_machine: Make stop_cpus() static
  sched/debug: Reset watchdog on all CPUs while processing sysrq-t
  sched/core: Fix size of rq::uclamp initialization
  sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
  sched/fair: Load balance aggressively for SCHED_IDLE CPUs
  sched/fair : Improve update_sd_pick_busiest for spare capacity case
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
  sched/rt: Make RT capacity-aware
  sched/fair: Make EAS wakeup placement consider uclamp restrictions
  sched/fair: Make task_fits_capacity() consider uclamp restrictions
  sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
  sched/uclamp: Make uclamp util helpers use and return UL values
  ...

Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:
"Kernel side changes:

   - Ftrace is one of the last W^X violators (after this only KLP is
     left). These patches move it over to the generic text_poke()
     interface and thereby get rid of this oddity. This requires a
     surprising amount of surgery, by Peter Zijlstra.

   - x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
     count certain types of events that have a special, quirky hw ABI
     (by Kim Phillips)

   - kprobes fixes by Masami Hiramatsu

  Lots of tooling updates as well, the following subcommands were
  updated: annotate/report/top, c2c, clang, record, report/top TUI,
  sched timehist, tests; plus updates were done to the gtk ui, libperf,
  headers and the parser"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  perf/x86/amd: Add support for Large Increment per Cycle Events
  perf/x86/amd: Constrain Large Increment per Cycle events
  perf/x86/intel/rapl: Add Comet Lake support
  tracing: Initialize ret in syscall_enter_define_fields()
  perf header: Use last modification time for timestamp
  perf c2c: Fix return type for histogram sorting comparision functions
  perf beauty sockaddr: Fix augmented syscall format warning
  perf/ui/gtk: Fix gtk2 build
  perf ui gtk: Add missing zalloc object
  perf tools: Use %define api.pure full instead of %pure-parser
  libperf: Setup initial evlist::all_cpus value
  perf report: Fix no libunwind compiled warning break s390 issue
  perf tools: Support --prefix/--prefix-strip
  perf report: Clarify in help that --children is default
  tools build: Fix test-clang.cpp with Clang 8+
  perf clang: Fix build with Clang 9
  kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
  tools lib: Fix builds when glibc contains strlcpy()
  perf report/top: Make 'e' visible in the help and make it toggle showing callchains
  perf report/top: Do not offer annotation for symbols without samples
  ...

Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
"Just a handful of changes in this cycle: an ARM64 performance
  optimization, a comment fix and a debug output fix"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/osq: Use optimized spinning loop for arm64
  locking/qspinlock: Fix inaccessible URL of MCS lock paper
  locking/lockdep: Fix lockdep_stats indentation problem

Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:

   - Cleanup of the GOP [graphics output] handling code in the EFI stub

   - Complete refactoring of the mixed mode handling in the x86 EFI stub

   - Overhaul of the x86 EFI boot/runtime code

   - Increase robustness for mixed mode code

   - Add the ability to disable DMA at the root port level in the EFI
     stub

   - Get rid of RWX mappings in the EFI memory map and page tables,
     where possible

   - Move the support code for the old EFI memory mapping style into its
     only user, the SGI UV1+ support code.

   - plus misc fixes, updates, smaller cleanups.

  ... and due to interactions with the RWX changes, another round of PAT
  cleanups make a guest appearance via the EFI tree - with no side
  effects intended"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  efi/x86: Disable instrumentation in the EFI runtime handling code
  efi/libstub/x86: Fix EFI server boot failure
  efi/x86: Disallow efi=old_map in mixed mode
  x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
  efi/x86: avoid KASAN false positives when accessing the 1:1 mapping
  efi: Fix handling of multiple efi_fake_mem= entries
  efi: Fix efi_memmap_alloc() leaks
  efi: Add tracking for dynamically allocated memmaps
  efi: Add a flags parameter to efi_memory_map
  efi: Fix comment for efi_mem_type() wrt absent physical addresses
  efi/arm: Defer probe of PCIe backed efifb on DT systems
  efi/x86: Limit EFI old memory map to SGI UV machines
  efi/x86: Avoid RWX mappings for all of DRAM
  efi/x86: Don't map the entire kernel text RW for mixed mode
  x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
  efi/libstub/x86: Fix unused-variable warning
  efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
  efi/libstub/x86: Use const attribute for efi_is_64bit()
  efi: Allow disabling PCI busmastering on bridges during boot
  efi/x86: Allow translating 64-bit arguments for mixed mode calls
  ...

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:
"The RCU changes in this cycle were:
   - Expedited grace-period updates
   - kfree_rcu() updates
   - RCU list updates
   - Preemptible RCU updates
   - Torture-test updates
   - Miscellaneous fixes
   - Documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  rcu: Remove unused stop-machine #include
  powerpc: Remove comment about read_barrier_depends()
  .mailmap: Add entries for old [email protected] addresses
  srcu: Apply *_ONCE() to ->srcu_last_gp_end
  rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()
  rcu: Move rcu_{expedited,normal} definitions into rcupdate.h
  rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.h
  rcu: Remove the declaration of call_rcu() in tree.h
  rcu: Fix tracepoint tracking RCU CPU kthread utilization
  rcu: Fix harmless omission of "CONFIG_" from #if condition
  rcu: Avoid tick_dep_set_cpu() misordering
  rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
  rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
  rcu: Clear ->rcu_read_unlock_special only once
  rcu: Clear .exp_hint only when deferred quiescent state has been reported
  rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU
  rcu: Remove kfree_call_rcu_nobatch()
  rcu: Remove kfree_rcu() special casing and lazy-callback handling
  rcu: Add support for debug_objects debugging for kfree_rcu()
  rcu: Add multiple in-flight batches of kfree_rcu() work
  ...

Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:
"The main changes are to move the ORC unwind table sorting from early
  init to build-time - this speeds up booting.

  No change in functionality intended"

* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix !CONFIG_MODULES build warning
  x86/unwind/orc: Remove boot-time ORC unwind tables sorting
  scripts/sorttable: Implement build-time ORC unwind table sorting
  scripts/sorttable: Rename 'sortextable' to 'sorttable'
  scripts/sortextable: Refactor the do_func() function
  scripts/sortextable: Remove dead code
  scripts/sortextable: Clean up the code to meet the kernel coding style better
  scripts/sortextable: Rewrite error/success handling

scripts/dtc: Revert "yamltree: Ensure consistent bracketing of properties with phandles"

This reverts upstream commit 18d7b2f4ee45fec422b7d82bab0b3c762ee907e4. A
revert in upstream dtc is pending.

This commit didn't work for properties such as 'interrupt-map' that have
phandle in the middle of an entry. It would also not work for a 0 or -1
phandle value that acts as a NULL.

Signed-off-by: Rob Herring <[email protected]>

Merge branch 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull header cleanup from Ingo Molnar:
"This is a treewide cleanup, mostly (but not exclusively) with x86
  impact, which breaks implicit dependencies on the asm/realmode.h
  header and finally removes it from asm/acpi.h"

* 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ACPI/sleep: Move acpi_get_wakeup_address() into sleep.c, remove <asm/realmode.h> from <asm/acpi.h>
  ACPI/sleep: Convert acpi_wakeup_address into a function
  x86/ACPI/sleep: Remove an unnecessary include of asm/realmode.h
  ASoC: Intel: Skylake: Explicitly include linux/io.h for virt_to_phys()
  vmw_balloon: Explicitly include linux/io.h for virt_to_phys()
  virt: vbox: Explicitly include linux/io.h to pick up various defs
  efi/capsule-loader: Explicitly include linux/io.h for page_to_phys()
  perf/x86/intel: Explicitly include asm/io.h to use virt_to_phys()
  x86/kprobes: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
  x86/ftrace: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
  x86/boot: Explicitly include realmode.h to handle RM reservations
  x86/efi: Explicitly include realmode.h to handle RM trampoline quirk
  x86/platform/intel/quark: Explicitly include linux/io.h for virt_to_phys()
  x86/setup: Enhance the comments
  x86/setup: Clean up the header portion of setup.c

of: Add OF_DMA_DEFAULT_COHERENT & select it on powerpc

There's an OF helper called of_dma_is_coherent(), which checks if a
device has a "dma-coherent" property to see if the device is coherent
for DMA.

But on some platforms devices are coherent by default, and on some
platforms it's not possible to update existing device trees to add the
"dma-coherent" property.

So add a Kconfig symbol to allow arch code to tell
of_dma_is_coherent() that devices are coherent by default, regardless
of the presence of the property.

Select that symbol on powerpc when NOT_COHERENT_CACHE is not set, ie.
when the system has a coherent cache.

Fixes: 92ea637edea3 ("of: introduce of_dma_is_coherent() helper")
Cc: [email protected] # v3.16+
Reported-by: Christian Zigotzky <[email protected]>
Tested-by: Christian Zigotzky <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Ulf Hansson <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
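
The decision described above can be sketched as follows. This is a
simplified model and not the kernel implementation: struct node and the
default_coherent parameter are hypothetical stand-ins for the device
tree node and for the new Kconfig symbol.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device tree node. */
struct node {
	bool has_dma_coherent;	/* node carries a "dma-coherent" property */
};

/*
 * default_coherent models the arch selecting the new Kconfig symbol.
 * When it is set, the property is not consulted at all.
 */
static bool dma_is_coherent(const struct node *np, bool default_coherent)
{
	if (default_coherent)
		return true;

	return np->has_dma_coherent;
}
```

In this model a node without the property is reported as coherent only
when the default is enabled, which matches the case of device trees that
cannot be updated.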

KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer

According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
RES0 [1]. When reading the register, the value is truncated to the least
significant 32 bits [2], and on writes, TimerValue is treated as a signed
32-bit integer [1, 2].

When the guest behaves correctly and writes 32-bit values, treating TVAL
as an unsigned 64-bit register works as expected. However, things start
to break down when the guest writes larger values, because
(u64)0x1_ffff_ffff = 8589934591, but (s32)0x1_ffff_ffff = -1, and the
former will cause the timer interrupt to be asserted in the future, but
the latter will cause it to be asserted now. Let's treat TVAL as a
signed 32-bit register on writes, to match the behaviour described in
the architecture, and the behaviour experimentally exhibited by the
virtual timer on a non-vhe host.

[1] Arm DDI 0487E.a, section D13.8.18
[2] Arm DDI 0487E.a, section D11.2.4

Signed-off-by: Alexandru Elisei <[email protected]>
[maz: replaced the read-side mask with lower_32_bits]
Signed-off-by: Marc Zyngier <[email protected]>
Fixes: 8fa761624871 ("KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculation")
Link: https://lore.kernel.org/r/[email protected]
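
The arithmetic in the message above can be reproduced with a small
standalone sketch. The helper names are hypothetical and the code is not
taken from the patch; it only contrasts the two interpretations of a
written TVAL and the compare value that results from adding it to the
current counter.

```c
#include <stdint.h>

/* Old behaviour: the written value is used as an unsigned 64-bit offset. */
static uint64_t tval_as_u64(uint64_t written)
{
	return written;
}

/* New behaviour: keep the low 32 bits and sign-extend them. */
static int64_t tval_as_s32(uint64_t written)
{
	uint32_t low = (uint32_t)written;

	if (low & UINT32_C(0x80000000))
		return (int64_t)low - INT64_C(4294967296);
	return (int64_t)low;
}

/* Compare value programmed by a TVAL write: counter + signed offset. */
static uint64_t compare_value(uint64_t now, uint64_t written)
{
	return now + (uint64_t)tval_as_s32(written);
}
```

For a write of 0x1ffffffff at counter value 1000, the signed view gives a
compare value of 999, so the timer condition is already met, whereas the
unsigned view would place it far in the future.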

KVM: arm64: pmu: Only handle supported event counters

Let the code never use unsupported event counters. Change
kvm_pmu_handle_pmcr() to only reset supported counters and
kvm_pmu_vcpu_reset() to only stop supported counters.

Other actions are filtered on the supported counters in
kvm/sysregs.c

Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: pmu: Fix chained SW_INCR counters

At the moment a SW_INCR counter always overflows on the 32-bit
boundary, regardless of whether the n+1th counter is
programmed as CHAIN.

Check whether the SW_INCR counter is a 64b counter and if so,
implement the 64b logic.

Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
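
A minimal model of the 64b logic described above, under the assumption
that a chained pair behaves like one 64-bit counter split over an even
(low) and an odd (high) 32-bit register. struct counter_pair and sw_incr()
are hypothetical names, not the KVM data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an even/odd event counter pair. */
struct counter_pair {
	uint32_t low;		/* counter n, counting SW_INCR */
	uint32_t high;		/* counter n+1 */
	bool chained;		/* counter n+1 is programmed as CHAIN */
	bool low_overflowed;	/* overflow flag for counter n */
	bool high_overflowed;	/* overflow flag for counter n+1 */
};

/* Apply one software increment to the low counter of the pair. */
static void sw_incr(struct counter_pair *p)
{
	p->low++;
	if (p->low != 0)
		return;		/* no wrap on the low 32 bits */

	if (p->chained) {
		/* 64-bit case: carry into the high counter instead */
		p->high++;
		if (p->high == 0)
			p->high_overflowed = true;
	} else {
		/* 32-bit case: the low counter itself overflows */
		p->low_overflowed = true;
	}
}
```

In the chained case, wrapping the low half carries into the high half and
raises no overflow on the low counter; only the unchained case flags an
overflow at the 32-bit boundary.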

KVM: arm64: pmu: Don't mark a counter as chained if the odd one is disabled

At the moment we update the chain bitmap on type setting. This
does not take into account the enable state of the odd register.

Let's make sure a counter is never considered as chained if
the high counter is disabled.

We recompute the chain state on enable/disable and type changes.

Also let create_perf_event() use the chain bitmap and not use
kvm_pmu_idx_has_chain_evtype().

Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unset

The specification says PMSWINC increments PMEVCNTR<n>_EL0 by 1
if PMEVCNTR<n>_EL0 is enabled and configured to count SW_INCR.

For PMEVCNTR<n>_EL0 to be enabled, PMCNTENSET must be set for the
corresponding event counter and the PMCR.E bit must be set as well.

Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register")
Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Murray <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
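
The enable condition above can be written as a bitmask computation. The
sketch below is illustrative only: sw_incr_targets() and its parameters
are hypothetical, and the per-counter event type is reduced to a
precomputed mask of counters configured for SW_INCR.

```c
#include <stdint.h>

#define PMCR_E	(UINT64_C(1) << 0)	/* PMCR.E, the global enable bit */

/*
 * Return the set of counters that a PMSWINC write should increment.
 *   written:    bits written to PMSWINC
 *   pmcntenset: per-counter enable bits
 *   sw_incr:    counters configured to count SW_INCR (hypothetical mask)
 */
static uint64_t sw_incr_targets(uint64_t pmcr, uint64_t pmcntenset,
				uint64_t sw_incr, uint64_t written)
{
	/* With PMCR.E clear no event counter is enabled at all. */
	if (!(pmcr & PMCR_E))
		return 0;

	return written & pmcntenset & sw_incr;
}
```

A counter is therefore incremented only when all three conditions hold
for its bit and the global enable is set.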

net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC

Add a default for the MDIO_BCM_IPROC Kconfig setting so that it is on
by default for the IPROC family of devices.

Signed-off-by: Scott Branden <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

udp: segment looped gso packets correctly

Multicast and broadcast packets can be looped from egress to ingress
pre segmentation with dev_loopback_xmit. That function unconditionally
sets ip_summed to CHECKSUM_UNNECESSARY.

udp_rcv_segment segments gso packets in the udp rx path. Segmentation
usually executes on egress, and does not expect packets of this type.
__udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
the offsets are not correct for gso_make_checksum.

UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
to the correct pseudo header checksum. Reset ip_summed to this type.
(CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)

Reported-by: syzbot <[email protected]>
Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netem: change mailing list

The old netem mailing list was inactive and recently was targeted by
spammers. Switch to just using netdev mailing list which is where all
the real change happens.

Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim

There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
and nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace daemons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.

In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want to execute IO to the device.

Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory.  Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.

[ 1626.609191] msgr-worker-1   D    0  1026      1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195]  ? __schedule+0x29b/0x630
[ 1626.609197]  ? wait_for_completion+0xe0/0x170
[ 1626.609198]  schedule+0x30/0xb0
[ 1626.609200]  schedule_timeout+0x1f6/0x2f0
[ 1626.609202]  ? blk_finish_plug+0x21/0x2e
[ 1626.609204]  ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206]  ? wait_for_completion+0xe0/0x170
[ 1626.609208]  wait_for_completion+0x108/0x170
[ 1626.609210]  ? wake_up_q+0x70/0x70
[ 1626.609212]  ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214]  ? xfs_bwrite+0x25/0x60
[ 1626.609215]  xfs_buf_iowait+0x22/0xf0
[ 1626.609218]  __xfs_buf_submit+0x12e/0x250
[ 1626.609220]  xfs_bwrite+0x25/0x60
[ 1626.609222]  xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224]  xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227]  xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228]  super_cache_scan+0x152/0x1a0
[ 1626.609231]  do_shrink_slab+0x12c/0x2d0
[ 1626.609233]  shrink_slab+0x9c/0x2a0
[ 1626.609235]  shrink_node+0xd7/0x470
[ 1626.609237]  do_try_to_free_pages+0xbf/0x380
[ 1626.609240]  try_to_free_pages+0xd9/0x1f0
[ 1626.609245]  __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251]  ? ___slab_alloc+0x238/0x560
[ 1626.609254]  __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259]  skb_page_frag_refill+0x97/0xd0
[ 1626.609274]  sk_page_frag_refill+0x1d/0x80
[ 1626.609279]  tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304]  tcp_sendmsg+0x27/0x40
[ 1626.609307]  sock_sendmsg+0x54/0x60
[ 1626.609308]  ___sys_sendmsg+0x29f/0x320
[ 1626.609313]  ? sock_poll+0x66/0xb0
[ 1626.609318]  ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320]  ? ep_send_events_proc+0xe6/0x230
[ 1626.609322]  ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324]  ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326]  ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327]  ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329]  ? __hrtimer_init+0xb0/0xb0
[ 1626.609331]  ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334]  ? ep_poll+0x26c/0x4a0
[ 1626.609337]  ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339]  ? release_sock+0x43/0x90
[ 1626.609341]  ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342]  __sys_sendmsg+0x47/0x80
[ 1626.609347]  do_syscall_64+0x5f/0x1c0
[ 1626.609349]  ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to avoid being throttled while
writing out data to free up memory.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Tested-by: Masato Suzuki <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
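
A minimal userspace sketch of how a daemon might opt in to the behaviour
described in the commit above. The commit text does not name the prctl
command; the names PR_SET_IO_FLUSHER / PR_GET_IO_FLUSHER and the values
57 / 58 are taken from the mainline uapi header as an assumption, and
mark_io_flusher() is a hypothetical helper. The call is expected to need
CAP_SYS_RESOURCE and to fail with EINVAL on kernels that lack the command.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers. Assumed to match the values in
 * the mainline <linux/prctl.h>; not stated in the commit message. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark the calling thread as being in the IO path.
 * Call it after initial setup and before the first allocation that could
 * recurse into the device this daemon serves.
 *
 * Returns 0 on success or a negative errno value (for example -EPERM
 * without the required capability, -EINVAL on kernels without the command).
 */
int mark_io_flusher(void)
{
    /* The unused prctl arguments are passed as zero. */
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return -errno;
    return 0;
}
```

A daemon would typically treat a failure as non-fatal, log it, and carry
on without the protection.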

Merge branch 'core/kprobes' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'core/documentation' into core/urgent, to pick up single commit

Signed-off-by: Ingo Molnar <[email protected]>

Merge tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 pti updates from Thomas Gleixner:
"The performance deterioration department provides a few non-scary
  fixes and improvements:

   - Update the cached HLE state when the TSX state is changed via the
     new control register. This ensures feature bit consistency.

   - Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS
     vulnerabilities"

* tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/swapgs: Exclude Zhaoxin CPUs from SWAPGS vulnerability
  x86/speculation/spectre_v2: Exclude Zhaoxin CPUs from SPECTRE_V2
  x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR

Merge tag 'irq-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
"The interrupt department provides:

   - A mechanism to shield isolated tasks from managed interrupts:

     The affinity of managed interrupts is completely controlled by the
     kernel and user space has no influence on them. The reason is that
     the automatically assigned affinity correlates to the multi-queue
     CPU handling of block devices.

     If the generated affinity mask spans both housekeeping and isolated
     CPUs the interrupt could be routed to an isolated CPU which would
     then be disturbed by I/O submitted by a housekeeping CPU.

     The new mechanism ensures that as long as one housekeeping CPU is
     online in the assigned affinity mask the interrupt is routed to a
     housekeeping CPU.

     If there is no online housekeeping CPU in the affinity mask, then
     the interrupt is routed to an isolated CPU to keep the device queue
     intact, but unless the isolated CPU submits I/O by itself these
     interrupts are not raised.

   - A small addon to the device tree irqdomain core code to avoid
     duplication in irq chip drivers

   - Conversion of the SiFive PLIC to hierarchical domains

   - The usual pile of new irq chip drivers: SiFive GPIO, Aspeed SCI,
     NXP INTMUX, Meson A1 GPIO

   - The first cut of support for the new ARM GICv4.1

   - The usual pile of fixes and improvements in core and driver code"

* tag 'irq-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  genirq, sched/isolation: Isolate from handling managed interrupts
  irqchip/gic-v4.1: Allow direct invalidation of VLPIs
  irqchip/gic-v4.1: Suppress per-VLPI doorbell
  irqchip/gic-v4.1: Add VPE INVALL callback
  irqchip/gic-v4.1: Add VPE eviction callback
  irqchip/gic-v4.1: Add VPE residency callback
  irqchip/gic-v4.1: Add mask/unmask doorbell callbacks
  irqchip/gic-v4.1: Plumb skeletal VPE irqchip
  irqchip/gic-v4.1: Implement the v4.1 flavour of VMOVP
  irqchip/gic-v4.1: Don't use the VPE proxy if RVPEID is set
  irqchip/gic-v4.1: Implement the v4.1 flavour of VMAPP
  irqchip/gic-v4.1: VPE table (aka GICR_VPROPBASER) allocation
  irqchip/gic-v3: Add GICv4.1 VPEID size discovery
  irqchip/gic-v3: Detect GICv4.1 supporting RVPEID
  irqchip/gic-v3-its: Fix get_vlpi_map() breakage with doorbells
  irqdomain: Fix a memory leak in irq_domain_push_irq()
  irqchip: Add NXP INTMUX interrupt multiplexer support
  dt-bindings: interrupt-controller: Add binding for NXP INTMUX interrupt multiplexer
  irqchip: Define EXYNOS_IRQ_COMBINER
  irqchip/meson-gpio: Add support for meson a1 SoCs
  ...
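
The routing rule for managed interrupts described in the pull message
above can be sketched as a pure function. This is an illustration only,
not the kernel code: CPU sets are modelled as 64-bit masks (bit n stands
for CPU n) and pick_managed_irq_mask() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Hypothetical model of the managed-interrupt routing decision.
 *
 * affinity:     CPUs the kernel assigned to the managed interrupt
 * housekeeping: CPUs that are not isolated
 * online:       CPUs that are currently online
 *
 * If at least one housekeeping CPU in the affinity mask is online, the
 * interrupt is restricted to the housekeeping part of the mask.
 * Otherwise the full mask is kept, so the interrupt can still reach an
 * isolated CPU and the device queue stays intact.
 */
uint64_t pick_managed_irq_mask(uint64_t affinity, uint64_t housekeeping,
                               uint64_t online)
{
    uint64_t hk = affinity & housekeeping;

    if (hk & online)
        return hk;
    return affinity;
}
```

With CPUs 0-3 in the affinity mask and CPUs 0-1 as housekeeping CPUs, the
result is CPUs 0-1 while either of them is online, and CPUs 0-3 once both
are offline.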

Merge tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core SMP updates from Thomas Gleixner:
"A small set of SMP core code changes:

   - Rework the smp function call core code to avoid the allocation of
     an additional cpumask

   - Remove the no longer required GFP argument from on_each_cpu_cond()
     and on_each_cpu_cond_mask() and fix up the callers"

* tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Remove allocation mask from on_each_cpu_cond.*()
  smp: Add a smp_cond_func_t argument to smp_call_function_many()
  smp: Use smp_cond_func_t as type for the conditional function

Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"The timekeeping and timers department provides:

   - Time namespace support:

     If a container migrates from one host to another then it expects
     that clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime
     these clocks can differ significantly on two hosts, in the worst
     case time goes backwards which is a violation of the POSIX
     requirements.

     The time namespace addresses this problem. It allows setting offsets
     for clock MONOTONIC and BOOTTIME once after creation and before
     tasks are associated with the namespace. These offsets are taken
     into account by timers and timekeeping including the VDSO.

     Offsets for wall clock based clocks (REALTIME/TAI) are not provided
     by this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric
     potential use cases which were discussed at Plumbers '18.

     The overhead for tasks in the root namespace (ie where host time
     offsets = 0) is in the noise and great effort was made to ensure
     that especially in the VDSO. If time namespace is disabled in the
     kernel configuration the code is compiled out.

     Kudos to Andrei Vagin and Dmitry Safonov who implemented this
     feature and kept on for more than a year addressing review
     comments, finding better solutions. A pleasant experience.

   - Overhaul of the alarmtimer device dependency handling to ensure
     that the init/suspend/resume ordering is correct.

   - A new clocksource/event driver for Microchip PIT64

   - Suspend/resume support for the Hyper-V clocksource

   - The usual pile of fixes, updates and improvements mostly in the
     driver code"

* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
  alarmtimer: Use wakeup source from alarmtimer platform device
  alarmtimer: Make alarmtimer platform device child of RTC device
  alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
  hrtimer: Add missing sparse annotation for __run_timer()
  lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
  MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
  clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
  clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
  clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
  clocksource/drivers/exynos_mct: Rename Exynos to lowercase
  clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
  clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
  clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
  clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
  clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
  clocksource/drivers/bcm2835_timer: Fix memory leak of timer
  clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
  clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
  clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
  ...
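
To illustrate the offset idea behind the time namespace described above,
here is a small arithmetic sketch. It only shows how an offset for
CLOCK_MONOTONIC (or CLOCK_BOOTTIME) could be derived so that a migrated
container keeps seeing a continuous clock. It does not use any kernel
interface, and timens_offset() and timens_apply() are hypothetical
helpers.

```c
#include <time.h>

#define NS_PER_S 1000000000L

/* Offset that makes the destination host clock read like the source
 * host clock: off = src_now - dst_now, with tv_nsec kept in [0, 1e9). */
struct timespec timens_offset(struct timespec src_now, struct timespec dst_now)
{
    struct timespec off;

    off.tv_sec = src_now.tv_sec - dst_now.tv_sec;
    off.tv_nsec = src_now.tv_nsec - dst_now.tv_nsec;
    if (off.tv_nsec < 0) {
        off.tv_sec -= 1;
        off.tv_nsec += NS_PER_S;
    }
    return off;
}

/* Clock value seen inside the namespace: host clock plus offset. */
struct timespec timens_apply(struct timespec host_now, struct timespec off)
{
    struct timespec t;

    t.tv_sec = host_now.tv_sec + off.tv_sec;
    t.tv_nsec = host_now.tv_nsec + off.tv_nsec;
    if (t.tv_nsec >= NS_PER_S) {
        t.tv_sec += 1;
        t.tv_nsec -= NS_PER_S;
    }
    return t;
}
```

If the source host reads 1000.2 s and the destination host reads 10.7 s,
the offset is 989.5 s, and applying it to the destination clock yields
1000.2 s again, so time does not appear to go backwards after migration.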

Merge tag 'core-debugobjects-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull debugobjects update from Thomas Gleixner:
"A single commit for debug objects which fixes a pile of potential data
races detected by KCSAN"

* tag 'core-debugobjects-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Fix various data races

Merge tag 'core-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull watchdog updates from Thomas Gleixner:
"A set of watchdog/softlockup related improvements:

   - Enforce that the watchdog timestamp is always valid on boot. The
     original implementation caused a watchdog disabled gap of one
     second in the boot process due to truncation of the underlying
     sched clock.

     The sched clock is divided by 1e9 to convert nanoseconds to
     seconds. So for the first second of the boot process the result is
     0 which is at the same time the indicator to disable the watchdog.

     The trivial fix is to change the disabled indicator to ULONG_MAX.

   - Two cleanup patches removing unused and redundant code that
     previous changes forgot to clean up"

* tag 'core-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  watchdog/softlockup: Enforce that timestamp is valid on boot
  watchdog/softlockup: Remove obsolete check of last reported task
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
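
The truncation problem described in the pull message above can be
reproduced with plain integer arithmetic. The sketch below is a userspace
model under the assumptions stated in that message (nanoseconds divided
by 1e9, 0 as the old "disabled" indicator, ULONG_MAX as the new one); the
function names are hypothetical and are not the kernel's.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define NS_PER_S 1000000000ULL

/* Convert a sched clock value in nanoseconds to whole seconds. */
unsigned long clock_to_seconds(uint64_t sched_clock_ns)
{
    return (unsigned long)(sched_clock_ns / NS_PER_S);
}

/* Old scheme: a timestamp of 0 means "watchdog disabled". */
bool watchdog_disabled_old(unsigned long touch_ts)
{
    return touch_ts == 0;
}

/* New scheme: ULONG_MAX means "watchdog disabled", so a timestamp of 0
 * taken during the first second of boot is a valid value. */
bool watchdog_disabled_new(unsigned long touch_ts)
{
    return touch_ts == ULONG_MAX;
}
```

A timestamp taken half a second into boot truncates to 0, which the old
check misreads as "disabled" while the new check treats it as valid.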

Merge tag 'timers-urgent-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
"Two fixes for the generic VDSO code which missed 5.5:

   - Make the update to the coarse timekeeper unconditional.

     This is required because the coarse timekeeper interfaces in the
     VDSO do not depend on a VDSO capable clocksource. If the system
     does not have a VDSO capable clocksource and the update is
     depending on the VDSO capable clocksource, the coarse VDSO
     interfaces would operate on stale data forever.

   - Invert the logic of __arch_update_vdso_data() to avoid further head
     scratching.

     Tripped over this several times while analyzing the update problem
     above"

* tag 'timers-urgent-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lib/vdso: Update coarse timekeeper unconditionally
  lib/vdso: Make __arch_update_vdso_data() logic understandable

Merge tag 'selinux-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull SELinux update from Paul Moore:
"This is one of the bigger SELinux pull requests in recent years with
  28 patches. Everything is passing our test suite and the highlights
  are below:

   - Mark CONFIG_SECURITY_SELINUX_DISABLE as deprecated. We're some time
     away from actually attempting to remove this in the kernel, but the
     only distro we know that still uses it (Fedora) is working on
     moving away from this so we want to at least let people know we are
     planning to remove it.

   - Reorder the SELinux hooks to help prevent bad things when SELinux
     is disabled at runtime. The proper fix is to remove the
     CONFIG_SECURITY_SELINUX_DISABLE functionality (see above) and just
     take care of it at boot time (e.g. "selinux=0").

   - Add SELinux controls for the kernel lockdown functionality,
     introducing a new SELinux class/permissions: "lockdown { integrity
     confidentiality }".

   - Add a SELinux control for move_mount(2) that reuses the "file {
     mounton }" permission.

   - Improvements to the SELinux security label data store lookup
     functions to speed up translations between our internal label
     representations and the visible string labels (both directions).

   - Revisit a previous fix related to SELinux inode auditing and
     permission caching and do it correctly this time.

   - Fix the SELinux access decision cache to clean up properly on error.
     In some extreme cases this could limit the cache size and result in
     a decrease in performance.

   - Enable SELinux per-file labeling for binderfs.

   - The SELinux initialized and disabled flags were wrapped with
     accessors to ensure they are accessed correctly.

   - Mark several key SELinux structures with __randomize_layout.

   - Changes to the LSM build configuration to only build
     security/lsm_audit.c when needed.

   - Changes to the SELinux build configuration to only build the IB
     object cache when CONFIG_SECURITY_INFINIBAND is enabled.

   - Move a number of single-caller functions into their callers.

   - Documentation fixes (/selinux -> /sys/fs/selinux).

   - A handful of cleanup patches that aren't worth mentioning on their
     own, the individual descriptions have plenty of detail"

* tag 'selinux-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: (28 commits)
  selinux: fix regression introduced by move_mount(2) syscall
  selinux: do not allocate ancillary buffer on first load
  selinux: remove redundant allocation and helper functions
  selinux: remove redundant selinux_nlmsg_perm
  selinux: fix wrong buffer types in policydb.c
  selinux: reorder hooks to make runtime disable less broken
  selinux: treat atomic flags more carefully
  selinux: make default_noexec read-only after init
  selinux: move ibpkeys code under CONFIG_SECURITY_INFINIBAND.
  selinux: remove redundant msg_msg_alloc_security
  Documentation,selinux: fix references to old selinuxfs mount point
  selinux: deprecate disabling SELinux and runtime
  selinux: allow per-file labelling for binderfs
  selinuxfs: use scnprintf to get real length for inode
  selinux: remove set but not used variable 'sidtab'
  selinux: ensure the policy has been loaded before reading the sidtab stats
  selinux: ensure we cleanup the internal AVC counters on error in avc_update()
  selinux: randomize layout of key structures
  selinux: clean up selinux_enabled/disabled/enforcing_boot
  selinux: remove unnecessary selinux cred request
  ...

Merge tag 'audit-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit update from Paul Moore:
"One small audit patch for the Linux v5.6 merge window, and
unsurprisingly it passes our test suite with flying colors"

* tag 'audit-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Add __rcu annotation to RCU pointer

Merge branch 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

- cgroup2 interface for hugetlb controller. I think this was the last
   remaining bit which was missing from cgroup2

- fixes for race and a spurious warning in threaded cgroup handling

- other minor changes

* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  iocost: Fix iocost_monitor.py due to helper type mismatch
  cgroup: Prevent double killing of css when enabling threaded cgroup
  cgroup: fix function name in comment
  mm: hugetlb controller for cgroups v2

Merge branch 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:
"Just a couple tracepoint patches"

* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: remove workqueue_work event class
  workqueue: add worker function to workqueue_execute_end tracepoint

io-wq: make the io_wq ref counted

In preparation for sharing an io-wq across different users, add a
reference count that manages destruction of it.

Reviewed-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring: fix refcounting with batched allocations at OOM

When out of memory, the second argument of percpu_ref_put_many() in
io_submit_sqes() may evaluate to "nr - (-EAGAIN)", which is clearly
wrong.

Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references")
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
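
The arithmetic behind the refcounting fix above can be shown in
isolation. This is a standalone model, not the io_uring code:
refs_to_put_buggy() and refs_to_put_fixed() are hypothetical helpers that
compute how many of the nr batched references would be handed back.

```c
#include <errno.h>

/* Buggy variant: when submitted holds -EAGAIN, the result is
 * nr + EAGAIN, so more references are dropped than were taken. */
int refs_to_put_buggy(int nr, int submitted)
{
    return nr - submitted;
}

/* Fixed variant (sketch): an -EAGAIN result means no request consumed a
 * reference, so all nr references are returned. */
int refs_to_put_fixed(int nr, int submitted)
{
    int ref_used = (submitted == -EAGAIN) ? 0 : submitted;

    return nr - ref_used;
}
```

For a batch of 8 with 5 requests submitted, both variants return 3. On
allocation failure the buggy variant returns 8 + EAGAIN, while the fixed
one returns 8.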

io_uring: add comment for drain_next

Draining the middle of a link is tricky, so leave a comment there.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring: don't attempt to copy iovec for READ/WRITE

For the non-vectored variant of READV/WRITEV, we don't need to set up an
async io context, and we flag that appropriately in the io_op_defs
array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c
we didn't have these opcodes, so the check there was added just for the
READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a
single check for needing async context, that covers all four of these
read/write variants that don't use an iovec.

Signed-off-by: Jens Axboe <[email protected]>

scripts/find-unused-docs: Fix massive false positives

scripts/find-unused-docs.sh invokes scripts/kernel-doc to find out if a
source file contains kerneldoc or not.

However, as it passes the no longer supported "-text" option to
scripts/kernel-doc, the latter prints out its help text, causing all
files to be considered containing kerneldoc.

Get rid of these false positives by removing the no longer supported
"-text" option from the scripts/kernel-doc invocation.

Cc: [email protected] # 4.16+
Fixes: b05142675310d2ac ("scripts: kernel-doc: get rid of unused output formats")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

Merge tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap

Pull ioremap updates from Christoph Hellwig:
"Remove the ioremap_nocache API (plus wrappers), which is always
  identical to ioremap"

* tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap:
  remove ioremap_nocache and devm_ioremap_nocache
  MIPS: define ioremap_nocache to ioremap

Merge tag 'for-5.6/libata-2020-01-27' of git://git.kernel.dk/linux-block

Pull libata updates from Jens Axboe:
"As usual pretty quiet, mostly trivial fixes (or dead code removal),
  outside of various fixes for ahci_brcm"

* tag 'for-5.6/libata-2020-01-27' of git://git.kernel.dk/linux-block:
  ata/acard_ahci: remove unused variable n_elem
  ata: pata_macio: fix comparing pointer to 0
  ata: ahci_brcm: BCM7216 reset is self de-asserting
  ata: ahci_brcm: Perform reset after obtaining resources
  ata: brcm: fix reset controller API usage
  ata: brcm: mark PM functions as __maybe_unused
  ata: ahci_brcm: Support BCM7216 reset controller name
  dt-bindings: ata: Document BCM7216 AHCI controller compatible
  ata: ahci_brcm: Add a shutdown callback
  ata: ahci_brcm: Manage reset line during suspend/resume

Merge tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"Like the core side, not a lot of changes here, just two main items:

   - Series of patches (via Coly) with fixes for bcache (Coly,
     Christoph)

   - MD pull request from Song"

* tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block: (31 commits)
  bcache: reap from tail of c->btree_cache in bch_mca_scan()
  bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  bcache: remove member accessed from struct btree
  bcache: print written and keys in trace_bcache_btree_write
  bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
  bcache: add code comments for state->pool in __btree_sort()
  lib: crc64: include <linux/crc64.h> for 'crc64_be'
  bcache: use read_cache_page_gfp to read the superblock
  bcache: store a pointer to the on-disk sb in the cache and cached_dev structures
  bcache: return a pointer to the on-disk sb from read_super
  bcache: transfer the sb_page reference to register_{bdev,cache}
  bcache: fix use-after-free in register_bcache()
  bcache: properly initialize 'path' and 'err' in register_bcache()
  bcache: rework error unwinding in register_bcache
  bcache: use a separate data structure for the on-disk super block
  bcache: cached_dev_free needs to put the sb page
  md/raid1: introduce wait_for_serialization
  md/raid1: use bucket based mechanism for IO serialization
  md: introduce a new struct for IO serialization
  md: don't destroy serial_info_pool if serialize_policy is true
  ...

Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
"This may be the most quiet round we've had in years. I'm not
  complaining. Really not a lot to detail here, outside of spelling and
  documentation improvements/fixes, we have:

   - Allow t10-pi to be modular (Herbert)

   - Remove dead code in bfq (Alex)

   - Mark zone management requests with REQ_SYNC (Chaitanya)

   - BFQ division improvement (Wen)

   - Small series improving plugging (Pavel)"

* tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block:
  partitions/ldm: fix spelling mistake "to" -> "too"
  block, bfq: improve arithmetic division in bfq_delta()
  block/bfq: remove unused bfq_class_rt which never used
  block: mark zone-mgmt bios with REQ_SYNC
  blk-mq: Document functions for sending request
  block: Allow t10-pi to be modular
  blk-mq: optimise blk_mq_flush_plug_list()
  list: introduce list_for_each_continue()
  blk-mq: optimise rq sort function

Merge tag 'pnp-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull PNP updates from Rafael Wysocki:
"Get rid of unused variable and function (yu kuai)"

* tag 'pnp-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PNP: isapnp: remove defined but not used function 'isapnp_checksum'
  PNP: isapnp: remove set but not used variable 'checksum'

Merge tag 'devprop-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull device properties framework updates from Rafael Wysocki:
"Add support for reference properties in software nodes (Dmitry
  Torokhov) and a basic test for property entries along with fixes on
  top of it (Dmitry Torokhov, Qian Cai, Alan Maguire)"

* tag 'devprop-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  software node: introduce CONFIG_KUNIT_DRIVER_PE_TEST
  usb: dwc3: use proper initializers for property entries
  drivers/base/test: fix global-out-of-bounds error
  software node: add basic tests for property entries
  software node: remove separate handling of references
  platform/x86: intel_cht_int33fe: use inline reference properties
  software node: implement reference properties
  software node: allow embedding of small arrays into property_entry
  software node: replace is_array with is_inline

dm: fix potential for q->make_request_fn NULL pointer

Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).

Otherwise generic_make_request() will crash simply by doing:
  dmsetup create -n test
  mount /dev/dm-N /mnt

While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.

Reported-by: Stefan Bader <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab34583a ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c3860c ("dm: various cleanups to md->queue initialization code")
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>

Merge tag 'acpi-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to the most recent upstream
  revision (20200110), add new hardware support to a handful of ACPI
  drivers, make the ACPI fan driver expose power states information for
  fans, add some more quirks, fix bugs and clean up assorted things.

  Specifics:

   - Update the ACPICA code in the kernel to upstream revision 20200110
     including:
      - Update of copyright notices to 2020 (Bob Moore).
      - Dispatcher fix to always generate buffer objects for the ASL
        create_field() operator (Maximilian Luz).
      - Debugger cleanup (Colin Ian King).
      - Disassembler change to create buffer fields in
        ACPI_PARSE_LOAD_PASS1 (Erik Kaneda).
      - UNIX line ending support for non-Windows builds in acpisrc (Erik
        Kaneda).

   - Update the list of ACPICA maintainers (Rafael Wysocki).

   - Add Intel Tiger Lake ACPI device IDs to the ACPI DPTF, ACPI fan,
     int340x_thermal and intel-hid drivers (Gayatri Kammela).

   - Make the ACPI fan driver create additional sysfs attributes to
     expose power states information for fans (Srinivas Pandruvada).

   - Fix up the ACPI battery driver to deal with unexpected battery
     capacity information in a better way (Hans de Goede).

   - Add ACPI backlight quirks for Lenovo E41-25/45 and MSI MS-7721
     boards (Aaron Ma, Hans de Goede).

   - Add DMI quirk for Razer Blade Stealth 13 late 2019 lid switch to
     the ACPI button driver (Jason Ekstrand).

   - Drop TIMER_DEFERRABLE from the GHES polling mode timer function
     flags to make it run precisely at the configured time (Bhaskar
     Upadhaya).

   - Fix race condition related to the reference counting of query
     handlers in the ACPI EC driver (Rafael Wysocki).

   - Fix ACPI tools build issue (Zhengyuan Liu).

   - Replace dma_request_slave_channel() with dma_request_chan() in the
     firmware guide documentation for ACPI (Peter Ujfalusi).

   - Fix typo in a comment and clean up function parameter data type
     inconsistencies (Kacper Piwiński, Tian Tao)"

* tag 'acpi-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
  ACPICA: Update version to 20200110
  ACPICA: All acpica: Update copyrights to 2020 Including tool signons.
  apei/ghes: Do not delay GHES polling
  ACPI: button: Add DMI quirk for Razer Blade Stealth 13 late 2019 lid switch
  ACPI: PPTT: Consistently use unsigned int as parameter type
  ACPI: EC: Reference count query handlers under lock
  ACPICA: Update the list of maintainers
  ACPICA: Update version to 20191213
  ACPICA: Dispatcher: always generate buffer objects for ASL create_field() operator
  ACPICA: acpisrc: add unix line ending support for non-windows build
  ACPICA: Disassembler: create buffer fields in ACPI_PARSE_LOAD_PASS1
  ACPICA: debugger: fix spelling mistake "adress" -> "address"
  ACPI: video: Do not export a non working backlight interface on MSI MS-7721 boards
  docs: firmware-guide: ACPI: Replace dma_request_slave_channel() with dma_request_chan()
  thermal: int340x_thermal: Add Tiger Lake ACPI device IDs
  platform/x86: intel-hid: Add Tiger Lake ACPI device ID
  ACPI: fan: Add Tiger Lake ACPI device ID
  ACPI: DPTF: Add Tiger Lake ACPI device IDs
  ACPI: fan: Expose fan performance state information
  tools/power/acpi: fix compilation error
  ...

Merge tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
"These add ACPI support to the intel_idle driver along with an admin
  guide document for it, add support for CPR (Core Power Reduction) to
  the AVS (Adaptive Voltage Scaling) subsystem, add new hardware support
  in a few places, add some new sysfs attributes, debugfs files and
  tracepoints, fix bugs and clean up a bunch of things all over.

  Specifics:

   - Update the ACPI processor driver in order to export
     acpi_processor_evaluate_cst() to the code outside of it, add ACPI
     support to the intel_idle driver based on that and clean up that
     driver somewhat (Rafael Wysocki).

   - Add an admin guide document for the intel_idle driver (Rafael
     Wysocki).

   - Clean up cpuidle core and drivers, enable compilation testing for
     some of them (Benjamin Gaignard, Krzysztof Kozlowski, Rafael
     Wysocki, Yangtao Li).

   - Fix reference counting of OPP (operating performance points) table
     structures (Viresh Kumar).

   - Add support for CPR (Core Power Reduction) to the AVS (Adaptive
     Voltage Scaling) subsystem (Niklas Cassel, Colin Ian King,
     YueHaibing).

   - Add support for TigerLake Mobile and JasperLake to the Intel RAPL
     power capping driver (Zhang Rui).

   - Update cpufreq drivers:
      - Add i.MX8MP support to imx-cpufreq-dt (Anson Huang).
      - Fix usage of a macro in loongson2_cpufreq (Alexandre Oliva).
      - Fix cpufreq policy reference counting issues in s3c and
        brcmstb-avs (chenqiwu).
      - Fix ACPI table reference counting issue and HiSilicon quirk
        handling in the CPPC driver (Hanjun Guo).
      - Clean up spelling mistake in intel_pstate (Harry Pan).
      - Convert the kirkwood and tegra186 drivers to using
        devm_platform_ioremap_resource() (Yangtao Li).

   - Update devfreq core:
      - Add 'name' sysfs attribute for devfreq devices (Chanwoo Choi).
      - Clean up the handling of transition statistics and allow them to
        be reset by writing 0 to the 'trans_stat' devfreq device
        attribute in sysfs (Kamil Konieczny).
      - Add 'devfreq_summary' to debugfs (Chanwoo Choi).
      - Clean up kerneldoc comments and Kconfig indentation (Krzysztof
        Kozlowski, Randy Dunlap).

   - Update devfreq drivers:
      - Add dynamic scaling for the imx8m DDR controller and clean up
        imx8m-ddrc (Leonard Crestez, YueHaibing).
      - Fix DT node reference counting and initialization error code path
        in rk3399_dmc and add COMPILE_TEST and HAVE_ARM_SMCCC dependency
        for it (Chanwoo Choi, Yangtao Li).
      - Fix DT node reference counting in rockchip-dfi and make it use
        devm_platform_ioremap_resource() (Yangtao Li).
      - Fix excessive stack usage in exynos-ppmu (Arnd Bergmann).
      - Fix initialization error code paths in exynos-bus (Yangtao Li).
      - Clean up exynos-bus and exynos somewhat (Artur Świgoń, Krzysztof
        Kozlowski).

   - Add tracepoints for tracking usage_count updates unrelated to
     status changes in PM-runtime (Michał Mirosław).

   - Add sysfs attribute to control the "sync on suspend" behavior
     during system-wide suspend (Jonas Meurer).

   - Switch system-wide suspend tests over to 64-bit time (Alexandre
     Belloni).

   - Make wakeup sources statistics in debugfs cover deleted ones, which
     used to be the case some time ago (zhuguangqing).

   - Clean up computations carried out during hibernation, update
     messages related to hibernation and fix a spelling mistake in one
     of them (Wen Yang, Luigi Semenzato, Colin Ian King).

   - Add mailmap entry for maintainer e-mail address that has not been
     functional for several years (Rafael Wysocki)"

* tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits)
  cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
  intel_idle: Clean up irtl_2_usec()
  intel_idle: Move 3 functions closer to their callers
  intel_idle: Annotate initialization code and data structures
  intel_idle: Move and clean up intel_idle_cpuidle_devices_uninit()
  intel_idle: Rearrange intel_idle_cpuidle_driver_init()
  intel_idle: Clean up NULL pointer check in intel_idle_init()
  intel_idle: Fold intel_idle_probe() into intel_idle_init()
  intel_idle: Eliminate __setup_broadcast_timer()
  cpuidle: fix cpuidle_find_deepest_state() kerneldoc warnings
  cpuidle: sysfs: fix warnings when compiling with W=1
  cpuidle: coupled: fix warnings when compiling with W=1
  cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
  PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
  PM / devfreq: Add debugfs support with devfreq_summary file
  Documentation: admin-guide: PM: Add intel_idle document
  cpuidle: arm: Enable compile testing for some of drivers
  PM-runtime: add tracepoints for usage_count changes
  cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
  PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
  ...

security: remove EARLY_LSM_COUNT which never used

This macro has never been used since it was introduced in commit
e6b1db98cf4d5 ("security: Support early LSMs"), so it is better to remove it.

Signed-off-by: Alex Shi <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: James Morris <[email protected]>

Merge tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator updates from Mark Brown:
"Hardly anything going on in the core this time around with the
  regulator API and pretty quiet on the driver front:

   - An API for comparing regulators, useful for devices that need to
     check if supply voltages exactly match rather than just nominally
     match.

   - Conversion of several DT bindings to YAML format.

   - Conversion of I2C drivers to probe_new().

   - New drivers for Monolithic MPQ7920 and MP8859, and Rohm BD71828"

* tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (34 commits)
  dt-bindings: regulator: add document bindings for mpq7920
  regulator: core: Fix exported symbols to the exported GPL version
  regulator: mpq7920: Fix incorrect defines
  regulator: vqmmc-ipq4019: Fix platform_no_drv_owner.cocci warnings
  regulator: vctrl-regulator: Avoid deadlock getting and setting the voltage
  regulator fix for "regulator: core: Add regulator_is_equal() helper"
  regulator: core: Add regulator_is_equal() helper
  regulator: mpq7920: Convert to use .probe_new
  regulator: mpq7920: Remove unneeded fields from struct mpq7920_regulator_info
  regulator: vqmmc-ipq4019: Trivial clean up
  regulator: vqmmc-ipq4019: Remove ipq4019_regulator_remove
  regulator: bindings: Drop document bindings for mpq7920
  dt-bindings: Drop entry for Monolithic Power System, MPS
  regulator: bd718x7: Simplify the code by removing struct bd718xx_pmic_inits
  regulator: add IPQ4019 SDHCI VQMMC LDO driver
  regulator: Convert i2c drivers to use .probe_new
  regulator: mpq7920: Check the correct variable in mpq7920_regulator_register()
  regulator: mpq7920: Fix Woverflow warning on conversion
  regulator: mp8859: tidy up white space in probe
  regulator: mpq7920: add mpq7920 regulator driver
  ...

Merge tag 'spi-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi updates from Mark Brown:
"Not much going on in the core for SPI this time but a reasonable
  amount of change in the drivers:

   - Removal of dma_request_slave_channel() from Peter Ujfalusi.

   - More conversions of drivers to GPIO descriptors from Linus Walleij.

   - A big rework of the sh-msiof driver from Geert Uytterhoeven moving
     it over to the generic native chipselect support.

   - DMA support for the uniphier driver from Kunihiko Hayashi.

   - New driver support for HiSilicon v3xx SPI NOR controllers from John
     Garry"

* tag 'spi-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (52 commits)
  dt-binding: spi: add NPCM PSPI reset binding
  spi: pxa2xx: Avoid touching SSCR0_SSE on MMP2
  spi: spi-fsl-qspi: Ensure width is respected in spi-mem operations
  spi: npcm-pspi: modify reset support
  spi: npcm-pspi: improve spi transfer performance
  spi: spi-ti-qspi: fix warning
  spi: npcm-pspi: fix 16 bit send and receive support
  spi: pxa2xx: Add support for Intel Comet Lake PCH-V
  spi: fsl: simplify error path in of_fsl_spi_probe()
  spi: fsl-lpspi: fix only one cs-gpio working
  spi: spi-ti-qspi: optimize byte-transfers
  spi: spi-ti-qspi: support large flash devices
  spi: spi-qcom-qspi: Use device managed memory for clk_bulk_data
  MAINTAINERS: Add a maintainer for the HiSilicon v3xx SFC driver
  spi: Add HiSilicon v3xx SPI NOR flash controller driver
  dt-bindings: spi_atmel: add microchip,sam9x60-spi
  spi: bcm2835: Raise maximum number of slaves to 4
  spi: sh-msiof: Do not redefine STR while compile testing
  spi: rspi: Add support for GPIO chip selects
  spi: rspi: Add support for multiple native chip selects
  ...

Merge tag 'regmap-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap

Pull regmap updates from Mark Brown:
"This is quite a busy release for a subsystem that's usually very
  quiet, though still a small set of updates in the grand scheme of
  things:

   - A fix for writes to non-incrementing registers.

   - An iopoll() style helper for use with atomic safe regmaps, making
     it easier to transition from raw memory mapped I/O.

   - Some constification"

* tag 'regmap-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: fix writes to non incrementing registers
  regmap: add iopoll-like atomic polling macro
  regmap-i2c: constify regmap_bus structures
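
The iopoll-style helper mentioned in the regmap summary follows a familiar
pattern: read a register repeatedly until a condition on the value holds or a
budget runs out. The sketch below is a userspace analogue only. The names
demo_dev, demo_reg_read(), demo_read_poll() and demo_poll_after() are
hypothetical and are not the regmap API, and the attempt counter stands in
for the time budget a real atomic-safe helper would presumably use.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical device: the "ready" bit (bit 0) appears after a few reads. */
struct demo_dev {
	unsigned int reads_until_ready;
};

/* Hypothetical register read; returns 0 on success and stores the value. */
static int demo_reg_read(struct demo_dev *dev, uint32_t *val)
{
	if (dev->reads_until_ready > 0) {
		dev->reads_until_ready--;
		*val = 0;
	} else {
		*val = 1;
	}
	return 0;
}

/*
 * Poll until bit 0 is set or max_tries reads have been made.
 * Returns 0 on success, -ETIMEDOUT if the budget is exhausted, or the
 * read error if a read fails.
 */
static int demo_read_poll(struct demo_dev *dev, uint32_t *val,
			  unsigned int max_tries)
{
	unsigned int i;
	int ret;

	for (i = 0; i < max_tries; i++) {
		ret = demo_reg_read(dev, val);
		if (ret)
			return ret;
		if (*val & 1)
			return 0;
	}
	return -ETIMEDOUT;
}

/* Convenience wrapper so the behaviour is easy to exercise. */
static int demo_poll_after(unsigned int reads_until_ready,
			   unsigned int max_tries)
{
	struct demo_dev dev = { .reads_until_ready = reads_until_ready };
	uint32_t val = 0;

	return demo_read_poll(&dev, &val, max_tries);
}
```

Callers only need to distinguish "condition met" from "timed out", which is
what makes this shape a convenient replacement for open-coded polling loops
over raw memory-mapped I/O.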

KVM: x86: Use a typedef for fastop functions

Add a typedef for the fastop function prototype to make the code more
readable.
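
As a generic illustration of the readability gain (not the actual fastop
prototype, which is not shown in this message), compare a raw function
pointer parameter with a typedef'd one. All demo_* names are hypothetical.

```c
/* Without a typedef, every declaration repeats the full pointer syntax. */
static int demo_apply_raw(int (*op)(int, int), int a, int b)
{
	return op(a, b);
}

/* With a typedef, the prototype is named once and reused. */
typedef int (*demo_op_t)(int, int);

static int demo_apply(demo_op_t op, int a, int b)
{
	return op(a, b);
}

static int demo_add(int a, int b)
{
	return a + b;
}

static int demo_sub(int a, int b)
{
	return a - b;
}
```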

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Add 'else' to unify fastop and execute call path

It also helps eliminate some duplicated code.

Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: inline memslot_valid_for_gpte

The function now has a single caller, so there is no point
in keeping it separate.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Use huge pages for DAX-backed files

Walk the host page tables to identify hugepage mappings for ZONE_DEVICE
pfns, i.e. DAX pages. Explicitly query kvm_is_zone_device_pfn() when
deciding whether or not to bother walking the host page tables, as DAX
pages do not set up the head/tail infrastructure, i.e. will return false
for PageCompound() even when using huge pages.

Zap ZONE_DEVICE sptes when disabling dirty logging, e.g. if live
migration fails, to allow KVM to rebuild large pages for DAX-based
mappings. Presumably DAX favors large pages, and worst case scenario is
a minor performance hit as KVM will need to re-fault all DAX-based
pages.

Suggested-by: Barret Rhoden <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jason Zeng <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Liran Alon <[email protected]>
Cc: linux-nvdimm <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()

Remove the late "lpage is disallowed" check from set_spte() now that the
initial check is performed after acquiring mmu_lock. Fold the guts of
the remaining helper, __mmu_gfn_lpage_is_disallowed(), into
kvm_mmu_hugepage_adjust() to eliminate the unnecessary slot !NULL check.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()

Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB
mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a
need to pre-calculate the max mapping level. Co-locating all hugepage
checks eliminates a memslot lookup, at the cost of performing the
__mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock.

The latency of lpage_is_disallowed() is likely negligible relative to
the rest of the code run while holding mmu_lock, and can be offset to
some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in
set_spte() in a future patch. Eliminating the check in set_spte() is
made possible by performing the initial lpage_is_disallowed() checks
while holding mmu_lock.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Zap any compound page when collapsing sptes

Zap any compound page, e.g. THP or HugeTLB pages, when zapping sptes
that can potentially be converted to huge sptes after disabling dirty
logging on the associated memslot. Note, this approach could result in
false positives, e.g. if a random compound page is mapped into the
guest, but mapping non-huge compound pages into the guest is far from
the norm, and toggling dirty logging is not a frequent operation.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)

Remove logic to retrieve the original gfn now that HugeTLB mappings are
identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts
the level or gfn.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings

Remove KVM's HugeTLB specific logic and instead rely on walking the host
page tables (already done for THP) to identify HugeTLB mappings.
Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling
find_vma() for all hugepage compatible page faults, and simplifies KVM's
page fault code by consolidating all hugepage adjustments into a common
helper.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Drop level optimization from fast_page_fault()

Remove fast_page_fault()'s optimization to stop the shadow walk if the
iterator level drops below the intended map level. The intended map
level is only accurate for HugeTLB mappings (THP mappings are detected
after fast_page_fault()), i.e. it's not required for correctness, and
a future patch will also move HugeTLB mapping detection to after
fast_page_fault().

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Walk host page tables to find THP mappings

Explicitly walk the host page tables to identify THP mappings instead
of relying solely on the metadata in struct page. This sets the stage
for using a common method of identifying huge mappings regardless of the
underlying implementation (HugeTLB vs THP vs DAX), and hopefully avoids
the pitfalls of relying on metadata to identify THP mappings, e.g. see
commit 169226f7e0d2 ("mm: thp: handle page cache THP correctly in
PageTransCompoundMap") and the need for KVM to explicitly check for a
THP compound page. KVM will also naturally work with 1gb THP pages, if
they are ever supported.

Walking the tables for THP mappings is likely marginally slower than
querying metadata, but a future patch will reuse the walk to identify
HugeTLB mappings, at which point eliminating the existing VMA lookup for
HugeTLB will make this a net positive.

Cc: Andrea Arcangeli <[email protected]>
Cc: Barret Rhoden <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Refactor THP adjust to prep for changing query

Refactor transparent_hugepage_adjust() in preparation for walking the
host page tables to identify hugepage mappings, initially for THP pages,
and eventually for HugeTLB and DAX-backed pages as well. The latter
cases support 1gb pages, i.e. the adjustment logic needs access to the
max allowed level.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/mm: Introduce lookup_address_in_mm()

Add a helper, lookup_address_in_mm(), to traverse the page tables of a
given mm struct. KVM will use the helper to retrieve the host mapping
level, e.g. 4k vs. 2mb vs. 1gb, of a compound (or DAX-backed) page
without having to resort to implementation specific metadata. E.g. KVM
currently uses different logic for HugeTLB vs. THP, and would add a
third variant for DAX-backed files.

Cc: Dan Williams <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Play nice with read-only memslots when querying host page size

Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
this allows x86 to create large mappings for read-only memslots that
are backed by HugeTLB mappings.

Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
support") states "If the largepage contains write-protected pages, a
large pte is not used.", but "write-protected" refers to pages that are
temporarily read-only, e.g. read-only memslots didn't even exist at the
time.

Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
[Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Use vcpu-specific gva->hva translation when querying host page size

Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
correct set of memslots is used when handling x86 page faults in SMM.

Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

mm: thp: KVM: Explicitly check for THP when populating secondary MMU

Add a helper, is_transparent_hugepage(), to explicitly check whether a
compound page is a THP and use it when populating KVM's secondary MMU.
The explicit check fixes a bug where a remapped compound page, e.g. for
an XDP Rx socket, is mapped into a KVM guest and is mistaken for a THP,
which results in KVM incorrectly creating a huge page in its secondary
MMU.

Fixes: 936a5fe6e6148 ("thp: kvm mmu transparent hugepage support")
Reported-by: [email protected]
Cc: Andrea Arcangeli <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Enforce max_level on HugeTLB mappings

Limit KVM's mapping level for HugeTLB based on its calculated max_level.
The max_level check prior to invoking host_mapping_level() only filters
out the case where KVM cannot create a 2mb mapping, it doesn't handle
the scenario where KVM can create a 2mb but not 1gb mapping, and the
host is using a 1gb HugeTLB mapping.
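
The fix amounts to clamping the host's mapping level to KVM's own limit. A
minimal sketch, assuming levels are small integers ordered by page size; the
enum values and demo_mapping_level() are hypothetical names, not KVM's:

```c
/* Hypothetical level encoding, ordered by page size. */
enum demo_level {
	DEMO_LEVEL_4K = 1,
	DEMO_LEVEL_2M = 2,
	DEMO_LEVEL_1G = 3,
};

/* Never map larger than max_level, whatever the host mapping uses. */
static int demo_mapping_level(int host_level, int max_level)
{
	return host_level < max_level ? host_level : max_level;
}
```

With a 1gb host mapping and a 2mb limit the result is a 2mb mapping, which is
the case the earlier "can we map 2mb at all" filter did not cover.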

Fixes: 2f57b7051fe8 ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Return immediately if __kvm_gfn_to_hva_cache_init() fails

Check the result of __kvm_gfn_to_hva_cache_init() and return immediately
instead of relying on the kvm_is_error_hva() check to detect errors so
that it's abundantly clear KVM intends to immediately bail on an error.

Note, the hva check is still mandatory to handle errors on subsequent
calls with the same generation. Similarly, always return -EFAULT on
error so that multiple (bad) calls for a given generation will get the
same result, e.g. on an illegal gfn wrap, propagating the return from
__kvm_gfn_to_hva_cache_init() would cause the initial call to return
-EINVAL and subsequent calls to return -EFAULT.
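
The reasoning can be modelled with a toy cache. Everything below is a
hypothetical stand-in (struct demo_cache, DEMO_BAD_HVA and the demo_*
functions), not the KVM code: the init helper reports -EINVAL on a wrap, but
the accessor bails immediately and always returns -EFAULT, so the first and
later calls for a generation agree.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_BAD_HVA UINT64_MAX

struct demo_cache {
	uint64_t generation;	/* generation the cache was filled for */
	uint64_t hva;		/* DEMO_BAD_HVA if initialization failed */
};

/* Hypothetical init: fails with -EINVAL when gfn + nr_pages wraps. */
static int demo_cache_init(struct demo_cache *c, uint64_t gen,
			   uint64_t gfn, uint64_t nr_pages)
{
	c->generation = gen;
	c->hva = DEMO_BAD_HVA;
	if (gfn + nr_pages < gfn)
		return -EINVAL;
	c->hva = gfn << 12;	/* pretend translation */
	return 0;
}

static int demo_cached_access(struct demo_cache *c, uint64_t gen,
			      uint64_t gfn, uint64_t nr_pages)
{
	if (c->generation != gen) {
		/* Bail at once, but report -EFAULT, not the init's code. */
		if (demo_cache_init(c, gen, gfn, nr_pages))
			return -EFAULT;
	}

	/* Still needed: a later call in the same generation skips init. */
	if (c->hva == DEMO_BAD_HVA)
		return -EFAULT;

	return 0;
}

/* Issue n identical bad calls and return the result of the last one. */
static int demo_nth_bad_call(int n)
{
	struct demo_cache c = { .generation = 0, .hva = DEMO_BAD_HVA };
	int ret = 0;
	int i;

	for (i = 0; i < n; i++)
		ret = demo_cached_access(&c, 1, UINT64_MAX, 2);
	return ret;
}
```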

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Clean up __kvm_gfn_to_hva_cache_init() and its callers

Barret reported a (technically benign) bug where nr_pages_avail can be
accessed without being initialized if gfn_to_hva_many() fails.

  virt/kvm/kvm_main.c:2193:13: warning: 'nr_pages_avail' may be
  used uninitialized in this function [-Wmaybe-uninitialized]

Rather than simply squashing the warning by initializing nr_pages_avail,
fix the underlying issues by reworking __kvm_gfn_to_hva_cache_init() to
return immediately instead of continuing on.  Now that all callers check
the result and/or bail immediately on a bad hva, there's no need to
explicitly nullify the memslot on error.

Reported-by: Barret Rhoden <[email protected]>
Fixes: f1b9dd5eb86c ("kvm: Disallow wraparound in kvm_gfn_to_hva_cache_init")
Cc: Jim Mattson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Check for a bad hva before dropping into the ghc slow path

When reading/writing using the guest/host cache, check for a bad hva
before checking for a NULL memslot, which triggers the slow path for
handling cross-page accesses. Because the memslot is nullified on error
by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
crossing into a new page, then the kvm_{read,write}_guest() slow path
could potentially write/access the first chunk prior to detecting the
bad hva.

Arguably, performing a partial access is semantically correct from an
architectural perspective, but that behavior is certainly not intended.
In the original implementation, memslot was not explicitly nullified
and therefore the partial access behavior varied based on whether the
memslot itself was null, or if the hva was simply bad. The current
behavior was introduced as a seemingly unintentional side effect in
commit f1b9dd5eb86c ("kvm: Disallow wraparound in
kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it sit seems
prudent to clear ghc->memslot in the event of an error".

Regardless of intent, the partial access is dependent on _not_ checking
the result of the cache initialization, which is arguably a bug in its
own right, at best simply weird.

Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
Cc: Jim Mattson <[email protected]>
Cc: Andrew Honig <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm/x86: export kvm_vector_hashing_enabled() is unnecessary

kvm_vector_hashing_enabled() is only called within the kvm.ko module.

Signed-off-by: Peng Hao <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: remove duplicated segment cache clear

vmx_set_segment() clears segment cache unconditionally, so we should not
clear it again by calling vmx_segment_cache_clear().

Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Adding 'else' to reduce checking.

These two conditions are in conflict, adding 'else' to reduce checking.

Signed-off-by: Haiwei Li <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests

According to section "Checks on Guest Control Registers, Debug Registers, and
MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
of nested guests:

If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
field must be 0.

In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the
latter synthesizes a #GP if any bit in the high dword in the former is set.
Hence this field needs to be checked in software.
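
A minimal sketch of the check as described above, in plain C; the function
name demo_guest_dr7_valid() and its parameters are hypothetical and do not
reflect how KVM structures the code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * If the "load debug controls" VM-entry control is 1, bits 63:32 of the
 * DR7 field must be 0. Otherwise the field is not subject to this check.
 */
static bool demo_guest_dr7_valid(bool load_debug_controls, uint64_t dr7)
{
	if (!load_debug_controls)
		return true;

	return (dr7 >> 32) == 0;
}
```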

Signed-off-by: Krish Sadhukhan <[email protected]>
Reviewed-by: Karl Heubaum <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: remove unused guest_enter

After commit 61bd0f66ff92 ("KVM: PPC: Book3S HV: Fix guest time accounting
with VIRT_CPU_ACCOUNTING_GEN"), no one uses this function anymore, so it is
better to remove it.

Signed-off-by: Alex Shi <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Move running VCPU from ARM to common code

For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.

Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code. This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.
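
The tracking idea can be sketched in userspace with a thread-local pointer
standing in for per-CPU storage. This is an analogy under stated assumptions:
struct demo_vcpu and the demo_* functions are hypothetical, and the sketch
models only the load/put pair, not the preempt notifiers that the real
mechanism also has to update.

```c
#include <stddef.h>

struct demo_vcpu {
	int id;
};

/* Thread-local stand-in for a per-CPU "currently loaded vcpu" pointer. */
static _Thread_local struct demo_vcpu *demo_running_vcpu;

static void demo_vcpu_load(struct demo_vcpu *vcpu)
{
	demo_running_vcpu = vcpu;
}

static void demo_vcpu_put(void)
{
	demo_running_vcpu = NULL;
}

static struct demo_vcpu *demo_get_running_vcpu(void)
{
	return demo_running_vcpu;
}

/* Id of the currently loaded vcpu, or -1 if none is loaded. */
static int demo_running_id(void)
{
	struct demo_vcpu *vcpu = demo_get_running_vcpu();

	return vcpu ? vcpu->id : -1;
}

/* Load a vcpu, observe it through the tracker, then put it again. */
static int demo_id_while_loaded(int id)
{
	struct demo_vcpu vcpu = { .id = id };
	int seen;

	demo_vcpu_load(&vcpu);
	seen = demo_running_id();
	demo_vcpu_put();
	return seen;
}
```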

Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Drop x86_set_memory_region()

The helper x86_set_memory_region() is only used in vmx_set_tss_addr()
and kvm_arch_destroy_vm(). Push the lock up into the callers in both cases. With
that, drop x86_set_memory_region().

This prepares to allow __x86_set_memory_region() to return the mapped
HVA, because the HVA will need to be protected by the lock too even
after __x86_set_memory_region() returns.

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Don't take srcu lock in init_rmode_identity_map()

We've already got the slots_lock, so we should be safe.

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Add build-time error check on kvm_run size

It's already going to reach 2400 Bytes (which is over half of page
size on 4K page archs), so maybe it's good to have this build-time
check in case it overflows when adding new fields.
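
A build-time size check of this kind can be sketched with C11's
_Static_assert. The struct layout, DEMO_PAGE_SIZE and the one-page bound are
assumptions made for illustration; the message does not state the exact bound
or macro the kernel check uses.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical structure, sized at 2400 bytes like the figure quoted above. */
struct demo_run {
	unsigned char header[64];
	unsigned char payload[2336];
};

/* Compilation fails here if a new field pushes the struct past one page. */
_Static_assert(sizeof(struct demo_run) <= DEMO_PAGE_SIZE,
	       "demo_run must fit in a single page");
```

The benefit is that growth past the limit is caught when the offending field
is added rather than at run time.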

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Remove kvm_read_guest_atomic()

Remove kvm_read_guest_atomic() because it's not used anywhere.

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm/hyper-v: remove stale evmcs_already_enabled check from nested_enable_evmcs()

In nested_enable_evmcs() evmcs_already_enabled check doesn't really do
anything: controls are already sanitized and we return '0' regardless.
Just drop the check.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Perform non-canonical checks in 32-bit KVM

Remove the CONFIG_X86_64 condition from the low level non-canonical
helpers to effectively enable non-canonical checks on 32-bit KVM.
Non-canonical checks are performed by hardware if the CPU *supports*
64-bit mode, whether or not the CPU is actually in 64-bit mode is
irrelevant.

For the most part, skipping non-canonical checks on 32-bit KVM is ok-ish
because 32-bit KVM always (hopefully) drops bits 63:32 of whatever value
it's checking before propagating it to hardware, and architecturally,
the expected behavior for the guest is a bit of a grey area since the
vCPU itself doesn't support 64-bit mode.  I.e. a 32-bit KVM guest can
observe the missed checks in several paths, e.g. INVVPID and VM-Enter,
but it's debatable whether or not the missed checks constitute a bug
because technically the vCPU doesn't support 64-bit mode.

The primary motivation for enabling the non-canonical checks is defense
in depth.  As mentioned above, a guest can trigger a missed check via
INVVPID or VM-Enter.  INVVPID is straightforward as it takes a 64-bit
virtual address as part of its 128-bit INVVPID descriptor and fails if
the address is non-canonical, even if INVVPID is executed in 32-bit PM.
Nested VM-Enter is a bit more convoluted as it requires the guest to
write natural width VMCS fields via memory accesses and then VMPTRLD the
VMCS, but it's still possible.  In both cases, KVM is saved from a true
bug only because its flows that propagate values to hardware (correctly)
take "unsigned long" parameters and so drop bits 63:32 of the bad value.
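
To make the two effects concrete, here is a userspace sketch of a canonical
address check and of the truncation that currently hides bad values. The
demo_* names are hypothetical, and vaddr_bits is assumed to be between 1 and
64 (48 is used as a typical width).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for a virtual address width of vaddr_bits when
 * bits 63..(vaddr_bits - 1) are all equal, i.e. all zeros or all ones.
 */
static bool demo_is_noncanonical(uint64_t la, unsigned int vaddr_bits)
{
	uint64_t upper = la >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper != 0 && upper != all_ones;
}

/* What a 32-bit "unsigned long" parameter keeps of a 64-bit value. */
static uint64_t demo_truncate_to_32(uint64_t la)
{
	return (uint32_t)la;
}
```

A value such as 0x0000800000000000 is non-canonical at 48 bits, yet its
truncated form is 0 and passes, which is why an explicit check on the full
value is the more robust place to reject it.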

Explicitly performing the non-canonical checks makes it less likely that
a bad value will be propagated to hardware, e.g. in the INVVPID case,
if __invvpid() didn't implicitly drop bits 63:32 then KVM would BUG() on
the resulting unexpected INVVPID failure due to hardware rejecting the
non-canonical address.

The only downside to enabling the non-canonical checks is that it adds a
relatively small amount of overhead, but the affected flows are not hot
paths, i.e. the overhead is negligible.

Note, KVM technically could gate the non-canonical checks on 32-bit KVM
with static_cpu_has(X86_FEATURE_LM), but on bare metal that's an even
bigger waste of code for everyone except the 0.00000000000001% of the
population running on Yonah, and nested 32-bit on 64-bit already fudges
things with respect to 64-bit CPU behavior.

Signed-off-by: Sean Christopherson <[email protected]>
[Also do so in nested_vmx_check_host_state as reported by Krish. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: WARN on failure to set IA32_PERF_GLOBAL_CTRL

Writes to MSR_CORE_PERF_GLOBAL_CONTROL should never fail if the VM-exit
and VM-entry controls are exposed to L1. Promote the checks to perform a
full WARN if kvm_set_msr() fails and remove the now unused macro
SET_MSR_OR_WARN().

Suggested-by: Sean Christopherson <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Oliver Upton <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Remove unused ctxt param from emulator's FPU accessors

Remove an unused struct x86_emulate_ctxt * param from low level helpers
used to access guest FPU state. The unused param was left behind by
commit 6ab0b9feb82a ("x86,kvm: remove KVM emulator get_fpu / put_fpu").

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Revert "KVM: X86: Fix fpu state crash in kvm guest"

Reload the current thread's FPU state, which contains the guest's FPU
state, to the CPU registers if necessary during vcpu_enter_guest().
TIF_NEED_FPU_LOAD can be set any time control is transferred out of KVM,
e.g. if I/O is triggered during a KVM call to get_user_pages() or if a
softirq occurs while KVM is scheduled in.

Moving the handling of TIF_NEED_FPU_LOAD from vcpu_enter_guest() to
kvm_arch_vcpu_load(), effectively kvm_sched_in(), papered over a bug
where kvm_put_guest_fpu() failed to account for TIF_NEED_FPU_LOAD.  The
easiest way to trigger the kvm_put_guest_fpu() bug was to run with involuntary
preemption enabled, thus handling TIF_NEED_FPU_LOAD during kvm_sched_in()
made the bug go away.  But, removing the handling in vcpu_enter_guest()
exposed KVM to the rare case of a softirq triggering kernel_fpu_begin()
between vcpu_load() and vcpu_enter_guest().

Now that kvm_{load,put}_guest_fpu() correctly handle TIF_NEED_FPU_LOAD,
revert the commit to both restore the vcpu_enter_guest() behavior and
eliminate the superfluous switch_fpu_return() in kvm_arch_vcpu_load().

Note, leaving the handling in kvm_arch_vcpu_load() isn't wrong per se,
but it is unnecessary, and most critically, makes it extremely difficult
to find bugs such as the kvm_put_guest_fpu() issue due to shrinking the
window where a softirq can corrupt state.

A sample trace triggered by warning if TIF_NEED_FPU_LOAD is set while
vcpu state is loaded:

<IRQ>
  gcmaes_crypt_by_sg.constprop.12+0x26e/0x660
  ? 0xffffffffc024547d
  ? __qdisc_run+0x83/0x510
  ? __dev_queue_xmit+0x45e/0x990
  ? ip_finish_output2+0x1a8/0x570
  ? fib4_rule_action+0x61/0x70
  ? fib4_rule_action+0x70/0x70
  ? fib_rules_lookup+0x13f/0x1c0
  ? helper_rfc4106_decrypt+0x82/0xa0
  ? crypto_aead_decrypt+0x40/0x70
  ? crypto_aead_decrypt+0x40/0x70
  ? crypto_aead_decrypt+0x40/0x70
  ? esp_output_tail+0x8f4/0xa5a [esp4]
  ? skb_ext_add+0xd3/0x170
  ? xfrm_input+0x7a6/0x12c0
  ? xfrm4_rcv_encap+0xae/0xd0
  ? xfrm4_transport_finish+0x200/0x200
  ? udp_queue_rcv_one_skb+0x1ba/0x460
  ? udp_unicast_rcv_skb.isra.63+0x72/0x90
  ? __udp4_lib_rcv+0x51b/0xb00
  ? ip_protocol_deliver_rcu+0xd2/0x1c0
  ? ip_local_deliver_finish+0x44/0x50
  ? ip_local_deliver+0xe0/0xf0
  ? ip_protocol_deliver_rcu+0x1c0/0x1c0
  ? ip_rcv+0xbc/0xd0
  ? ip_rcv_finish_core.isra.19+0x380/0x380
  ? __netif_receive_skb_one_core+0x7e/0x90
  ? netif_receive_skb_internal+0x3d/0xb0
  ? napi_gro_receive+0xed/0x150
  ? 0xffffffffc0243c77
  ? net_rx_action+0x149/0x3b0
  ? __do_softirq+0xe4/0x2f8
  ? handle_irq_event_percpu+0x6a/0x80
  ? irq_exit+0xe6/0xf0
  ? do_IRQ+0x7f/0xd0
  ? common_interrupt+0xf/0xf
  </IRQ>
  ? irq_entries_start+0x20/0x660
  ? vmx_get_interrupt_shadow+0x2f0/0x710 [kvm_intel]
  ? kvm_set_msr_common+0xfc7/0x2380 [kvm]
  ? recalibrate_cpu_khz+0x10/0x10
  ? ktime_get+0x3a/0xa0
  ? kvm_arch_vcpu_ioctl_run+0x107/0x560 [kvm]
  ? kvm_init+0x6bf/0xd00 [kvm]
  ? __seccomp_filter+0x7a/0x680
  ? do_vfs_ioctl+0xa4/0x630
  ? security_file_ioctl+0x32/0x50
  ? ksys_ioctl+0x60/0x90
  ? __x64_sys_ioctl+0x16/0x20
  ? do_syscall_64+0x5f/0x1a0
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 9564a1ccad733a90 ]---

This reverts commit e751732486eb3f159089a64d1901992b1357e7cc.

Fixes: e751732486eb3 ("KVM: X86: Fix fpu state crash in kvm guest")
Reported-by: Derek Yerger <[email protected]>
Reported-by: [email protected]
Cc: Wanpeng Li <[email protected]>
Cc: Thomas Lambertz <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Ensure guest's FPU state is loaded when accessing for emulation

Lock the FPU regs and reload the current thread's FPU state, which holds
the guest's FPU state, to the CPU registers if necessary prior to
accessing guest FPU state as part of emulation. kernel_fpu_begin() can
be called from softirq context, therefore KVM must ensure softirqs are
disabled (locking the FPU regs disables softirqs) when touching CPU FPU
state.

Note, for all intents and purposes this reverts commit 6ab0b9feb82a7
("x86,kvm: remove KVM emulator get_fpu / put_fpu"), but at the time it
was applied, removing get/put_fpu() was correct. The re-introduction
of {get,put}_fpu() is necessitated by the deferring of FPU state load.

Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Handle TIF_NEED_FPU_LOAD in kvm_{load,put}_guest_fpu()

Handle TIF_NEED_FPU_LOAD similar to how fpu__copy() handles the flag
when duplicating FPU state to a new task struct. TIF_NEED_FPU_LOAD can
be set any time control is transferred out of KVM, be it voluntarily,
e.g. if I/O is triggered during a KVM call to get_user_pages, or
involuntarily, e.g. if softirq runs after an IRQ occurs. Therefore,
KVM must account for TIF_NEED_FPU_LOAD whenever it is (potentially)
accessing CPU FPU state.
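
One way to picture the rule is a toy model in which the flag says where the
valid copy of the FPU state lives. This is a sketch under assumptions, not
the kernel code: struct demo_cpu, its fields and the demo_* functions are
hypothetical, and need_fpu_load merely stands in for TIF_NEED_FPU_LOAD.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FPU_SIZE 8

struct demo_cpu {
	uint8_t regs[DEMO_FPU_SIZE];		/* live CPU FPU registers */
	uint8_t task_state[DEMO_FPU_SIZE];	/* task's in-memory FPU state */
	bool need_fpu_load;			/* stand-in for TIF_NEED_FPU_LOAD */
};

/*
 * Save the current FPU state into dst. If the flag is set, the registers
 * may have been clobbered and the in-memory copy is the valid one;
 * otherwise the registers hold the valid state.
 */
static void demo_save_fpu(const struct demo_cpu *cpu, uint8_t *dst)
{
	if (cpu->need_fpu_load)
		memcpy(dst, cpu->task_state, DEMO_FPU_SIZE);
	else
		memcpy(dst, cpu->regs, DEMO_FPU_SIZE);
}

/* First byte saved when registers hold 0xAA and memory holds 0x11. */
static uint8_t demo_saved_byte(bool need_fpu_load)
{
	struct demo_cpu cpu;
	uint8_t saved[DEMO_FPU_SIZE];

	memset(cpu.regs, 0xAA, sizeof(cpu.regs));
	memset(cpu.task_state, 0x11, sizeof(cpu.task_state));
	cpu.need_fpu_load = need_fpu_load;

	demo_save_fpu(&cpu, saved);
	return saved[0];
}
```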

Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>