Git Repo - linux.git/log

kbuild: make multiple directory targets work

Currently, the single-target build does not work when two
or more sub-directories are given:

  $ make fs/ kernel/ lib/
    CALL    scripts/checksyscalls.sh
    CALL    scripts/atomic/check-atomics.sh
    DESCEND  objtool
  make[2]: Nothing to be done for 'kernel/'.
  make[2]: Nothing to be done for 'fs/'.
  make[2]: Nothing to be done for 'lib/'.

Make it work properly.

Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

Merge branch 'bpf-xsk-fixes'

Maciej Fijalkowski says:

====================
Cameron reported [0] that on fresh bpf-next he could not run multiple
xdpsock instances in Tx-only mode on single network interface with i40e
driver.

Turns out that Maxim's series [1] which was adding RCU protection around
ndo_xsk_wakeup added check against the __I40E_CONFIG_BUSY being set on
pf->state within i40e_xsk_wakeup() - if it's set, return -ENETDOWN.
Since this bit is set per PF when UMEM is being enabled/disabled, the
situation Cameron stumbled upon was that when he launched second xdpsock
instance, second UMEM was being registered, hence set __I40E_CONFIG_BUSY
which is now observed by first xdpsock and therefore xdpsock's kick_tx()
gets -ENETDOWN as errno.

-ENETDOWN currently is not allowed in kick_tx(), so we were exiting the
first application. Such exit means also XDP program being unloaded and
its dedicated resources, which caused an -ENXIO being return in the
second xdpsock instance.

Let's fix the issue from both sides - protect ourselves from future
xdpsock crashes by allowing for -ENETDOWN errno being set in kick_tx()
(patch 3) and from driver side, return -EAGAIN for the case where PF is
busy (patch 1).

Remove also doubled variable from xdpsock_user.c (patch 2).

Note that ixgbe seems not to be affected since UMEM registration sets
the busy/disable bit per ring, not per PF.

[0]: https://www.spinics.net/lists/xdp-newbies/msg01558.html
[1]: https://lore.kernel.org/netdev/20191217162023 [email protected]/
====================

Signed-off-by: Daniel Borkmann <[email protected]>

samples: bpf: Allow for -ENETDOWN in xdpsock

ndo_xsk_wakeup() can return -ENETDOWN and there's no particular reason
to bail the whole application out on that case. Let's check in kick_tx()
whether errno was set to mentioned value and basically allow application
to further process frames.

Fixes: 248c7f9c0e21 ("samples/bpf: convert xdpsock to use libbpf for AF_XDP access")
Reported-by: Cameron Elliott <[email protected]>
Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

samples: bpf: Drop doubled variable declaration in xdpsock

Seems that by accident there is a doubled declaration of global variable
opt_xdp_bind_flags in xdpsock_user.c. The second one is uninitialized so
compiler was simply ignoring it.

To keep things clean, drop the doubled variable.

Fixes: c543f5469822 ("samples/bpf: add unaligned chunks mode support to xdpsock")
Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

i40e: Relax i40e_xsk_wakeup's return value when PF is busy

Return -EAGAIN instead of -ENETDOWN to provide a slightly milder
information to user space so that an application will know to retry the
syscall when __I40E_CONFIG_BUSY bit is set on pf->state.

Fixes: b3873a5be757 ("net/i40e: Fix concurrency issues between config flow and XSK")
Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

tools/bpf/runqslower: Rebuild libbpf.a on libbpf source change

Add missing dependency of $(BPFOBJ) to $(LIBBPF_SRC), so that running make
in runqslower/ will rebuild libbpf.a when there is change in libbpf/.

Fixes: 9c01546d26d2 ("tools/bpf: Add runqslower tool to tools/bpf")
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'pwm/for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm updates from Thierry Reding:
"Mostly cleanups and minor improvements with some new chip support for
  some drivers"

* tag 'pwm/for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (37 commits)
  pwm: Remove set but not set variable 'pwm'
  pwm: sun4i: Initialize variables before use
  pwm: stm32: Remove automatic output enable
  pwm: sun4i: Narrow scope of local variable
  pwm: bcm2835: Allow building for ARCH_BRCMSTB
  pwm: imx27: Eliminate error message for defer probe
  pwm: sun4i: Fix inconsistent IS_ERR and PTR_ERR
  pwm: sun4i: Move pwm_calculate() out of spin_lock()
  pwm: omap-dmtimer: Allow compiling with COMPILE_TEST
  pwm: omap-dmtimer: put_device() after of_find_device_by_node()
  pwm: omap-dmtimer: Simplify error handling
  pwm: omap-dmtimer: Remove PWM chip in .remove before making it unfunctional
  pwm: Implement tracing for .get_state() and .apply_state()
  pwm: rcar: Document inability to set duty_cycle = 0
  pwm: rcar: Drop useless call to pwm_get_state()
  pwm: Fix minor Kconfig whitespace issues
  pwm: atmel: Implement .get_state()
  pwm: atmel: Use register accessors for channels
  pwm: atmel: Document known weaknesses of both hardware and software
  pwm: atmel: Replace loop in prescale calculation by ad-hoc calculation
  ...

Merge tag 'dmaengine-fix-5.6-rc1' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
"Fixes for:

   - Documentation build error fix

   - Fix dma_request_chan() error return

   - Remove unneeded conversion in idxd driver

   - Fix pointer check for dma_async_device_channel_register()

   - Fix slave-channel symlink cleanup"

* tag 'dmaengine-fix-5.6-rc1' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: Cleanups for the slave <-> channel symlink support
  dmaengine: fix null ptr check for __dma_async_device_channel_register()
  dmaengine: idxd: fix boolconv.cocci warnings
  dmaengine: Fix return value for dma_request_chan() in case of failure
  dmaengine: doc: Properly indent metadata title

PCI/ATS: Use PF PASID for VFs

Per PCIe r5.0, sec 9.3.7.14, if a PF implements the PASID Capability, the
PF PASID configuration is shared by its VFs, and VFs must not implement
their own PASID Capability. But commit 751035b8dc06 ("PCI/ATS: Cache PASID
Capability offset") changed pci_max_pasids() and pci_pasid_features() to
use the PASID Capability of the VF device instead of the associated PF
device. This leads to IOMMU bind failures when pci_max_pasids() and
pci_pasid_features() are called for VFs.

In pci_max_pasids() and pci_pasid_features(), always use the PF PASID
Capability.

Fixes: 751035b8dc06 ("PCI/ATS: Cache PASID Capability offset")
Link: https://lore.kernel.org/r/fe891f9755cb18349389609e7fed9940fc5b081a.1580325170.git.sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
CC: [email protected] # v5.5+

Merge tag 'iommu-updates-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

- Allow compiling the ARM-SMMU drivers as modules.

- Fixes and cleanups for the ARM-SMMU drivers and io-pgtable code
   collected by Will Deacon. The merge-commit (6855d1ba7537) has all the
   details.

- Cleanup of the iommu_put_resv_regions() call-backs in various
   drivers.

- AMD IOMMU driver cleanups.

- Update for the x2APIC support in the AMD IOMMU driver.

- Preparation patches for Intel VT-d nested mode support.

- RMRR and identity domain handling fixes for the Intel VT-d driver.

- More small fixes and cleanups.

* tag 'iommu-updates-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (87 commits)
  iommu/amd: Remove the unnecessary assignment
  iommu/vt-d: Remove unnecessary WARN_ON_ONCE()
  iommu/vt-d: Unnecessary to handle default identity domain
  iommu/vt-d: Allow devices with RMRRs to use identity domain
  iommu/vt-d: Add RMRR base and end addresses sanity check
  iommu/vt-d: Mark firmware tainted if RMRR fails sanity check
  iommu/amd: Remove unused struct member
  iommu/amd: Replace two consecutive readl calls with one readq
  iommu/vt-d: Don't reject Host Bridge due to scope mismatch
  PCI/ATS: Add PASID stubs
  iommu/arm-smmu-v3: Return -EBUSY when trying to re-add a device
  iommu/arm-smmu-v3: Improve add_device() error handling
  iommu/arm-smmu-v3: Use WRITE_ONCE() when changing validity of an STE
  iommu/arm-smmu-v3: Add second level of context descriptor table
  iommu/arm-smmu-v3: Prepare for handling arm_smmu_write_ctx_desc() failure
  iommu/arm-smmu-v3: Propagate ssid_bits
  iommu/arm-smmu-v3: Add support for Substream IDs
  iommu/arm-smmu-v3: Add context descriptor tables allocators
  iommu/arm-smmu-v3: Prepare arm_smmu_s1_cfg for SSID support
  ACPI/IORT: Parse SSID property of named component node
  ...

Merge tag 'for-linus-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

- fix a bug introduced in 5.5 in the Xen gntdev driver

- fix the Xen balloon driver when running on ancient Xen versions

- allow Xen stubdoms to control interrupt enable flags of
   passed-through PCI cards

- release resources in Xen backends under memory pressure

* tag 'for-linus-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/blkback: Consistently insert one empty line between functions
  xen/blkback: Remove unnecessary static variable name prefixes
  xen/blkback: Squeeze page pools if a memory pressure is detected
  xenbus/backend: Protect xenbus callback with lock
  xenbus/backend: Add memory pressure handler callback
  xen/gntdev: Do not use mm notifiers with autotranslating guests
  xen/balloon: Support xend-based toolstack take two
  xen-pciback: optionally allow interrupt enable flag writes

Merge tag 'devicetree-fixes-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

- Fix incorrect $id paths in schemas

- Two fixes for Intel LGM SoC binding schemas

* tag 'devicetree-fixes-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: Fix paths in schema $id fields
  dt-bindings: PCI: intel: Fix dt_binding_check compilation failure
  dt-bindings: phy: Fix errors in intel,lgm-emmc-phy example

Allow git builds of Sphinx

When using a non-release version of Sphinx, from a local build (with
improvements for kernel doc handling, why not),

sphinx-build --version

reports versions of the form

sphinx-build 3.0.0+/4703d9119972

i.e. base version, a plus symbol, slash, and the start of the git hash
of whatever repository the command is run in (no, not the hash that
was used to build Sphinx!).

This patch fixes the installation check in sphinx-pre-install to
recognise such version output.

Signed-off-by: Stephen Kitt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

Merge tag 's390-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Vasily Gorbik:
"The second round of s390 fixes and features for 5.6:

   - Add KPROBES_ON_FTRACE support

   - Add EP11 AES secure keys support

   - PAES rework and prerequisites for paes-s390 ciphers selftests

   - Fix page table upgrade for hugetlbfs"

* tag 's390-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pkey/zcrypt: Support EP11 AES secure keys
  s390/zcrypt: extend EP11 card and queue sysfs attributes
  s390/zcrypt: add new low level ep11 functions support file
  s390/zcrypt: ep11 structs rework, export zcrypt_send_ep11_cprb
  s390/zcrypt: enable card/domain autoselect on ep11 cprbs
  s390/crypto: enable clear key values for paes ciphers
  s390/pkey: Add support for key blob with clear key value
  s390/crypto: Rework on paes implementation
  s390: support KPROBES_ON_FTRACE
  s390/mm: fix dynamic pagetable upgrade for hugetlbfs

Documentation: changes.rst: update several outdated project URLs

Update projects URLs in the changes.rst file.

Signed-off-by: Randy Dunlap <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Acked-by: Theodore Ts'o <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

Documentation: build warnings related to missing blank lines after explicit markups has been fixed

Fix for several documentation build warnings related to missing blank lines
after explicit mark up.

Exact warning message:
WARNING: Explicit markup ends without a blank line; unexpected unindent.

Signed-off-by: Sameer Rahmani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

mailmap: add entry for Tiezhu Yang

Add an entry to connect all my email addresses.

Signed-off-by: Tiezhu Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

Documentation/ko_KR/howto: Update a broken link

Signed-off-by: SeongJae Park <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

Documentation/ko_KR/howto: Update broken web addresses

Commit 0ea6e6112219 ("Documentation: update broken web addresses.")
removed a link to 'http://patchwork.ozlabs.org' in howto, but the change
has not applied to the Korean translation. This commit simply applies
the change to the Korean translation. The link is restored now, though.

Signed-off-by: SeongJae Park <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

docs/locking: Fix outdated section names

Commit 2e4f5382d12a ("locking/doc: Rename LOCK/UNLOCK to
ACQUIRE/RELEASE") has not appied to 'spinlock.rst'. This commit updates
the doc for the change.

Signed-off-by: SeongJae Park <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>

KVM: vmx: delete meaningless vmx_decache_cr0_guest_bits() declaration

The function vmx_decache_cr0_guest_bits() is only called below its
implementation. So this is meaningless and should be removed.

Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Mark CR4.UMIP as reserved based on associated CPUID bit

Re-add code to mark CR4.UMIP as reserved if UMIP is not supported by the
host.  The UMIP handling was unintentionally dropped during a recent
refactoring.

Not flagging CR4.UMIP allows the guest to set its CR4.UMIP regardless of
host support or userspace desires.  On CPUs with UMIP support, including
emulated UMIP, this allows the guest to enable UMIP against the wishes
of the userspace VMM.  On CPUs without any form of UMIP, this results in
a failed VM-Enter due to invalid guest state.

Fixes: 345599f9a2928 ("KVM: x86: Add macro to ensure reserved cr4 bits checks stay in sync")
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86: vmxfeatures: rename features for consistency with KVM and manual

Three of the feature bits in vmxfeatures.h have names that are different
from the Intel SDM. The names have been adjusted recently in KVM but they
were using the old name in the tip tree's x86/cpu branch. Adjust for
consistency.

Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvm-s390-next-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: Fixes and cleanups for 5.6
- fix register corruption
- ENOTSUPP/EOPNOTSUPP mixed
- reset cleanups/fixes
- selftests

KVM: SVM: relax conditions for allowing MSR_IA32_SPEC_CTRL accesses

Userspace that does not know about the AMD_IBRS bit might still
allow the guest to protect itself with MSR_IA32_SPEC_CTRL using
the Intel SPEC_CTRL bit. However, svm.c disallows this and will
cause a #GP in the guest when writing to the MSR. Fix this by
loosening the test and allowing the Intel CPUID bit, and in fact
allow the AMD_STIBP bit as well since it allows writing to
MSR_IA32_SPEC_CTRL too.

Reported-by: Zhiyi Guo <[email protected]>
Analyzed-by: Dr. David Alan Gilbert <[email protected]>
Analyzed-by: Laszlo Ersek <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Fix perfctr WRMSR for running counters

Correct the logic in intel_pmu_set_msr() for fixed and general purpose
counters. This was recently changed to set pmc->counter without taking
in to account the value of pmc_read_counter() which will be incorrect if
the counter is currently running and non-zero; this changes back to the
old logic which accounted for the value of currently running counters.

Signed-off-by: Eric Hankland <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm/hyper-v: don't allow to turn on unsupported VMX controls for nested guests

Sane L1 hypervisors are not supposed to turn any of the unsupported VMX
controls on for its guests and nested_vmx_check_controls() checks for
that. This is, however, not the case for the controls which are supported
on the host but are missing in enlightened VMCS and when eVMCS is in use.

It would certainly be possible to add these missing checks to
nested_check_vm_execution_controls()/_vm_exit_controls()/.. but it seems
preferable to keep eVMCS-specific stuff in eVMCS and reduce the impact on
non-eVMCS guests by doing less unrelated checks. Create a separate
nested_evmcs_check_controls() for this purpose.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()

With fine grained VMX feature enablement QEMU>=4.2 tries to do KVM_SET_MSRS
with default (matching CPU model) values and in case eVMCS is also enabled,
fails.

It would be possible to drop VMX feature filtering completely and make
this a guest's responsibility: if it decides to use eVMCS it should know
which fields are available and which are not. Hyper-V mostly complies to
this, however, there are some problematic controls:
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL

which Hyper-V enables. As there are no corresponding fields in eVMCS, we
can't handle this properly in KVM. This is a Hyper-V issue.

Move VMX controls sanitization from nested_enable_evmcs() to vmx_get_msr(),
and do the bare minimum (only clear controls which are known to cause issues).
This allows userspace to keep setting controls it wants and at the same
time hides them from the guest.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: mmu: Separate generating and setting mmio ptes

Separate the functions for generating MMIO page table entries from the
function that inserts them into the paging structure. This refactoring
will facilitate changes to the MMU sychronization model to use atomic
compare / exchanges (which are not guaranteed to succeed) instead of a
monolithic MMU lock.

No functional change expected.

Tested by running kvm-unit-tests on an Intel Haswell machine. This
commit introduced no new failures.

Signed-off-by: Ben Gardon <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: mmu: Replace unsigned with unsigned int for PTE access

There are several functions which pass an access permission mask for
SPTEs as an unsigned. This works, but checkpatch complains about it.
Switch the occurrences of unsigned to unsigned int to satisfy checkpatch.

No functional change expected.

Tested by running kvm-unit-tests on an Intel Haswell machine. This
commit introduced no new failures.

Signed-off-by: Ben Gardon <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Remove stale comment from nested_vmx_load_cr3()

The blurb pertaining to the return value of nested_vmx_load_cr3() no
longer matches reality, remove it entirely as the behavior it is
attempting to document is quite obvious when reading the actual code.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Krish Sadhukhan <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MIPS: Fold comparecount_func() into comparecount_wakeup()

Fold kvm_mips_comparecount_func() into kvm_mips_comparecount_wakeup() to
eliminate the nondescript function name as well as its unnecessary cast
of a vcpu to "unsigned long" and back to a vcpu. Presumably func() was
used as a callback at some point during pre-upstream development, as
wakeup() is the only user of func() and has been the only user since
both with introduced by commit 669e846e6c4e ("KVM/MIPS32: MIPS arch
specific APIs for KVM").

Cc: Davidlohr Bueso <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MIPS: Fix a build error due to referencing not-yet-defined function

Hoist kvm_mips_comparecount_wakeup() above its only user,
kvm_arch_vcpu_create() to fix a compilation error due to referencing an
undefined function.

Fixes: d11dfed5d700 ("KVM: MIPS: Move all vcpu init code into kvm_arch_vcpu_create()")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

x86/kvm: do not setup pv tlb flush when not paravirtualized

kvm_setup_pv_tlb_flush will waste memory and print a misguiding message
when KVM paravirtualization is not available.

Intel SDM says that the when cpuid is used with EAX higher than the
maximum supported value for basic of extended function, the data for the
highest supported basic function will be returned.

So, in some systems, kvm_arch_para_features will return bogus data,
causing kvm_setup_pv_tlb_flush to detect support for pv tlb flush.

Testing for kvm_para_available will work as it checks for the hypervisor
signature.

Besides, when the "nopv" command line parameter is used, it should not
continue as well, as kvm_guest_init will no be called in that case.

Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: fix overflow of zero page refcount with ksm running

We are testing Virtual Machine with KSM on v5.4-rc2 kernel,
and found the zero_page refcount overflow.
The cause of refcount overflow is increased in try_async_pf
(get_user_page) without being decreased in mmu_set_spte()
while handling ept violation.
In kvm_release_pfn_clean(), only unreserved page will call
put_page. However, zero page is reserved.
So, as well as creating and destroy vm, the refcount of
zero page will continue to increase until it overflows.

step1:
echo 10000 > /sys/kernel/pages_to_scan/pages_to_scan
echo 1 > /sys/kernel/pages_to_scan/run
echo 1 > /sys/kernel/pages_to_scan/use_zero_pages

step2:
just create several normal qemu kvm vms.
And destroy it after 10s.
Repeat this action all the time.

After a long period of time, all domains hang because
of the refcount of zero page overflow.

Qemu print error log as follow:
â€¦
error: kvm run failed Bad address
EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd
ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4
EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f7070 00000037
IDT=     000f70ae 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 <8b> 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04
â€¦

Meanwhile, a kernel warning is departed.

[40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30
[40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G           OE     5.2.0-rc2 #5
[40914.836415] RIP: 0010:try_get_page+0x1f/0x30
[40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 <0f> 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 0
0 00 00 00 48 8b 47 08 a8
[40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286
[40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000
[40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440
[40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000
[40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8
[40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440
[40914.836423] FS:  00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000
[40914.836425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0
[40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[40914.836428] Call Trace:
[40914.836433]  follow_page_pte+0x302/0x47b
[40914.836437]  __get_user_pages+0xf1/0x7d0
[40914.836441]  ? irq_work_queue+0x9/0x70
[40914.836443]  get_user_pages_unlocked+0x13f/0x1e0
[40914.836469]  __gfn_to_pfn_memslot+0x10e/0x400 [kvm]
[40914.836486]  try_async_pf+0x87/0x240 [kvm]
[40914.836503]  tdp_page_fault+0x139/0x270 [kvm]
[40914.836523]  kvm_mmu_page_fault+0x76/0x5e0 [kvm]
[40914.836588]  vcpu_enter_guest+0xb45/0x1570 [kvm]
[40914.836632]  kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm]
[40914.836645]  kvm_vcpu_ioctl+0x26e/0x5d0 [kvm]
[40914.836650]  do_vfs_ioctl+0xa9/0x620
[40914.836653]  ksys_ioctl+0x60/0x90
[40914.836654]  __x64_sys_ioctl+0x16/0x20
[40914.836658]  do_syscall_64+0x5b/0x180
[40914.836664]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[40914.836666] RIP: 0033:0x7fb61cb6bfc7

Signed-off-by: LinFeng <[email protected]>
Signed-off-by: Zhuang Yanying <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

qed: Fix timestamping issue for L2 unicast ptp packets.

commit cedeac9df4b8 ("qed: Add support for Timestamping the unicast
PTP packets.") handles the timestamping of L4 ptp packets only.
This patch adds driver changes to detect/timestamp both L2/L4 unicast
PTP packets.

Fixes: cedeac9df4b8 ("qed: Add support for Timestamping the unicast PTP packets.")
Signed-off-by: Sudarsana Reddy Kalluru <[email protected]>
Signed-off-by: Ariel Elior <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

KVM: x86: Take a u64 when checking for a valid dr7 value

Take a u64 instead of an unsigned long in kvm_dr7_valid() to fix a build
warning on i386 due to right-shifting a 32-bit value by 32 when checking
for bits being set in dr7[63:32].

Alternatively, the warning could be resolved by rewriting the check to
use an i386-friendly method, but taking a u64 fixes another oddity on
32-bit KVM. Beause KVM implements natural width VMCS fields as u64s to
avoid layout issues between 32-bit and 64-bit, a devious guest can stuff
vmcs12->guest_dr7 with a 64-bit value even when both the guest and host
are 32-bit kernels. KVM eventually drops vmcs12->guest_dr7[63:32] when
propagating vmcs12->guest_dr7 to vmcs02, but ideally KVM would not rely
on that behavior for correctness.

Cc: Jim Mattson <[email protected]>
Cc: Krish Sadhukhan <[email protected]>
Fixes: ecb697d10f70 ("KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests")
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: use raw clock values consistently

Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
the default kvmclock_offset for the VM was still based on the monotonic
clock and, if the raw clock drifted enough from the monotonic clock,
this could cause a negative system_time to be written to the guest's
struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
observe a negative time value) it hangs.

There is another thing to be careful about: getboottime64 returns the
host boot time with tkr_mono frequency, and subtracting the tkr_raw-based
kvmclock value will cause the wallclock to be off if tkr_raw drifts
from tkr_mono.  To avoid this, compute the wallclock delta from the
current time instead of being clever and using getboottime64.

Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: [email protected]
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: reorganize pvclock_gtod_data members

We will need a copy of tk->offs_boot in the next patch. Store it and
cleanup the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot
included, store the raw value in struct pvclock_clock and sum it in
do_monotonic_raw and do_realtime. tk->tkr_xxx.xtime_nsec also moves
to struct pvclock_clock.

While at it, fix a (usually harmless) typo in do_monotonic_raw, which
was using gtod->clock.shift instead of gtod->raw_clock.shift.

Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: [email protected]
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: delete meaningless nested_vmx_run() declaration

The function nested_vmx_run() declaration is below its implementation. So
this is meaningless and should be removed.

Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: allow AVIC without split irqchip

SVM is now able to disable AVIC dynamically whenever the in-kernel PIT sets
up an ack notifier, so we can enable it even if in-kernel IOAPIC/PIC/PIT
are in use.

Signed-off-by: Paolo Bonzini <[email protected]>

kvm: ioapic: Lazy update IOAPIC EOI

In-kernel IOAPIC does not receive EOI with AMD SVM AVIC
since the processor accelerate write to APIC EOI register and
does not trap if the interrupt is edge-triggered.

Workaround this by lazy check for pending APIC EOI at the time when
setting new IOPIC irq, and update IOAPIC EOI if no pending APIC EOI.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: ioapic: Refactor kvm_ioapic_update_eoi()

Refactor code for handling IOAPIC EOI for subsequent patch.
There is no functional change.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: i8254: Deactivate APICv when using in-kernel PIT re-injection mode.

AMD SVM AVIC accelerates EOI write and does not trap. This causes
in-kernel PIT re-injection mode to fail since it relies on irq-ack
notifier mechanism. So, APICv is activated only when in-kernel PIT
is in discard mode e.g. w/ qemu option:

-global kvm-pit.lost_tick_policy=discard

Also, introduce APICV_INHIBIT_REASON_PIT_REINJ bit to be used for this
reason.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

svm: Temporarily deactivate AVIC during ExtINT handling

AMD AVIC does not support ExtINT. Therefore, AVIC must be temporary
deactivated and fall back to using legacy interrupt injection via vINTR
and interrupt window.

Also, introduce APICV_INHIBIT_REASON_IRQWIN to be used for this reason.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
[Rename svm_request_update_avic to svm_toggle_avic_for_extint. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

svm: Deactivate AVIC when launching guest with nested SVM support

Since AVIC does not currently work w/ nested virtualization,
deactivate AVIC for the guest if setting CPUID Fn80000001_ECX[SVM]
(i.e. indicate support for SVM, which is needed for nested virtualization).
Also, introduce a new APICV_INHIBIT_REASON_NESTED bit to be used for
this reason.

Suggested-by: Alexander Graf <[email protected]>
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: hyperv: Use APICv update request interface

Since disabling APICv has to be done for all vcpus on AMD-based
system, adopt the newly introduced kvm_request_apicv_update()
interface, and introduce a new APICV_INHIBIT_REASON_HYPERV.

Also, remove the kvm_vcpu_deactivate_apicv() since no longer used.

Cc: Roman Kagan <[email protected]>
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

svm: Add support for dynamic APICv

Add necessary logics to support (de)activate AVIC at runtime.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Introduce x86 ops hook for pre-update APICv

AMD SVM AVIC needs to update APIC backing page mapping before changing
APICv mode. Introduce struct kvm_x86_ops.pre_update_apicv_exec_ctrl
function hook to be called prior KVM APICv update request to each vcpu.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Introduce APICv x86 ops for checking APIC inhibit reasons

Inibit reason bits are used to determine if APICv deactivation is
applicable for a particular hardware virtualization architecture.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: svm: avic: Add support for dynamic setup/teardown of virtual APIC backing page

Re-factor avic_init_access_page() to avic_update_access_page() since
activate/deactivate AVIC requires setting/unsetting the memory region used
for virtual APIC backing page (APIC_ACCESS_PAGE_PRIVATE_MEMSLOT).

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: svm: Add support to (de)activate posted interrupts

Introduce interface for (de)activate posted interrupts, and
implement SVM hooks to toggle AMD IOMMU guest virtual APIC mode.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Add APICv (de)activate request trace points

Add trace points when sending request to (de)activate APICv.

Suggested-by: Alexander Graf <[email protected]>
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Add support for dynamic APICv activation

Certain runtime conditions require APICv to be temporary deactivated
during runtime. The current implementation only support run-time
deactivation of APICv when Hyper-V SynIC is enabled, which is not
temporary.

In addition, for AMD, when APICv is (de)activated at runtime,
all vcpus in the VM have to operate in the same mode. Thus the
requesting vcpu must notify the others.

So, introduce the following:
* A new KVM_REQ_APICV_UPDATE request bit
* Interfaces to request all vcpus to update APICv status
* A new interface to update APICV-related parameters for each vcpu

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: remove get_enable_apicv from kvm_x86_ops

It is unused now.

Signed-off-by: Paolo Bonzini <[email protected]>

kvm: x86: Introduce APICv inhibit reason bits

There are several reasons in which a VM needs to deactivate APICv
e.g. disable APICv via parameter during module loading, or when
enable Hyper-V SynIC support. Additional inhibit reasons will be
introduced later on when dynamic APICv is supported,

Introduce KVM APICv inhibit reason bits along with a new variable,
apicv_inhibit_reasons, to help keep track of APICv state for each VM,

Initially, the APICV_INHIBIT_REASON_DISABLE bit is used to indicate
the case where APICv is disabled during KVM module load.
(e.g. insmod kvm_amd avic=0 or insmod kvm_intel enable_apicv=0).

Signed-off-by: Suravee Suthikulpanit <[email protected]>
[Do not use get_enable_apicv; consider irqchip_split in svm.c. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: lapic: Introduce APICv update helper function

Re-factor code into a helper function for setting lapic parameters when
activate/deactivate APICv, and export the function for subsequent usage.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'macb-TSO-bug-fixes'

Harini Katakam says:

====================
macb: TSO bug fixes

An IP errata was recently discovered when testing TSO enabled versions
with perf test tools where a false amba error is reported by the IP.
Some ways to reproduce would be to use iperf or applications with payload
descriptor sizes very close to 16K. Once the error is observed TXERR (or
bit 6 of ISR) will be constantly triggered leading to a series of tx path
error handling and clean up. Workaround the same by limiting this size to
0x3FC0 as recommended by Cadence. There was no performance impact on 1G
system that I tested with.

Note on patch 1: The alignment code may be unused but leaving it there
in case anyone is using UFO.

Added Fixes tag to patch 1.
====================

Signed-off-by: David S. Miller <[email protected]>

net: macb: Limit maximum GEM TX length in TSO

GEM_MAX_TX_LEN currently resolves to 0x3FF8 for any IP version supporting
TSO with full 14bits of length field in payload descriptor. But an IP
errata causes false amba_error (bit 6 of ISR) when length in payload
descriptors is specified above 16387. The error occurs because the DMA
falsely concludes that there is not enough space in SRAM for incoming
payload. These errors were observed continuously under stress of large
packets using iperf on a version where SRAM was 16K for each queue. This
errata will be documented shortly and affects all versions since TSO
functionality was added. Hence limit the max length to 0x3FC0 (rounded).

Signed-off-by: Harini Katakam <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: macb: Remove unnecessary alignment check for TSO

The IP TSO implementation does NOT require the length to be a
multiple of 8. That is only a requirement for UFO as per IP
documentation. Hence, exit macb_features_check function in the
beginning if the protocol is not UDP. Only when it is UDP,
proceed further to the alignment checks. Update comments to
reflect the same. Also remove dead code checking for protocol
TCP when calculating header length.

Fixes: 1629dd4f763c ("cadence: Add LSO support.")
Signed-off-by: Harini Katakam <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bonding/alb: properly access headers in bond_alb_xmit()

syzbot managed to send an IPX packet through bond_alb_xmit()
and af_packet and triggered a use-after-free.

First, bond_alb_xmit() was using ipx_hdr() helper to reach
the IPX header, but ipx_hdr() was using the transport offset
instead of the network offset. In the particular syzbot
report transport offset was 0xFFFF

This patch removes ipx_hdr() since it was only (mis)used from bonding.

Then we need to make sure IPv4/IPv6/IPX headers are pulled
in skb->head before dereferencing anything.

BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
(if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
[<ffffffff8441fc42>] __dump_stack lib/dump_stack.c:17 [inline]
[<ffffffff8441fc42>] dump_stack+0x14d/0x20b lib/dump_stack.c:53
[<ffffffff81a7dec4>] print_address_description+0x6f/0x20b mm/kasan/report.c:282
[<ffffffff81a7e0ec>] kasan_report_error mm/kasan/report.c:380 [inline]
[<ffffffff81a7e0ec>] kasan_report mm/kasan/report.c:438 [inline]
[<ffffffff81a7e0ec>] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
[<ffffffff81a7dc4f>] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
[<ffffffff82c8c00a>] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
[<ffffffff82c60c74>] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
[<ffffffff82c60c74>] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
[<ffffffff83baa558>] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
[<ffffffff83baa558>] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
[<ffffffff83baa558>] xmit_one net/core/dev.c:3611 [inline]
[<ffffffff83baa558>] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
[<ffffffff83bacf35>] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
[<ffffffff83bae3a8>] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
[<ffffffff84339189>] packet_snd net/packet/af_packet.c:3226 [inline]
[<ffffffff84339189>] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
[<ffffffff83b1ac0c>] sock_sendmsg_nosec net/socket.c:673 [inline]
[<ffffffff83b1ac0c>] sock_sendmsg+0x12c/0x160 net/socket.c:684
[<ffffffff83b1f5a2>] __sys_sendto+0x262/0x380 net/socket.c:1996
[<ffffffff83b1f700>] SYSC_sendto net/socket.c:2008 [inline]
[<ffffffff83b1f700>] SyS_sendto+0x40/0x60 net/socket.c:2004

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Jay Vosburgh <[email protected]>
Cc: Veaceslav Falico <[email protected]>
Cc: Andy Gospodarek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

devlink: report 0 after hitting end in region read

commit fdd41ec21e15 ("devlink: Return right error code in case of errors
for region read") modified the region read code to report errors
properly in unexpected cases.

In the case where the start_offset and ret_offset match, it unilaterally
converted this into an error. This causes an issue for the "dump"
version of the command. In this case, the devlink region dump will
always report an invalid argument:

000000000000ffd0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
000000000000ffe0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
devlink answers: Invalid argument
000000000000fff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

This occurs because the expected flow for the dump is to return 0 after
there is no further data.

The simplest fix would be to stop converting the error code to -EINVAL
if start_offset == ret_offset. However, avoid unnecessary work by
checking for when start_offset is larger than the region size and
returning 0 upfront.

Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read")
Signed-off-by: Jacob Keller <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: dec: tulip: Fix length mask in receive length calculation

The receive frame length calculation uses a wrong mask to calculate the
length of the received frames.

Per spec table 4-1 the length is contained in the FL (Frame Length)
field in bits 30:16.

This didn't show up as an issue so far since frames were limited to
1500 bytes which falls within the 11 bit window.

Signed-off-by: Moritz Fischer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'wg-fixes'

Jason A. Donenfeld says:

====================
wireguard fixes for 5.6-rc1

Here are fixes for WireGuard before 5.6-rc1 is tagged. It includes:

1) A fix for a UaF (caused by kmalloc failing during a very small
   allocation) that syzkaller found, from Eric Dumazet.

2) A fix for a deadlock that syzkaller found, along with an additional
   selftest to ensure that the bug fix remains correct, from me.

3) Two little fixes/cleanups to the selftests from Krzysztof Kozlowski
   and me.
====================

Signed-off-by: David S. Miller <[email protected]>

wireguard: selftests: tie socket waiting to target pid

Without this, we wind up proceeding too early sometimes when the
previous process has just used the same listening port. So, we tie the
listening socket query to the specific pid we're interested in.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

wireguard: selftests: cleanup CONFIG_ENABLE_WARN_DEPRECATED

CONFIG_ENABLE_WARN_DEPRECATED is gone since commit 771c035372a0
("deprecate the '__deprecated' attribute warnings entirely and for
good").

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

wireguard: selftests: ensure non-addition of peers with failed precomputation

Ensure that peers with low order points are ignored, both in the case
where we already have a device private key and in the case where we do
not. This adds points that naturally give a zero output.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

wireguard: noise: reject peers with low order public keys

Our static-static calculation returns a failure if the public key is of
low order. We check for this when peers are added, and don't allow them
to be added if they're low order, except in the case where we haven't
yet been given a private key. In that case, we would defer the removal
of the peer until we're given a private key, since at that point we're
doing new static-static calculations which incur failures we can act on.
This meant, however, that we wound up removing peers rather late in the
configuration flow.

Syzkaller points out that peer_remove calls flush_workqueue, which in
turn might then wait for sending a handshake initiation to complete.
Since handshake initiation needs the static identity lock, holding the
static identity lock while calling peer_remove can result in a rare
deadlock. We have precisely this case in this situation of late-stage
peer removal based on an invalid public key. We can't drop the lock when
removing, because then incoming handshakes might interact with a bogus
static-static calculation.

While the band-aid patch for this would involve breaking up the peer
removal into two steps like wg_peer_remove_all does, in order to solve
the locking issue, there's actually a much more elegant way of fixing
this:

If the static-static calculation succeeds with one private key, it
*must* succeed with all others, because all 32-byte strings map to valid
private keys, thanks to clamping. That means we can get rid of this
silly dance and locking headaches of removing peers late in the
configuration flow, and instead just reject them early on, regardless of
whether the device has yet been assigned a private key. For the case
where the device doesn't yet have a private key, we safely use zeros
just for the purposes of checking for low order points by way of
checking the output of the calculation.

The following PoC will trigger the deadlock:

ip link add wg0 type wireguard
ip addr add 10.0.0.1/24 dev wg0
ip link set wg0 up
ping -f 10.0.0.2 &
while true; do
        wg set wg0 private-key /dev/null peer AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= allowed-ips 10.0.0.0/24 endpoint 10.0.0.3:1234
        wg set wg0 private-key <(echo AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=)
done

[    0.949105] ======================================================
[    0.949550] WARNING: possible circular locking dependency detected
[    0.950143] 5.5.0-debug+ #18 Not tainted
[    0.950431] ------------------------------------------------------
[    0.950959] wg/89 is trying to acquire lock:
[    0.951252] ffff8880333e2128 ((wq_completion)wg-kex-wg0){+.+.}, at: flush_workqueue+0xe3/0x12f0
[    0.951865]
[    0.951865] but task is already holding lock:
[    0.952280] ffff888032819bc0 (&wg->static_identity.lock){++++}, at: wg_set_device+0x95d/0xcc0
[    0.953011]
[    0.953011] which lock already depends on the new lock.
[    0.953011]
[    0.953651]
[    0.953651] the existing dependency chain (in reverse order) is:
[    0.954292]
[    0.954292] -> #2 (&wg->static_identity.lock){++++}:
[    0.954804]        lock_acquire+0x127/0x350
[    0.955133]        down_read+0x83/0x410
[    0.955428]        wg_noise_handshake_create_initiation+0x97/0x700
[    0.955885]        wg_packet_send_handshake_initiation+0x13a/0x280
[    0.956401]        wg_packet_handshake_send_worker+0x10/0x20
[    0.956841]        process_one_work+0x806/0x1500
[    0.957167]        worker_thread+0x8c/0xcb0
[    0.957549]        kthread+0x2ee/0x3b0
[    0.957792]        ret_from_fork+0x24/0x30
[    0.958234]
[    0.958234] -> #1 ((work_completion)(&peer->transmit_handshake_work)){+.+.}:
[    0.958808]        lock_acquire+0x127/0x350
[    0.959075]        process_one_work+0x7ab/0x1500
[    0.959369]        worker_thread+0x8c/0xcb0
[    0.959639]        kthread+0x2ee/0x3b0
[    0.959896]        ret_from_fork+0x24/0x30
[    0.960346]
[    0.960346] -> #0 ((wq_completion)wg-kex-wg0){+.+.}:
[    0.960945]        check_prev_add+0x167/0x1e20
[    0.961351]        __lock_acquire+0x2012/0x3170
[    0.961725]        lock_acquire+0x127/0x350
[    0.961990]        flush_workqueue+0x106/0x12f0
[    0.962280]        peer_remove_after_dead+0x160/0x220
[    0.962600]        wg_set_device+0xa24/0xcc0
[    0.962994]        genl_rcv_msg+0x52f/0xe90
[    0.963298]        netlink_rcv_skb+0x111/0x320
[    0.963618]        genl_rcv+0x1f/0x30
[    0.963853]        netlink_unicast+0x3f6/0x610
[    0.964245]        netlink_sendmsg+0x700/0xb80
[    0.964586]        __sys_sendto+0x1dd/0x2c0
[    0.964854]        __x64_sys_sendto+0xd8/0x1b0
[    0.965141]        do_syscall_64+0x90/0xd9a
[    0.965408]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[    0.965769]
[    0.965769] other info that might help us debug this:
[    0.965769]
[    0.966337] Chain exists of:
[    0.966337]   (wq_completion)wg-kex-wg0 --> (work_completion)(&peer->transmit_handshake_work) --> &wg->static_identity.lock
[    0.966337]
[    0.967417]  Possible unsafe locking scenario:
[    0.967417]
[    0.967836]        CPU0                    CPU1
[    0.968155]        ----                    ----
[    0.968497]   lock(&wg->static_identity.lock);
[    0.968779]                                lock((work_completion)(&peer->transmit_handshake_work));
[    0.969345]                                lock(&wg->static_identity.lock);
[    0.969809]   lock((wq_completion)wg-kex-wg0);
[    0.970146]
[    0.970146]  *** DEADLOCK ***
[    0.970146]
[    0.970531] 5 locks held by wg/89:
[    0.970908]  #0: ffffffff827433c8 (cb_lock){++++}, at: genl_rcv+0x10/0x30
[    0.971400]  #1: ffffffff82743480 (genl_mutex){+.+.}, at: genl_rcv_msg+0x642/0xe90
[    0.971924]  #2: ffffffff827160c0 (rtnl_mutex){+.+.}, at: wg_set_device+0x9f/0xcc0
[    0.972488]  #3: ffff888032819de0 (&wg->device_update_lock){+.+.}, at: wg_set_device+0xb0/0xcc0
[    0.973095]  #4: ffff888032819bc0 (&wg->static_identity.lock){++++}, at: wg_set_device+0x95d/0xcc0
[    0.973653]
[    0.973653] stack backtrace:
[    0.973932] CPU: 1 PID: 89 Comm: wg Not tainted 5.5.0-debug+ #18
[    0.974476] Call Trace:
[    0.974638]  dump_stack+0x97/0xe0
[    0.974869]  check_noncircular+0x312/0x3e0
[    0.975132]  ? print_circular_bug+0x1f0/0x1f0
[    0.975410]  ? __kernel_text_address+0x9/0x30
[    0.975727]  ? unwind_get_return_address+0x51/0x90
[    0.976024]  check_prev_add+0x167/0x1e20
[    0.976367]  ? graph_lock+0x70/0x160
[    0.976682]  __lock_acquire+0x2012/0x3170
[    0.976998]  ? register_lock_class+0x1140/0x1140
[    0.977323]  lock_acquire+0x127/0x350
[    0.977627]  ? flush_workqueue+0xe3/0x12f0
[    0.977890]  flush_workqueue+0x106/0x12f0
[    0.978147]  ? flush_workqueue+0xe3/0x12f0
[    0.978410]  ? find_held_lock+0x2c/0x110
[    0.978662]  ? lock_downgrade+0x6e0/0x6e0
[    0.978919]  ? queue_rcu_work+0x60/0x60
[    0.979166]  ? netif_napi_del+0x151/0x3b0
[    0.979501]  ? peer_remove_after_dead+0x160/0x220
[    0.979871]  peer_remove_after_dead+0x160/0x220
[    0.980232]  wg_set_device+0xa24/0xcc0
[    0.980516]  ? deref_stack_reg+0x8e/0xc0
[    0.980801]  ? set_peer+0xe10/0xe10
[    0.981040]  ? __ww_mutex_check_waiters+0x150/0x150
[    0.981430]  ? __nla_validate_parse+0x163/0x270
[    0.981719]  ? genl_family_rcv_msg_attrs_parse+0x13f/0x310
[    0.982078]  genl_rcv_msg+0x52f/0xe90
[    0.982348]  ? genl_family_rcv_msg_attrs_parse+0x310/0x310
[    0.982690]  ? register_lock_class+0x1140/0x1140
[    0.983049]  netlink_rcv_skb+0x111/0x320
[    0.983298]  ? genl_family_rcv_msg_attrs_parse+0x310/0x310
[    0.983645]  ? netlink_ack+0x880/0x880
[    0.983888]  genl_rcv+0x1f/0x30
[    0.984168]  netlink_unicast+0x3f6/0x610
[    0.984443]  ? netlink_detachskb+0x60/0x60
[    0.984729]  ? find_held_lock+0x2c/0x110
[    0.984976]  netlink_sendmsg+0x700/0xb80
[    0.985220]  ? netlink_broadcast_filtered+0xa60/0xa60
[    0.985533]  __sys_sendto+0x1dd/0x2c0
[    0.985763]  ? __x64_sys_getpeername+0xb0/0xb0
[    0.986039]  ? sockfd_lookup_light+0x17/0x160
[    0.986397]  ? __sys_recvmsg+0x8c/0xf0
[    0.986711]  ? __sys_recvmsg_sock+0xd0/0xd0
[    0.987018]  __x64_sys_sendto+0xd8/0x1b0
[    0.987283]  ? lockdep_hardirqs_on+0x39b/0x5a0
[    0.987666]  do_syscall_64+0x90/0xd9a
[    0.987903]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[    0.988223] RIP: 0033:0x7fe77c12003e
[    0.988508] Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 4
[    0.989666] RSP: 002b:00007fffada2ed58 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[    0.990137] RAX: ffffffffffffffda RBX: 00007fe77c159d48 RCX: 00007fe77c12003e
[    0.990583] RDX: 0000000000000040 RSI: 000055fd1d38e020 RDI: 0000000000000004
[    0.991091] RBP: 000055fd1d38e020 R08: 000055fd1cb63358 R09: 000000000000000c
[    0.991568] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000002c
[    0.992014] R13: 0000000000000004 R14: 000055fd1d38e020 R15: 0000000000000001

Signed-off-by: Jason A. Donenfeld <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

wireguard: allowedips: fix use-after-free in root_remove_peer_lists

In the unlikely case a new node could not be allocated, we need to
remove @newnode from @peer->allowedips_list before freeing it.

syzbot reported:

BUG: KASAN: use-after-free in __list_del_entry_valid+0xdc/0xf5 lib/list_debug.c:54
Read of size 8 at addr ffff88809881a538 by task syz-executor.4/30133

CPU: 0 PID: 30133 Comm: syz-executor.4 Not tainted 5.5.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x197/0x210 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
__kasan_report.cold+0x1b/0x32 mm/kasan/report.c:506
kasan_report+0x12/0x20 mm/kasan/common.c:639
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
__list_del_entry_valid+0xdc/0xf5 lib/list_debug.c:54
__list_del_entry include/linux/list.h:132 [inline]
list_del include/linux/list.h:146 [inline]
root_remove_peer_lists+0x24f/0x4b0 drivers/net/wireguard/allowedips.c:65
wg_allowedips_free+0x232/0x390 drivers/net/wireguard/allowedips.c:300
wg_peer_remove_all+0xd5/0x620 drivers/net/wireguard/peer.c:187
wg_set_device+0xd01/0x1350 drivers/net/wireguard/netlink.c:542
genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:672
____sys_sendmsg+0x753/0x880 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0x105/0x1d0 net/socket.c:2430
__do_sys_sendmsg net/socket.c:2439 [inline]
__se_sys_sendmsg net/socket.c:2437 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45b399
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f99a9bcdc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f99a9bce6d4 RCX: 000000000045b399
RDX: 0000000000000000 RSI: 0000000020001340 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
R13: 00000000000009ba R14: 00000000004cb2b8 R15: 0000000000000009

Allocated by task 30103:
save_stack+0x23/0x90 mm/kasan/common.c:72
set_track mm/kasan/common.c:80 [inline]
__kasan_kmalloc mm/kasan/common.c:513 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
kasan_kmalloc+0x9/0x10 mm/kasan/common.c:527
kmem_cache_alloc_trace+0x158/0x790 mm/slab.c:3551
kmalloc include/linux/slab.h:556 [inline]
kzalloc include/linux/slab.h:670 [inline]
add+0x70a/0x1970 drivers/net/wireguard/allowedips.c:236
wg_allowedips_insert_v4+0xf6/0x160 drivers/net/wireguard/allowedips.c:320
set_allowedip drivers/net/wireguard/netlink.c:343 [inline]
set_peer+0xfb9/0x1150 drivers/net/wireguard/netlink.c:468
wg_set_device+0xbd4/0x1350 drivers/net/wireguard/netlink.c:591
genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:672
____sys_sendmsg+0x753/0x880 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0x105/0x1d0 net/socket.c:2430
__do_sys_sendmsg net/socket.c:2439 [inline]
__se_sys_sendmsg net/socket.c:2437 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 30103:
save_stack+0x23/0x90 mm/kasan/common.c:72
set_track mm/kasan/common.c:80 [inline]
kasan_set_free_info mm/kasan/common.c:335 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
__cache_free mm/slab.c:3426 [inline]
kfree+0x10a/0x2c0 mm/slab.c:3757
add+0x12d2/0x1970 drivers/net/wireguard/allowedips.c:266
wg_allowedips_insert_v4+0xf6/0x160 drivers/net/wireguard/allowedips.c:320
set_allowedip drivers/net/wireguard/netlink.c:343 [inline]
set_peer+0xfb9/0x1150 drivers/net/wireguard/netlink.c:468
wg_set_device+0xbd4/0x1350 drivers/net/wireguard/netlink.c:591
genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:672
____sys_sendmsg+0x753/0x880 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0x105/0x1d0 net/socket.c:2430
__do_sys_sendmsg net/socket.c:2439 [inline]
__se_sys_sendmsg net/socket.c:2437 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff88809881a500
which belongs to the cache kmalloc-64 of size 64
The buggy address is located 56 bytes inside of
64-byte region [ffff88809881a500, ffff88809881a540)
The buggy address belongs to the page:
page:ffffea0002620680 refcount:1 mapcount:0 mapping:ffff8880aa400380 index:0x0
raw: 00fffe0000000200 ffffea000250b748 ffffea000254bac8 ffff8880aa400380
raw: 0000000000000000 ffff88809881a000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88809881a400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88809881a480: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>ffff88809881a500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
^
ffff88809881a580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88809881a600: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: [email protected]
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net_sched: fix a resource leak in tcindex_set_parms()

Jakub noticed there is a potential resource leak in
tcindex_set_parms(): when tcindex_filter_result_init() fails
and it jumps to 'errout1' which doesn't release the memory
and resources allocated by tcindex_alloc_perfect_hash().

We should just jump to 'errout_alloc' which calls
tcindex_free_perfect_hash().

Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-by: Jakub Kicinski <[email protected]>
Cc: Jamal Hadi Salim <[email protected]>
Cc: Jiri Pirko <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: fix use-after-free on tcp fallback

When an mptcp socket connects to a tcp peer or when a middlebox interferes
with tcp options, mptcp needs to fall back to plain tcp.
Problem is that mptcp is trying to be too clever in this case:

It attempts to close the mptcp meta sk and transparently replace it with
the (only) subflow tcp sk.

Unfortunately, this is racy -- the socket is already exposed to userspace.
Any parallel calls to send/recv/setsockopt etc. can cause use-after-free:

BUG: KASAN: use-after-free in atomic_try_cmpxchg include/asm-generic/atomic-instrumented.h:693 [inline]
CPU: 1 PID: 2083 Comm: syz-executor.1 Not tainted 5.5.0 #2
atomic_try_cmpxchg include/asm-generic/atomic-instrumented.h:693 [inline]
queued_spin_lock include/asm-generic/qspinlock.h:78 [inline]
do_raw_spin_lock include/linux/spinlock.h:181 [inline]
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
_raw_spin_lock_bh+0x71/0xd0 kernel/locking/spinlock.c:175
spin_lock_bh include/linux/spinlock.h:343 [inline]
__lock_sock+0x105/0x190 net/core/sock.c:2414
lock_sock_nested+0x10f/0x140 net/core/sock.c:2938
lock_sock include/net/sock.h:1516 [inline]
mptcp_setsockopt+0x2f/0x1f0 net/mptcp/protocol.c:800
__sys_setsockopt+0x152/0x240 net/socket.c:2130
__do_sys_setsockopt net/socket.c:2146 [inline]
__se_sys_setsockopt net/socket.c:2143 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x44/0xa9

While the use-after-free can be resolved, there is another problem:
sock->ops and sock->sk assignments are not atomic, i.e. we may get calls
into mptcp functions with sock->sk already pointing at the subflow socket,
or calls into tcp functions with a mptcp meta sk.

Remove the fallback code and call the relevant functions for the (only)
subflow in case the mptcp socket is connected to tcp peer.

Reported-by: Christoph Paasch <[email protected]>
Diagnosed-by: Paolo Abeni <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: microchip: Platform data shan't include kernel.h

Replace with appropriate types.h.

Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: b53: Platform data shan't include kernel.h

Replace with appropriate types.h.

Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netdevsim: fix ptr_ret.cocci warnings

drivers/net/netdevsim/dev.c:937:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Fixes: 6556ff32f12d ("netdevsim: use IS_ERR instead of IS_ERR_OR_NULL for debugfs")
CC: Taehee Yoo <[email protected]>
Signed-off-by: kbuild test robot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sgi: ioc3-eth: Remove leftover free_irq()

Commit 0ce5ebd24d25 ("mfd: ioc3: Add driver for SGI IOC3 chip") moved
request_irq() from ioc3_open into probe function, but forgot to remove
free_irq() from ioc3_close.

Fixes: 0ce5ebd24d25 ("mfd: ioc3: Add driver for SGI IOC3 chip")
Signed-off-by: Thomas Bogendoerfer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cifs: fail i/o on soft mounts if sessionsetup errors out

RHBZ: 1579050

If we have a soft mount we should fail commands for session-setup
failures (such as the password having changed/ account being deleted/ ...)
and return an error back to the application.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
CC: Stable <[email protected]>

smb3: fix problem with null cifs super block with previous patch

Add check for null cifs_sb to create_options helper

Signed-off-by: Steve French <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Aurelien Aptel <[email protected]>

Merge tag 'asoc-v5.6-2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v5.6

A collection of updates for bugs fixed since the initial pull
request, the most important one being the addition of COMMON_CLK
for wcd934x which is needed for MFD to be merged.

ASoC: wcd934x: Add missing COMMON_CLK dependency to SND_SOC_ALL_CODECS

Just adding a dependency on COMMON_CLK to SND_SOC_WCD934X is not
sufficient, as enabling SND_SOC_ALL_CODECS will still select it,
breaking the build later:

    WARNING: unmet direct dependencies detected for SND_SOC_WCD934X
      Depends on [n]: SOUND [=m] && !UML && SND [=m] && SND_SOC [=m] && COMMON_CLK [=n] && MFD_WCD934X [=m]
      Selected by [m]:
      - SND_SOC_ALL_CODECS [=m] && SOUND [=m] && !UML && SND [=m] && SND_SOC [=m] && COMPILE_TEST [=y] && MFD_WCD934X [=m]
    ...
    ERROR: "of_clk_add_provider" [sound/soc/codecs/snd-soc-wcd934x.ko] undefined!
    ERROR: "of_clk_src_simple_get" [sound/soc/codecs/snd-soc-wcd934x.ko] undefined!
    ERROR: "clk_hw_register" [sound/soc/codecs/snd-soc-wcd934x.ko] undefined!
    ERROR: "__clk_get_name" [sound/soc/codecs/snd-soc-wcd934x.ko] undefined!

Fix this by adding the missing dependency to SND_SOC_ALL_CODECS

Fixes: 42b716359beca106 ("ASoC: wcd934x: Add missing COMMON_CLK dependency")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Tested-by: Stephen Rothwell <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline

As the bootconfig is appended to the initrd it is not as easy to modify as
the kernel command line. If there's some issue with the kernel, and the
developer wants to boot a pristine kernel, it should not be needed to modify
the initrd to remove the bootconfig for a single boot.

As bootconfig is silently added (if the admin does not know where to look
they may not know it's being loaded). It should be explicitly added to the
kernel cmdline. The loading of the bootconfig is only done if "bootconfig"
is on the kernel command line. This will let admins know that the kernel
command line is extended.

Note, after adding printk()s for when the size is too great or the checksum
is wrong, exposed that the current method always looked for the boot config,
and if this size and checksum matched, it would parse it (as if either is
wrong a printk has been added to show this). It's better to only check this
if the boot config is asked to be looked for.

Link: https://lore.kernel.org/r/CAHk-=wjfjO+h6bQzrTf=YCZA53Y3EDyAs3Z4gEsT7icA3u_Psw@mail.gmail.com
Acked-by: Masami Hiramatsu <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

dt-bindings: Fix paths in schema $id fields

The $id path checks were inadequately checking the path part of the $id
value. With the check fixed, there's a number of errors that need to be
fixed. Most of the errors are including 'bindings/' in the path which
should not be as that is considered the root.

Cc: Andy Gross <[email protected]>
Cc: Bjorn Andersson <[email protected]>
Cc: Manivannan Sadhasivam <[email protected]>
Cc: Michael Turquette <[email protected]>
Cc: Shawn Guo <[email protected]>
Cc: Sascha Hauer <[email protected]>
Cc: Pengutronix Kernel Team <[email protected]>
Cc: Fabio Estevam <[email protected]>
Cc: NXP Linux Team <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Cc: "Nuno Sá" <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Stefan Popa <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Marcus Folkesson <[email protected]>
Cc: Kent Gustavsson <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Signed-off-by: Rob Herring <[email protected]>

Merge tag 'perf-core-for-mingo-5.6-20200201' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

perf maps:

  Cengiz Can:

  - Add missing unlock to maps__insert() error case.

srcline:

  Changbin Du:

  - Make perf able to build with latest libbfd.

perf parse:

  Leo Yan:

  - Keep copy of string in perf_evsel_config_term() to fix sink terms
    processing in ARM CoreSight.

perf test:

  Thomas Richter:

  - Fix test case Merge cpu map, removing extra reference count drop that
    causes a segfault on s/390.

perf probe:

  Thomas Richter:

  - Add ustring support for perf probe command

Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'linus' into perf/urgent, to synchronize with upstream

Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'parisc-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc updates from Helge Deller:
"A page table initialization cleanup from Mike Rapoport and regenerated
  defconfig files from Helge Deller"

* 'parisc-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Regenerate parisc defconfigs
  parisc: map_pages(): cleanup page table initialization

xtensa: ISS: improve simcall assembly

Drop redundant result moving from inline assembly, use a1 and b1 values
as return value and errno value respectively.

Signed-off-by: Max Filippov <[email protected]>

xtensa: reorganize vectors placement

Allow vectors to be either merged into the kernel .text or put at a
fixed virtual address independently of XIP option. Drop option that
puts vectors at a fixed offset from the kernel text. Add choice to
Kconfig.
Vectors at fixed virtual address may be useful for XIP-aware MTD support
and for noMMU configurations with available IRAM. Configurations without
VECBASE register must put their vectors at specific locations regardless
of the selected option. All other configurations should happily use
merged vectors.

Signed-off-by: Max Filippov <[email protected]>

xtensa: separate SMP and XIP support

There's no real dependency between SMP and XIP, allow them to be
selected together. Always define 2- and 4-argument SECTION_VECTOR
macros, always use 4-argument macro for the secondary reset vector and
always define relocation entry for it.

Signed-off-by: Max Filippov <[email protected]>

xtensa: move fast exception handlers close to vectors

On XIP kernels it makes sense to have exception vectors and fast
exception handlers together (in a fast memory). In addition, with MTD
XIP support both vectors and fast exception handlers must be outside of
the FLASH.

Add section .exception.text and move fast exception handlers to it.
Put it together with vectors when vectors are outside of the .text.

Signed-off-by: Max Filippov <[email protected]>

Merge tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggy

Pull jfs update from David Kleikamp:
"Trivial cleanup for jfs"

* tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggy:
jfs: remove unused MAXL2PAGES

Merge branch 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs recursive removal updates from Al Viro:
"We have quite a few places where synthetic filesystems do an
  equivalent of 'rm -rf', with varying amounts of code duplication,
  wrong locking, etc. That really ought to be a library helper.

  Only debugfs (and very similar tracefs) are converted here - I have
  more conversions, but they'd never been in -next, so they'll have to
  wait"

* 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems

Merge branch 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs timestamp updates from Al Viro:
"More 64bit timestamp work"

* 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: don't bother with timestamp truncation
  fs: Do not overload update_time
  fs: Delete timespec64_trunc()
  fs: ubifs: Eliminate timespec64_trunc() usage
  fs: ceph: Delete timespec64_trunc() usage
  fs: cifs: Delete usage of timespec64_trunc
  fs: fat: Eliminate timespec64_trunc() usage
  utimes: Clamp the timestamps in notify_change()

kconfig: Invalidate all symbols after changing to y or m.

Since commit 89b9060987d9 ("kconfig: Add yes2modconfig and
mod2yesconfig targets.") forgot to clear SYMBOL_VALID bit after
changing to y or m, these targets did not save the changes.
Call sym_clear_all_valid() so that all symbols are revalidated.

Fixes: 89b9060987d9 ("kconfig: Add yes2modconfig and mod2yesconfig targets.")
Signed-off-by: Tetsuo Handa <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

kallsyms: fix type of kallsyms_token_table[]

kallsyms_token_table[] only contains ASCII characters. It should be
char instead of u8.

Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>

drm/amd/dm/mst: Ignore payload update failures

Disabling a display on MST can potentially happen after the entire MST
topology has been removed, which means that we can't communicate with
the topology at all in this scenario. Likewise, this also means that we
can't properly update payloads on the topology and as such, it's a good
idea to ignore payload update failures when disabling displays.
Currently, amdgpu makes the mistake of halting the payload update
process when any payload update failures occur, resulting in leaving
DC's local copies of the payload tables out of date.

This ends up causing problems with hotplugging MST topologies, and
causes modesets on the second hotplug to fail like so:

[drm] Failed to updateMST allocation table forpipe idx:1
------------[ cut here ]------------
WARNING: CPU: 5 PID: 1511 at
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link.c:2677
update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
Modules linked in: cdc_ether usbnet fuse xt_conntrack nf_conntrack
nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc sunrpc
vfat fat wmi_bmof uvcvideo snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_codec_hdmi videobuf2_vmalloc snd_hda_intel videobuf2_memops
videobuf2_v4l2 snd_intel_dspcfg videobuf2_common crct10dif_pclmul
snd_hda_codec videodev crc32_pclmul snd_hwdep snd_hda_core
ghash_clmulni_intel snd_seq mc joydev pcspkr snd_seq_device snd_pcm
sp5100_tco k10temp i2c_piix4 snd_timer thinkpad_acpi ledtrig_audio snd
wmi soundcore video i2c_scmi acpi_cpufreq ip_tables amdgpu(O)
rtsx_pci_sdmmc amd_iommu_v2 gpu_sched mmc_core i2c_algo_bit ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm
crc32c_intel serio_raw hid_multitouch r8152 mii nvme r8169 nvme_core
rtsx_pci pinctrl_amd
CPU: 5 PID: 1511 Comm: gnome-shell Tainted: G           O      5.5.0-rc7Lyude-Test+ #4
Hardware name: LENOVO FA495SIT26/FA495SIT26, BIOS R12ET22W(0.22 ) 01/31/2019
RIP: 0010:update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
Code: 28 00 00 00 75 2b 48 8d 65 e0 5b 41 5c 41 5d 41 5e 5d c3 0f b6 06
49 89 1c 24 41 88 44 24 08 0f b6 46 01 41 88 44 24 09 eb 93 <0f> 0b e9
2f ff ff ff e8 a6 82 a3 c2 66 0f 1f 44 00 00 0f 1f 44 00
RSP: 0018:ffffac428127f5b0 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff8d1e166eee80 RCX: 0000000000000000
RDX: ffffac428127f668 RSI: ffff8d1e166eee80 RDI: ffffac428127f610
RBP: ffffac428127f640 R08: ffffffffc03d94a8 R09: 0000000000000000
R10: ffff8d1e24b02000 R11: ffffac428127f5b0 R12: ffff8d1e1b83d000
R13: ffff8d1e1bea0b08 R14: 0000000000000002 R15: 0000000000000002
FS:  00007fab23ffcd80(0000) GS:ffff8d1e28b40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f151f1711e8 CR3: 00000005997c0000 CR4: 00000000003406e0
Call Trace:
? mutex_lock+0xe/0x30
dc_link_allocate_mst_payload+0x9a/0x210 [amdgpu]
? dm_read_reg_func+0x39/0xb0 [amdgpu]
? core_link_enable_stream+0x656/0x730 [amdgpu]
core_link_enable_stream+0x656/0x730 [amdgpu]
dce110_apply_ctx_to_hw+0x58e/0x5d0 [amdgpu]
? dcn10_verify_allow_pstate_change_high+0x1d/0x280 [amdgpu]
? dcn10_wait_for_mpcc_disconnect+0x3c/0x130 [amdgpu]
dc_commit_state+0x292/0x770 [amdgpu]
? add_timer+0x101/0x1f0
? ttm_bo_put+0x1a1/0x2f0 [ttm]
amdgpu_dm_atomic_commit_tail+0xb59/0x1ff0 [amdgpu]
? amdgpu_move_blit.constprop.0+0xb8/0x1f0 [amdgpu]
? amdgpu_bo_move+0x16d/0x2b0 [amdgpu]
? ttm_bo_handle_move_mem+0x118/0x570 [ttm]
? ttm_bo_validate+0x134/0x150 [ttm]
? dm_plane_helper_prepare_fb+0x1b9/0x2a0 [amdgpu]
? _cond_resched+0x15/0x30
? wait_for_completion_timeout+0x38/0x160
? _cond_resched+0x15/0x30
? wait_for_completion_interruptible+0x33/0x190
commit_tail+0x94/0x130 [drm_kms_helper]
drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
drm_atomic_helper_set_config+0x70/0xb0 [drm_kms_helper]
drm_mode_setcrtc+0x194/0x6a0 [drm]
? _cond_resched+0x15/0x30
? mutex_lock+0xe/0x30
? drm_mode_getcrtc+0x180/0x180 [drm]
drm_ioctl_kernel+0xaa/0xf0 [drm]
drm_ioctl+0x208/0x390 [drm]
? drm_mode_getcrtc+0x180/0x180 [drm]
amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
do_vfs_ioctl+0x458/0x6d0
ksys_ioctl+0x5e/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x55/0x1b0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fab2121f87b
Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd045f9068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffd045f90a0 RCX: 00007fab2121f87b
RDX: 00007ffd045f90a0 RSI: 00000000c06864a2 RDI: 000000000000000b
RBP: 00007ffd045f90a0 R08: 0000000000000000 R09: 000055dbd2985d10
R10: 000055dbd2196280 R11: 0000000000000246 R12: 00000000c06864a2
R13: 000000000000000b R14: 0000000000000000 R15: 000055dbd2196280
---[ end trace 6ea888c24d2059cd ]---

Note as well, I have only been able to reproduce this on setups with 2
MST displays.

Changes since v1:
* Don't return false when part 1 or part 2 of updating the payloads
  fails, we don't want to abort at any step of the process even if
  things fail

Reviewed-by: Mikita Lipski <[email protected]>
Signed-off-by: Lyude Paul <[email protected]>
Acked-by: Harry Wentland <[email protected]>
Cc: [email protected]
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: update default voltage for boot od table for navi1x

It needed to be updated as well so it will show the proper values
if you reset to the defaults.

Bug: https://gitlab.freedesktop.org/drm/amd/issues/1020
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

io_uring: cleanup fixed file data table references

syzbot reports a use-after-free in io_ring_file_ref_switch() when it
tries to switch back to percpu mode. When we put the final reference to
the table by calling percpu_ref_kill_and_confirm(), we don't want the
zero reference to queue async work for flushing the potentially queued
up items. We currently do a few flush_work(), but they merely paper
around the issue, since the work item may not have been queued yet
depending on the when the percpu-ref callback gets run.

Coming into the file unregister, we know we have the ring quiesced.
io_ring_file_ref_switch() can check for whether or not the ref is dying
or not, and not queue anything async at that point. Once the ref has
been confirmed killed, flush any potential items manually.

Reported-by: [email protected]
Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <[email protected]>

cpuidle: Documentation: Clean up PM QoS description

Clean up the language in one paragraph in the PM QoS description in
Documentation/admin-guide/pm/cpuidle.rst.

Signed-off-by: Rafael J. Wysocki <[email protected]>

io_uring: spin for sq thread to idle on shutdown

As part of io_uring shutdown, we cancel work that is pending and won't
necessarily complete on its own. That includes requests like poll
commands and timeouts.

If we're using SQPOLL for kernel side submission and we shutdown the
ring immediately after queueing such work, we can race with the sqthread
doing the submission. This means we may miss cancelling some work, which
results in the io_uring shutdown hanging forever.

Cc: [email protected]
Signed-off-by: Jens Axboe <[email protected]>

dt-bindings: PCI: intel: Fix dt_binding_check compilation failure

Remove <dt-bindings/clock/intel,lgm-clk.h> dependency as
it is not present in the mainline tree. Use numeric value
instead of LGM_GCLK_PCIE10 macro.

Signed-off-by: Dilip Kota <[email protected]>
[robh: Also drop interrupt-parent from example]
Signed-off-by: Rob Herring <[email protected]>

dt-bindings: phy: Fix errors in intel,lgm-emmc-phy example

DT labels can't have '-' in them causing a compile failure in the example.
Fixing that leads to more warnings:

Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.example.dts:23.13-33: Warning (reg_format): /example-0/chiptop@e0200000/emmc-phy@a8:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 1)
Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.example.dt.yaml: Warning (pci_device_bus_num): Failed prerequisite 'reg_format'
Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.example.dt.yaml: Warning (i2c_bus_reg): Failed prerequisite 'reg_format'
Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.example.dt.yaml: Warning (spi_bus_reg): Failed prerequisite 'reg_format'
Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.example.dts:21.33-26.13: Warning (avoid_default_addr_size): /example-0/chiptop@e0200000/emmc-phy@a8: Relying on default #address-cells value
Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.example.dts:21.33-26.13: Warning (avoid_default_addr_size): /example-0/chiptop@e0200000/emmc-phy@a8: Relying on default #size-cells value

Fixes: 5bc999108025 ("dt-bindings: phy: intel-emmc-phy: Add YAML schema for LGM eMMC PHY")
Cc: Ramuthevar Vadivel Murugan <[email protected]>
Cc: Dafna Hirschfeld <[email protected]>
Cc: Kishon Vijay Abraham I <[email protected]>
Signed-off-by: Rob Herring <[email protected]>