]> Git Repo - linux.git/log
linux.git
3 years agonet: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()
Teng Qi [Thu, 18 Nov 2021 07:01:18 +0000 (15:01 +0800)]
net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()

The definition of macro MOTO_SROM_BUG is:
  #define MOTO_SROM_BUG    (lp->active == 8 && (get_unaligned_le32(
  dev->dev_addr) & 0x00ffffff) == 0x3e0008)

and the if statement
  if (MOTO_SROM_BUG) lp->active = 0;

using this macro indicates lp->active could be 8. If lp->active is 8 and
the second comparison of this macro is false. lp->active will remain 8 in:
  lp->phy[lp->active].gep = (*p ? p : NULL); p += (2 * (*p) + 1);
  lp->phy[lp->active].rst = (*p ? p : NULL); p += (2 * (*p) + 1);
  lp->phy[lp->active].mc  = get_unaligned_le16(p); p += 2;
  lp->phy[lp->active].ana = get_unaligned_le16(p); p += 2;
  lp->phy[lp->active].fdx = get_unaligned_le16(p); p += 2;
  lp->phy[lp->active].ttm = get_unaligned_le16(p); p += 2;
  lp->phy[lp->active].mci = *p;

However, the length of array lp->phy is 8, so array overflows can occur.
To fix these possible array overflows, we first check lp->active and then
return -EINVAL if it is greater or equal to ARRAY_SIZE(lp->phy) (i.e. 8).

Reported-by: TOTE Robot <[email protected]>
Signed-off-by: Teng Qi <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agonet: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound
zhangyue [Thu, 18 Nov 2021 05:46:32 +0000 (13:46 +0800)]
net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound

In line 5001, if all id in the array 'lp->phy[8]' is not 0, when the
'for' end, the 'k' is 8.

At this time, the array 'lp->phy[8]' may be out of bound.

Signed-off-by: zhangyue <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-
David S. Miller [Thu, 18 Nov 2021 11:48:33 +0000 (11:48 +0000)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-
queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-11-17

This series contains updates to i40e driver only.

Eryk adds accounting for VLAN header in packet size when VF port VLAN is
configured. He also fixes TC queue distribution when the user has changed
queue counts as well as for configuration of VF ADQ which caused dropped
packets.

Michal adds tracking for when a VSI is being released to prevent null
pointer dereference when managing filters.

Karen ensures PF successfully initiates VF requested reset which could
cause a call trace otherwise.

Jedrzej moves validation of channel queue value earlier to prevent
partial configuration when the value is invalid.

Grzegorz corrects the reported error when adding filter fails.
====================

Signed-off-by: David S. Miller <[email protected]>
3 years agoipv6: check return value of ipv6_skip_exthdr
Jordy Zomer [Wed, 17 Nov 2021 19:06:48 +0000 (20:06 +0100)]
ipv6: check return value of ipv6_skip_exthdr

The offset value is used in pointer math on skb->data.
Since ipv6_skip_exthdr may return -1 the pointer to uh and th
may not point to the actual udp and tcp headers and potentially
overwrite other stuff. This is why I think this should be checked.

EDIT:  added {}'s, thanks Kees

Signed-off-by: Jordy Zomer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agoe100: fix device suspend/resume
Jesse Brandeburg [Wed, 17 Nov 2021 20:59:52 +0000 (12:59 -0800)]
e100: fix device suspend/resume

As reported in [1], e100 was no longer working for suspend/resume
cycles. The previous commit mentioned in the fixes appears to have
broken things and this attempts to practice best known methods for
device power management and keep wake-up working while allowing
suspend/resume to work. To do this, I reorder a little bit of code
and fix the resume path to make sure the device is enabled.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=214933

Fixes: 69a74aef8a18 ("e100: use generic power management")
Cc: Vaibhav Gupta <[email protected]>
Reported-by: Alexey Kuznetsov <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Tested-by: Alexey Kuznetsov <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agodevlink: Don't throw an error if flash notification sent before devlink visible
Leon Romanovsky [Wed, 17 Nov 2021 14:49:09 +0000 (16:49 +0200)]
devlink: Don't throw an error if flash notification sent before devlink visible

The mlxsw driver calls to various devlink flash routines even before
users can get any access to the devlink instance itself. For example,
mlxsw_core_fw_rev_validate() one of such functions.

__mlxsw_core_bus_device_register
 -> mlxsw_core_fw_rev_validate
  -> mlxsw_core_fw_flash
   -> mlxfw_firmware_flash
    -> mlxfw_status_notify
     -> devlink_flash_update_status_notify
      -> __devlink_flash_update_notify
       -> WARN_ON(...)

It causes to the WARN_ON to trigger warning about devlink not registered.

Fixes: cf530217408e ("devlink: Notify users when objects are accessible")
Reported-by: Danielle Ratson <[email protected]>
Tested-by: Danielle Ratson <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agopage_pool: Revert "page_pool: disable dma mapping support..."
Yunsheng Lin [Wed, 17 Nov 2021 07:56:52 +0000 (15:56 +0800)]
page_pool: Revert "page_pool: disable dma mapping support..."

This reverts commit d00e60ee54b12de945b8493cf18c1ada9e422514.

As reported by Guillaume in [1]:
Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT
in 32-bit systems, which breaks the bootup proceess when a
ethernet driver is using page pool with PP_FLAG_DMA_MAP flag.
As we were hoping we had no active consumers for such system
when we removed the dma mapping support, and LPAE seems like
a common feature for 32 bits system, so revert it.

1. https://www.spinics.net/lists/netdev/msg779890.html

Fixes: d00e60ee54b1 ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA")
Signed-off-by: Yunsheng Lin <[email protected]>
Reported-by: "kernelci.org bot" <[email protected]>
Tested-by: "kernelci.org bot" <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agoethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge...
Teng Qi [Wed, 17 Nov 2021 03:44:53 +0000 (11:44 +0800)]
ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port()

The if statement:
  if (port >= DSAF_GE_NUM)
        return;

limits the value of port less than DSAF_GE_NUM (i.e., 8).
However, if the value of port is 6 or 7, an array overflow could occur:
  port_rst_off = dsaf_dev->mac_cb[port]->port_rst_off;

because the length of dsaf_dev->mac_cb is DSAF_MAX_PORT_NUM (i.e., 6).

To fix this possible array overflow, we first check port and if it is
greater than or equal to DSAF_MAX_PORT_NUM, the function returns.

Reported-by: TOTE Robot <[email protected]>
Signed-off-by: Teng Qi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agoMerge branch 'rework/printk_safe-removal' into for-linus
Petr Mladek [Thu, 18 Nov 2021 09:03:47 +0000 (10:03 +0100)]
Merge branch 'rework/printk_safe-removal' into for-linus

3 years agoparisc: Enable CONFIG_PRINTK_TIME=y in 32bit defconfig
Helge Deller [Wed, 17 Nov 2021 14:48:45 +0000 (15:48 +0100)]
parisc: Enable CONFIG_PRINTK_TIME=y in 32bit defconfig

Signed-off-by: Helge Deller <[email protected]>
3 years agoRevert "parisc: Reduce sigreturn trampoline to 3 instructions"
Helge Deller [Wed, 17 Nov 2021 10:05:07 +0000 (11:05 +0100)]
Revert "parisc: Reduce sigreturn trampoline to 3 instructions"

This reverts commit e4f2006f1287e7ea17660490569cff323772dac4.

This patch shows problems with signal handling. Revert it for now.

Signed-off-by: Helge Deller <[email protected]>
Cc: <[email protected]> # v5.15
3 years agoparisc: Wrap assembler related defines inside __ASSEMBLY__
Helge Deller [Tue, 16 Nov 2021 12:12:21 +0000 (13:12 +0100)]
parisc: Wrap assembler related defines inside __ASSEMBLY__

Building allmodconfig shows errors in the gpu/drm/msm snapdragon drivers,
because a COND() define is used there which conflicts with the COND() for
PA-RISC assembly.  Although the snapdragon driver isn't relevant for parisc, it
is nevertheless compiled when CONFIG_COMPILE_TEST is defined.

Move the COND() define and other PA-RISC mnemonics inside the #ifdef
__ASSEMBLY__ part to avoid this conflict.

Signed-off-by: Helge Deller <[email protected]>
Reported-by: kernel test robot <[email protected]>
3 years agoparisc: Wire up futex_waitv
Helge Deller [Tue, 16 Nov 2021 12:11:26 +0000 (13:11 +0100)]
parisc: Wire up futex_waitv

Signed-off-by: Helge Deller <[email protected]>
3 years agoparisc: Include stringify.h to avoid build error in crypto/api.c
Helge Deller [Mon, 15 Nov 2021 16:34:50 +0000 (17:34 +0100)]
parisc: Include stringify.h to avoid build error in crypto/api.c

Include stringify.h to avoid this build error:
 arch/parisc/include/asm/jump_label.h: error: expected ':' before '__stringify'
 arch/parisc/include/asm/jump_label.h: error: label 'l_yes' defined but not used [-Werror=unused-label]

Signed-off-by: Helge Deller <[email protected]>
Reported-by: kernel test robot <[email protected]>
3 years agoKVM: x86: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
Vitaly Kuznetsov [Tue, 16 Nov 2021 16:34:43 +0000 (17:34 +0100)]
KVM: x86: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS

It doesn't make sense to return the recommended maximum number of
vCPUs which exceeds the maximum possible number of vCPUs.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211116163443[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: s390: Cap KVM_CAP_NR_VCPUS by num_online_cpus()
Vitaly Kuznetsov [Tue, 16 Nov 2021 16:34:42 +0000 (17:34 +0100)]
KVM: s390: Cap KVM_CAP_NR_VCPUS by num_online_cpus()

KVM_CAP_NR_VCPUS is a legacy advisory value which on other architectures
return num_online_cpus() caped by KVM_CAP_NR_VCPUS or something else
(ppc and arm64 are special cases). On s390, KVM_CAP_NR_VCPUS returns
the same as KVM_CAP_MAX_VCPUS and this may turn out to be a bad
'advice'. Switch s390 to returning caped num_online_cpus() too.

Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Message-Id: <20211116163443[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: RISC-V: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
Vitaly Kuznetsov [Tue, 16 Nov 2021 16:34:41 +0000 (17:34 +0100)]
KVM: RISC-V: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS

It doesn't make sense to return the recommended maximum number of
vCPUs which exceeds the maximum possible number of vCPUs.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Acked-by: Anup Patel <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Message-Id: <20211116163443[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: PPC: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
Vitaly Kuznetsov [Tue, 16 Nov 2021 16:34:40 +0000 (17:34 +0100)]
KVM: PPC: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS

It doesn't make sense to return the recommended maximum number of
vCPUs which exceeds the maximum possible number of vCPUs.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211116163443[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: MIPS: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
Vitaly Kuznetsov [Tue, 16 Nov 2021 16:34:39 +0000 (17:34 +0100)]
KVM: MIPS: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS

It doesn't make sense to return the recommended maximum number of
vCPUs which exceeds the maximum possible number of vCPUs.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211116163443[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: arm64: Cap KVM_CAP_NR_VCPUS by kvm_arm_default_max_vcpus()
Vitaly Kuznetsov [Tue, 16 Nov 2021 16:34:38 +0000 (17:34 +0100)]
KVM: arm64: Cap KVM_CAP_NR_VCPUS by kvm_arm_default_max_vcpus()

Generally, it doesn't make sense to return the recommended maximum number
of vCPUs which exceeds the maximum possible number of vCPUs.

Note: ARM64 is special as the value returned by KVM_CAP_MAX_VCPUS differs
depending on whether it is a system-wide ioctl or a per-VM one. Previously,
KVM_CAP_NR_VCPUS didn't have this difference and it seems preferable to
keep the status quo. Cap KVM_CAP_NR_VCPUS by kvm_arm_default_max_vcpus()
which is what gets returned by system-wide KVM_CAP_MAX_VCPUS.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20211116163443[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: x86: Assume a 64-bit hypercall for guests with protected state
Tom Lendacky [Mon, 24 May 2021 17:48:57 +0000 (12:48 -0500)]
KVM: x86: Assume a 64-bit hypercall for guests with protected state

When processing a hypercall for a guest with protected state, currently
SEV-ES guests, the guest CS segment register can't be checked to
determine if the guest is in 64-bit mode. For an SEV-ES guest, it is
expected that communication between the guest and the hypervisor is
performed to shared memory using the GHCB. In order to use the GHCB, the
guest must have been in long mode, otherwise writes by the guest to the
GHCB would be encrypted and not be able to be comprehended by the
hypervisor.

Create a new helper function, is_64_bit_hypercall(), that assumes the
guest is in 64-bit mode when the guest has protected state, and returns
true, otherwise invoking is_64_bit_mode() to determine the mode. Update
the hypercall related routines to use is_64_bit_hypercall() instead of
is_64_bit_mode().

Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurences of calls to
this helper function for a guest running with protected state.

Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Sean Christopherson <[email protected]>
Signed-off-by: Tom Lendacky <[email protected]>
Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoselftests: KVM: Add /x86_64/sev_migrate_tests to .gitignore
Arnaldo Carvalho de Melo [Tue, 16 Nov 2021 15:03:25 +0000 (12:03 -0300)]
selftests: KVM: Add /x86_64/sev_migrate_tests to .gitignore

  $ git status
  nothing to commit, working tree clean
  $
  $ make -C tools/testing/selftests/kvm/ > /dev/null 2>&1
  $ git status

  Untracked files:
    (use "git add <file>..." to include in what will be committed)
   tools/testing/selftests/kvm/x86_64/sev_migrate_tests

  nothing added to commit but untracked files present (use "git add" to track)
  $

Fixes: 6a58150859fdec76 ("selftest: KVM: Add intra host migration tests")
Cc: Brijesh Singh <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marc Orr <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Gonda <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Tom Lendacky <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Message-Id: <YZPIPfvYgRDCZi/[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoriscv: kvm: fix non-kernel-doc comment block
Randy Dunlap [Sun, 7 Nov 2021 03:47:06 +0000 (20:47 -0700)]
riscv: kvm: fix non-kernel-doc comment block

Don't use "/**" to begin a comment block for a non-kernel-doc comment.

Prevents this docs build warning:

vcpu_sbi.c:3: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Copyright (c) 2019 Western Digital Corporation or its affiliates.

Fixes: dea8ee31a039 ("RISC-V: KVM: Add SBI v0.1 support")
Signed-off-by: Randy Dunlap <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: Atish Patra <[email protected]>
Cc: Anup Patel <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Message-Id: <20211107034706[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoMerge branch 'kvm-5.16-fixes' into kvm-master
Paolo Bonzini [Tue, 16 Nov 2021 13:08:19 +0000 (08:08 -0500)]
Merge branch 'kvm-5.16-fixes' into kvm-master

* Fixes for Xen emulation

* Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache

* Fixes for migration of 32-bit nested guests on 64-bit hypervisor

* Compilation fixes

* More SEV cleanups

3 years agoKVM: SEV: Fix typo in and tweak name of cmd_allowed_from_miror()
Sean Christopherson [Tue, 9 Nov 2021 21:51:01 +0000 (21:51 +0000)]
KVM: SEV: Fix typo in and tweak name of cmd_allowed_from_miror()

Rename cmd_allowed_from_miror() to is_cmd_allowed_from_mirror(), fixing
a typo and making it obvious that the result is a boolean where
false means "not allowed".

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211109215101.2211373[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: SEV: Drop a redundant setting of sev->asid during initialization
Sean Christopherson [Tue, 9 Nov 2021 21:51:00 +0000 (21:51 +0000)]
KVM: SEV: Drop a redundant setting of sev->asid during initialization

Remove a fully redundant write to sev->asid during SEV/SEV-ES guest
initialization.  The ASID is set a few lines earlier prior to the call to
sev_platform_init(), which doesn't take "sev" as a param, i.e. can't
muck with the ASID barring some truly magical behind-the-scenes code.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211109215101.2211373[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: SEV: WARN if SEV-ES is marked active but SEV is not
Sean Christopherson [Tue, 9 Nov 2021 21:50:59 +0000 (21:50 +0000)]
KVM: SEV: WARN if SEV-ES is marked active but SEV is not

WARN if the VM is tagged as SEV-ES but not SEV.  KVM relies on SEV and
SEV-ES being set atomically, and guards common flows with "is SEV", i.e.
observing SEV-ES without SEV means KVM has a fatal bug.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211109215101.2211373[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: SEV: Set sev_info.active after initial checks in sev_guest_init()
Sean Christopherson [Tue, 9 Nov 2021 21:50:58 +0000 (21:50 +0000)]
KVM: SEV: Set sev_info.active after initial checks in sev_guest_init()

Set sev_info.active during SEV/SEV-ES activation before calling any code
that can potentially consume sev_info.es_active, e.g. set "active" and
"es_active" as a pair immediately after the initial sanity checks.  KVM
generally expects that es_active can be true if and only if active is
true, e.g. sev_asid_new() deliberately avoids sev_es_guest() so that it
doesn't get a false negative.  This will allow WARNing in sev_es_guest()
if the VM is tagged as SEV-ES but not SEV.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211109215101.2211373[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: SEV: Disallow COPY_ENC_CONTEXT_FROM if target has created vCPUs
Sean Christopherson [Tue, 9 Nov 2021 21:50:56 +0000 (21:50 +0000)]
KVM: SEV: Disallow COPY_ENC_CONTEXT_FROM if target has created vCPUs

Reject COPY_ENC_CONTEXT_FROM if the destination VM has created vCPUs.
KVM relies on SEV activation to occur before vCPUs are created, e.g. to
set VMCB flags and intercepts correctly.

Fixes: 54526d1fd593 ("KVM: x86: Support KVM VMs sharing SEV context")
Cc: [email protected]
Cc: Peter Gonda <[email protected]>
Cc: Marc Orr <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Nathan Tempelman <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Tom Lendacky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211109215101.2211373[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache
David Woodhouse [Mon, 15 Nov 2021 16:50:27 +0000 (16:50 +0000)]
KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache

In commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time /
preempted status") I removed the only user of these functions because
it was basically impossible to use them safely.

There are two stages to the GFN->PFN mapping; first through the KVM
memslots to a userspace HVA and then through the page tables to
translate that HVA to an underlying PFN. Invalidations of the former
were being handled correctly, but no attempt was made to use the MMU
notifiers to invalidate the cache when the HVA->GFN mapping changed.

As a prelude to reinventing the gfn_to_pfn_cache with more usable
semantics, rip it out entirely and untangle the implementation of
the unsafe kvm_vcpu_map()/kvm_vcpu_unmap() functions from it.

All current users of kvm_vcpu_map() also look broken right now, and
will be dealt with separately. They broadly fall into two classes:

* Those which map, access the data and immediately unmap. This is
  mostly gratuitous and could just as well use the existing user
  HVA, and could probably benefit from a gfn_to_hva_cache as they
  do so.

* Those which keep the mapping around for a longer time, perhaps
  even using the PFN directly from the guest. These will need to
  be converted to the new gfn_to_pfn_cache and then kvm_vcpu_map()
  can be removed too.

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211115165030[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: nVMX: Use a gfn_to_hva_cache for vmptrld
David Woodhouse [Mon, 15 Nov 2021 16:50:26 +0000 (16:50 +0000)]
KVM: nVMX: Use a gfn_to_hva_cache for vmptrld

And thus another call to kvm_vcpu_map() can die.

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211115165030[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check
David Woodhouse [Mon, 15 Nov 2021 16:50:25 +0000 (16:50 +0000)]
KVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check

Kill another mostly gratuitous kvm_vcpu_map() which could just use the
userspace HVA for it.

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211115165030[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: x86/xen: Use sizeof_field() instead of open-coding it
David Woodhouse [Mon, 15 Nov 2021 16:50:23 +0000 (16:50 +0000)]
KVM: x86/xen: Use sizeof_field() instead of open-coding it

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211115165030[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12
David Woodhouse [Mon, 15 Nov 2021 16:50:24 +0000 (16:50 +0000)]
KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12

Using kvm_vcpu_map() for reading from the guest is entirely gratuitous,
when all we do is a single memcpy and unmap it again. Fix it up to use
kvm_read_guest()... but in fact I couldn't bring myself to do that
without also making it use a gfn_to_hva_cache for both that *and* the
copy in the other direction.

Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211115165030[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: x86/xen: Fix get_attr of KVM_XEN_ATTR_TYPE_SHARED_INFO
David Woodhouse [Mon, 15 Nov 2021 16:50:21 +0000 (16:50 +0000)]
KVM: x86/xen: Fix get_attr of KVM_XEN_ATTR_TYPE_SHARED_INFO

In commit 319afe68567b ("KVM: xen: do not use struct gfn_to_hva_cache") we
stopped storing this in-kernel as a GPA, and started storing it as a GFN.
Which means we probably should have stopped calling gpa_to_gfn() on it
when userspace asks for it back.

Cc: [email protected]
Fixes: 319afe68567b ("KVM: xen: do not use struct gfn_to_hva_cache")
Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <20211115165030[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: x86/mmu: include EFER.LMA in extended mmu role
Maxim Levitsky [Mon, 15 Nov 2021 13:18:37 +0000 (15:18 +0200)]
KVM: x86/mmu: include EFER.LMA in extended mmu role

Incorporate EFER.LMA into kvm_mmu_extended_role, as it used to compute the
guest root level and is not reflected in kvm_mmu_page_role.level when TDP
is in use.  When simply running the guest, it is impossible for EFER.LMA
and kvm_mmu.root_level to get out of sync, as the guest cannot transition
from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without
first bouncing through a different MMU context.  And stuffing guest state
via KVM_SET_SREGS{,2} also ensures a full MMU context reset.

However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to
set guest state when migrating the VM while L2 is active, the vCPU state
will reflect L2, not L1.  If L1 is using TDP for L2, then root_mmu will
have been configured using L2's state, despite not being used for L2.  If
L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will
be configured for guest PAE paging, but will match the mmu_role for 64-bit
paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit.

Alternatively, the root_mmu's role could be invalidated after a successful
KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu,
i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily
tricky, and not even needed if L1 and L2 do have the same role (e.g., they
are both 64-bit guests and run with the same CR4).

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20211115131837[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: nVMX: don't use vcpu->arch.efer when checking host state on nested state load
Maxim Levitsky [Mon, 15 Nov 2021 13:18:36 +0000 (15:18 +0200)]
KVM: nVMX: don't use vcpu->arch.efer when checking host state on nested state load

When loading nested state, don't use check vcpu->arch.efer to get the
L1 host's 64-bit vs. 32-bit state and don't check it for consistency
with respect to VM_EXIT_HOST_ADDR_SPACE_SIZE, as register state in vCPU
may be stale when KVM_SET_NESTED_STATE is called---and architecturally
does not exist.  When restoring L2 state in KVM, the CPU is placed in
non-root where nested VMX code has no snapshot of L1 host state: VMX
(conditionally) loads host state fields loaded on VM-exit, but they need
not correspond to the state before entry.  A simple case occurs in KVM
itself, where the host RIP field points to vmx_vmexit rather than the
instruction following vmlaunch/vmresume.

However, for the particular case of L1 being in 32- or 64-bit mode
on entry, the exit controls can be treated instead as the source of
truth regarding the state of L1 on entry, and can be used to check
that vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE matches vmcs12.HOST_EFER if
vmcs12.VM_EXIT_LOAD_IA32_EFER is set.  The consistency check on CPU
EFER vs. vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE, instead, happens only
on VM-Enter.  That's because, again, there's conceptually no "current"
L1 EFER to check on KVM_SET_NESTED_STATE.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <20211115131837[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: Fix steal time asm constraints
David Woodhouse [Sun, 14 Nov 2021 08:59:02 +0000 (08:59 +0000)]
KVM: Fix steal time asm constraints

In 64-bit mode, x86 instruction encoding allows us to use the low 8 bits
of any GPR as an 8-bit operand. In 32-bit mode, however, we can only use
the [abcd] registers. For which, GCC has the "q" constraint instead of
the less restrictive "r".

Also fix st->preempted, which is an input/output operand rather than an
input.

Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <89bf72db1b859990355f9c40713a34e0d2d86c98[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agocpuid: kvm_find_kvm_cpuid_features() should be declared 'static'
Paul Durrant [Mon, 15 Nov 2021 14:41:31 +0000 (14:41 +0000)]
cpuid: kvm_find_kvm_cpuid_features() should be declared 'static'

The lack a static declaration currently results in:

arch/x86/kvm/cpuid.c:128:26: warning: no previous prototype for function 'kvm_find_kvm_cpuid_features'

when compiling with "W=1".

Reported-by: kernel test robot <[email protected]>
Fixes: 760849b1476c ("KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES")
Signed-off-by: Paul Durrant <[email protected]>
Message-Id: <20211115144131[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoocteontx2-af: debugfs: don't corrupt user memory
Dan Carpenter [Wed, 17 Nov 2021 07:34:54 +0000 (10:34 +0300)]
octeontx2-af: debugfs: don't corrupt user memory

The user supplies the "count" value to say how big its read buffer is.
The rvu_dbg_lmtst_map_table_display() function does not take the "count"
into account but instead just copies the whole table, potentially
corrupting the user's data.

Introduce the "ret" variable to store how many bytes we can copy.  Also
I changed the type of "off" to size_t to make using min() simpler.

Fixes: 0daa55d033b0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table")
Signed-off-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/r/20211117073454.GD5237@kili
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoNFC: add NCI_UNREG flag to eliminate the race
Lin Ma [Tue, 16 Nov 2021 15:27:32 +0000 (23:27 +0800)]
NFC: add NCI_UNREG flag to eliminate the race

There are two sites that calls queue_work() after the
destroy_workqueue() and lead to possible UAF.

The first site is nci_send_cmd(), which can happen after the
nci_close_device as below

nfcmrvl_nci_unregister_dev   |  nfc_genl_dev_up
  nci_close_device           |
    flush_workqueue          |
    del_timer_sync           |
  nci_unregister_device      |    nfc_get_device
    destroy_workqueue        |    nfc_dev_up
    nfc_unregister_device    |      nci_dev_up
      device_del             |        nci_open_device
                             |          __nci_request
                             |            nci_send_cmd
                             |              queue_work !!!

Another site is nci_cmd_timer, awaked by the nci_cmd_work from the
nci_send_cmd.

  ...                        |  ...
  nci_unregister_device      |  queue_work
    destroy_workqueue        |
    nfc_unregister_device    |  ...
      device_del             |  nci_cmd_work
                             |  mod_timer
                             |  ...
                             |  nci_cmd_timer
                             |    queue_work !!!

For the above two UAF, the root cause is that the nfc_dev_up can race
between the nci_unregister_device routine. Therefore, this patch
introduce NCI_UNREG flag to easily eliminate the possible race. In
addition, the mutex_lock in nci_close_device can act as a barrier.

Signed-off-by: Lin Ma <[email protected]>
Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation")
Reviewed-by: Jakub Kicinski <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoNFC: reorder the logic in nfc_{un,}register_device
Lin Ma [Tue, 16 Nov 2021 15:26:52 +0000 (23:26 +0800)]
NFC: reorder the logic in nfc_{un,}register_device

There is a potential UAF between the unregistration routine and the NFC
netlink operations.

The race that cause that UAF can be shown as below:

 (FREE)                      |  (USE)
nfcmrvl_nci_unregister_dev   |  nfc_genl_dev_up
  nci_close_device           |
  nci_unregister_device      |    nfc_get_device
    nfc_unregister_device    |    nfc_dev_up
      rfkill_destory         |
      device_del             |      rfkill_blocked
  ...                        |    ...

The root cause for this race is concluded below:
1. The rfkill_blocked (USE) in nfc_dev_up is supposed to be placed after
the device_is_registered check.
2. Since the netlink operations are possible just after the device_add
in nfc_register_device, the nfc_dev_up() can happen anywhere during the
rfkill creation process, which leads to data race.

This patch reorder these actions to permit
1. Once device_del is finished, the nfc_dev_up cannot dereference the
rfkill object.
2. The rfkill_register need to be placed after the device_add of nfc_dev
because the parent device need to be created first. So this patch keeps
the order but inject device_lock to prevent the data race.

Signed-off-by: Lin Ma <[email protected]>
Fixes: be055b2f89b5 ("NFC: RFKILL support")
Reviewed-by: Jakub Kicinski <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoNFC: reorganize the functions in nci_request
Lin Ma [Mon, 15 Nov 2021 14:56:00 +0000 (22:56 +0800)]
NFC: reorganize the functions in nci_request

There is a possible data race as shown below:

thread-A in nci_request()       | thread-B in nci_close_device()
                                | mutex_lock(&ndev->req_lock);
test_bit(NCI_UP, &ndev->flags); |
...                             | test_and_clear_bit(NCI_UP, &ndev->flags)
mutex_lock(&ndev->req_lock);    |
                                |

This race will allow __nci_request() to be awaked while the device is
getting removed.

Similar to commit e2cb6b891ad2 ("bluetooth: eliminate the potential race
condition when removing the HCI controller"). this patch alters the
function sequence in nci_request() to prevent the data races between the
nci_close_device().

Signed-off-by: Lin Ma <[email protected]>
Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agotipc: check for null after calling kmemdup
Tadeusz Struk [Mon, 15 Nov 2021 16:01:43 +0000 (08:01 -0800)]
tipc: check for null after calling kmemdup

kmemdup can return a null pointer so need to check for it, otherwise
the null key will be dereferenced later in tipc_crypto_key_xmit as
can be seen in the trace [1].

Cc: [email protected]
Cc: [email protected] # 5.15, 5.14, 5.10
[1] https://syzkaller.appspot.com/bug?id=bca180abb29567b189efdbdb34cbf7ba851c2a58

Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Tadeusz Struk <[email protected]>
Acked-by: Ying Xue <[email protected]>
Acked-by: Jon Maloy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoi40e: Fix display error code in dmesg
Grzegorz Szczurek [Fri, 29 Oct 2021 09:26:01 +0000 (11:26 +0200)]
i40e: Fix display error code in dmesg

Fix misleading display error in dmesg if tc filter return fail.
Only i40e status error code should be converted to string, not linux
error code. Otherwise, we return false information about the error.

Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
Signed-off-by: Grzegorz Szczurek <[email protected]>
Signed-off-by: Mateusz Palczewski <[email protected]>
Tested-by: Dave Switzer <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agoi40e: Fix creation of first queue by omitting it if is not power of two
Jedrzej Jagielski [Mon, 21 Jun 2021 08:37:31 +0000 (08:37 +0000)]
i40e: Fix creation of first queue by omitting it if is not power of two

Reject TCs creation with proper message if the first queue
assignment is not equal to the power of two.
The first queue number was checked too late in the second queue
iteration, if second queue was configured at all. Now if first queue value
is not a power of two, then trying to create qdisc will be rejected.

Fixes: 8f88b3034db3 ("i40e: Add infrastructure for queue channel support")
Signed-off-by: Grzegorz Szczurek <[email protected]>
Signed-off-by: Jedrzej Jagielski <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agoi40e: Fix warning message and call stack during rmmod i40e driver
Karen Sornek [Wed, 28 Apr 2021 08:19:41 +0000 (10:19 +0200)]
i40e: Fix warning message and call stack during rmmod i40e driver

Restore part of reset functionality used when reset is called
from the VF to reset itself. Without this fix warning message
is displayed when VF is being removed via sysfs.

Fix the crash of the VF during reset by ensuring
that the PF receives the reset message successfully.
Refactor code to use one function instead of two.

Fixes: 5c3c48ac6bf5 ("i40e: implement virtual device interface")
Signed-off-by: Grzegorz Szczurek <[email protected]>
Signed-off-by: Karen Sornek <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agoMerge tag 'gfs2-v5.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 17 Nov 2021 23:55:07 +0000 (15:55 -0800)]
Merge tag 'gfs2-v5.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:

 - The current iomap_file_buffered_write behavior of failing the entire
   write when part of the user buffer cannot be faulted in leads to an
   endless loop in gfs2. Work around that in gfs2 for now.

 - Various other bugs all over the place.

* tag 'gfs2-v5.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Prevent endless loops in gfs2_file_buffered_write
  gfs2: Fix "Introduce flag for glock holder auto-demotion"
  gfs2: Fix length of holes reported at end-of-file
  gfs2: release iopen glock early in evict
  gfs2: Fix atomic bug in gfs2_instantiate
  gfs2: Only dereference i->iov when iter_is_iovec(i)

3 years agoMerge tag 'mips-fixes_5.16_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips...
Linus Torvalds [Wed, 17 Nov 2021 23:12:50 +0000 (15:12 -0800)]
Merge tag 'mips-fixes_5.16_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:

 - wire futex_waitv syscall

 - build fixes for lantiq and bcm63xx configs

 - yamon-dt bugfix

* tag 'mips-fixes_5.16_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  mips: lantiq: add support for clk_get_parent()
  mips: bcm63xx: add support for clk_get_parent()
  MIPS: generic/yamon-dt: fix uninitialized variable error
  MIPS: syscalls: Wire up futex_waitv syscall

3 years agoMerge tag 'hyperv-fixes-signed-20211117' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 17 Nov 2021 16:46:15 +0000 (08:46 -0800)]
Merge tag 'hyperv-fixes-signed-20211117' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Fix ring size calculation for balloon driver (Boqun Feng)

 - Fix issues in Hyper-V setup code (Sean Christopherson)

* tag 'hyperv-fixes-signed-20211117' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Move required MSRs check to initial platform probing
  x86/hyperv: Fix NULL deref in set_hv_tscchange_cb() if Hyper-V setup fails
  Drivers: hv: balloon: Use VMBUS_RING_SIZE() wrapper for dm_ring_size

3 years agoMerge tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linux
Linus Torvalds [Wed, 17 Nov 2021 16:38:00 +0000 (08:38 -0800)]
Merge tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "This is just one bugfix for a buffer overflow in knfsd's xdr decoding"

* tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linux:
  NFSD: Fix exposure in nfsd4_decode_bitmap()

3 years agoi40e: Fix ping is lost after configuring ADq on VF
Eryk Rybak [Fri, 23 Apr 2021 11:43:26 +0000 (13:43 +0200)]
i40e: Fix ping is lost after configuring ADq on VF

Properly reconfigure VF VSIs after VF request ADQ.
Created new function to update queue mapping and queue pairs per TC
with AQ update VSI. This sets proper RSS size on NIC.
VFs num_queue_pairs should not be changed during setup of queue maps.
Previously, VF main VSI in ADQ had configured too many queues and had
wrong RSS size, which lead to packets not being consumed and drops in
connectivity.

Fixes: bc6d33c8d93f ("i40e: Fix the number of queues available to be mapped for use")
Co-developed-by: Przemyslaw Patynowski <[email protected]>
Signed-off-by: Przemyslaw Patynowski <[email protected]>
Signed-off-by: Eryk Rybak <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agoi40e: Fix changing previously set num_queue_pairs for PFs
Eryk Rybak [Fri, 23 Apr 2021 11:43:25 +0000 (13:43 +0200)]
i40e: Fix changing previously set num_queue_pairs for PFs

Currently, the i40e_vsi_setup_queue_map is basing the count of queues in
TCs on a VSI's alloc_queue_pairs member which is not changed throughout
any user's action (for example via ethtool's set_channels callback).

This implies that vsi->tc_config.tc_info[n].qcount value that is given
to the kernel via netdev_set_tc_queue() that notifies about the count of
queues per particular traffic class is constant even if user has changed
the total count of queues.

This in turn caused the kernel warning after setting the queue count to
the lower value than the initial one:

$ ethtool -l ens801f0
Channel parameters for ens801f0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       64
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       64

$ ethtool -L ens801f0 combined 40

[dmesg]
Number of in use tx queues changed invalidating tc mappings. Priority
traffic classification disabled!

Reason was that vsi->alloc_queue_pairs stayed at 64 value which was used
to set the qcount on TC0 (by default only TC0 exists so all of the
existing queues are assigned to TC0). we update the offset/qcount via
netdev_set_tc_queue() back to the old value but then the
netif_set_real_num_tx_queues() is using the vsi->num_queue_pairs as a
value which got set to 40.

Fix it by using vsi->req_queue_pairs as a queue count that will be
distributed across TCs. Do it only for non-zero values, which implies
that user actually requested the new count of queues.

For VSIs other than main, stay with the vsi->alloc_queue_pairs as we
only allow manipulating the queue count on main VSI.

Fixes: bc6d33c8d93f ("i40e: Fix the number of queues available to be mapped for use")
Co-developed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Maciej Fijalkowski <[email protected]>
Co-developed-by: Przemyslaw Patynowski <[email protected]>
Signed-off-by: Przemyslaw Patynowski <[email protected]>
Signed-off-by: Eryk Rybak <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agoi40e: Fix NULL ptr dereference on VSI filter sync
Michal Maloszewski [Wed, 24 Feb 2021 12:07:48 +0000 (12:07 +0000)]
i40e: Fix NULL ptr dereference on VSI filter sync

Remove the reason of null pointer dereference in sync VSI filters.
Added new I40E_VSI_RELEASING flag to signalize deleting and releasing
of VSI resources to sync this thread with sync filters subtask.
Without this patch it is possible to start update the VSI filter list
after VSI is removed, that's causing a kernel oops.

Fixes: 41c445ff0f48 ("i40e: main driver core")
Signed-off-by: Grzegorz Szczurek <[email protected]>
Signed-off-by: Michal Maloszewski <[email protected]>
Reviewed-by: Przemyslaw Patynowski <[email protected]>
Reviewed-by: Witold Fijalkowski <[email protected]>
Reviewed-by: Jaroslaw Gawin <[email protected]>
Reviewed-by: Aleksandr Loktionov <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agoi40e: Fix correct max_pkt_size on VF RX queue
Eryk Rybak [Thu, 21 Jan 2021 16:17:22 +0000 (16:17 +0000)]
i40e: Fix correct max_pkt_size on VF RX queue

Setting VLAN port increasing RX queue max_pkt_size
by 4 bytes to take VLAN tag into account.
Trigger the VF reset when setting port VLAN for
VF to renegotiate its capabilities and reinitialize.

Fixes: ba4e003d29c1 ("i40e: don't hold spinlock while resetting VF")
Signed-off-by: Sylwester Dziedziuch <[email protected]>
Signed-off-by: Aleksandr Loktionov <[email protected]>
Signed-off-by: Eryk Rybak <[email protected]>
Tested-by: Konrad Jankowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
3 years agonet: ax88796c: use bit numbers insetad of bit masks
Łukasz Stelmach [Tue, 16 Nov 2021 21:29:15 +0000 (22:29 +0100)]
net: ax88796c: use bit numbers insetad of bit masks

Change the values of EVENT_* constants from bit masks to bit numbers as
accepted by {clear,set,test}_bit() functions.

Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Łukasz Stelmach <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agonet: virtio_net_hdr_to_skb: count transport header in UFO
Jonathan Davies [Tue, 16 Nov 2021 17:42:42 +0000 (17:42 +0000)]
net: virtio_net_hdr_to_skb: count transport header in UFO

virtio_net_hdr_to_skb does not set the skb's gso_size and gso_type
correctly for UFO packets received via virtio-net that are a little over
the GSO size. This can lead to problems elsewhere in the networking
stack, e.g. ovs_vport_send dropping over-sized packets if gso_size is
not set.

This is due to the comparison

  if (skb->len - p_off > gso_size)

not properly accounting for the transport layer header.

p_off includes the size of the transport layer header (thlen), so
skb->len - p_off is the size of the TCP/UDP payload.

gso_size is read from the virtio-net header. For UFO, fragmentation
happens at the IP level so does not need to include the UDP header.

Hence the calculation could be comparing a TCP/UDP payload length with
an IP payload length, causing legitimate virtio-net packets to have
lack gso_type/gso_size information.

Example: a UDP packet with payload size 1473 has IP payload size 1481.
If the guest used UFO, it is not fragmented and the virtio-net header's
flags indicate that it is a GSO frame (VIRTIO_NET_HDR_GSO_UDP), with
gso_size = 1480 for an MTU of 1500.  skb->len will be 1515 and p_off
will be 42, so skb->len - p_off = 1473.  Hence the comparison fails, and
shinfo->gso_size and gso_type are not set as they should be.

Instead, add the UDP header length before comparing to gso_size when
using UFO. In this way, it is the size of the IP payload that is
compared to gso_size.

Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Jonathan Davies <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agonet: dpaa2-eth: fix use-after-free in dpaa2_eth_remove
Pavel Skripkin [Tue, 16 Nov 2021 15:17:12 +0000 (18:17 +0300)]
net: dpaa2-eth: fix use-after-free in dpaa2_eth_remove

Access to netdev after free_netdev() will cause use-after-free bug.
Move debug log before free_netdev() call to avoid it.

Fixes: 7472dd9f6499 ("staging: fsl-dpaa2/eth: Move print message")
Signed-off-by: Pavel Skripkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agonet: usb: r8152: Add MAC passthrough support for more Lenovo Docks
Aaron Ma [Tue, 16 Nov 2021 14:19:17 +0000 (22:19 +0800)]
net: usb: r8152: Add MAC passthrough support for more Lenovo Docks

Like ThinkaPad Thunderbolt 4 Dock, more Lenovo docks start to use the original
Realtek USB ethernet chip ID 0bda:8153.

Lenovo Docks always use their own IDs for usb hub, even for older Docks.
If parent hub is from Lenovo, then r8152 should try MAC passthrough.
Verified on Lenovo TBT3 dock too.

Signed-off-by: Aaron Ma <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
3 years agoDocumentation/process: fix a cross reference
Mauro Carvalho Chehab [Tue, 16 Nov 2021 12:11:23 +0000 (12:11 +0000)]
Documentation/process: fix a cross reference

The cross-reference for the handbooks section works. However, it is
meant to describe the path inside the Kernel's doc where the section
is, but there's an space instead of a dash, plus it lacks the .rst at
the end, which makes:

./scripts/documentation-file-ref-check

to complain.

Fixes: 604370e106cc ("Documentation/process: Add maintainer handbooks section")
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
3 years agoDocumentation: update vcpu-requests.rst reference
Mauro Carvalho Chehab [Tue, 16 Nov 2021 12:11:22 +0000 (12:11 +0000)]
Documentation: update vcpu-requests.rst reference

Changeset 2f5947dfcaec ("Documentation: move Documentation/virtual to Documentation/virt")
renamed: Documentation/virtual/kvm/vcpu-requests.rst
to: Documentation/virt/kvm/vcpu-requests.rst.

Update its cross-reference accordingly.

Fixes: 2f5947dfcaec ("Documentation: move Documentation/virtual to Documentation/virt")
Reviewed-by: Anup Patel <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
3 years agodocs: accounting: update delay-accounting.rst reference
Mauro Carvalho Chehab [Tue, 16 Nov 2021 12:11:21 +0000 (12:11 +0000)]
docs: accounting: update delay-accounting.rst reference

The file name: accounting/delay-accounting.rst
should be, instead: Documentation/accounting/delay-accounting.rst.

Also, there's no need to use doc:`foo`, as automarkup.py will
automatically handle plain text mentions to Documentation/
files.

So, update its cross-reference accordingly.

Fixes: fcb501704554 ("delayacct: Document task_delayacct sysctl")
Fixes: c3123552aad3 ("docs: accounting: convert to ReST")
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
3 years agolibbpf: update index.rst reference
Mauro Carvalho Chehab [Tue, 16 Nov 2021 12:11:20 +0000 (12:11 +0000)]
libbpf: update index.rst reference

Changeset d20b41115ad5 ("libbpf: Rename libbpf documentation index file")
renamed: Documentation/bpf/libbpf/libbpf.rst
to: Documentation/bpf/libbpf/index.rst.

Update its cross-reference accordingly.

Fixes: d20b41115ad5 ("libbpf: Rename libbpf documentation index file")
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
3 years agoMerge tag 'mlx5-fixes-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git...
David S. Miller [Wed, 17 Nov 2021 10:50:53 +0000 (10:50 +0000)]
Merge tag 'mlx5-fixes-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-fixes-2021-11-16

Please pull this mlx5 fixes series, or let me know in case of any problem.
====================

Signed-off-by: David S. Miller <[email protected]>
3 years agoparisc/sticon: fix reverse colors
Sven Schnelle [Sun, 14 Nov 2021 16:08:17 +0000 (17:08 +0100)]
parisc/sticon: fix reverse colors

sticon_build_attr() checked the reverse argument and flipped
background and foreground color, but returned the non-reverse
value afterwards. Fix this and also add two local variables
for foreground and background color to make the code easier
to read.

Signed-off-by: Sven Schnelle <[email protected]>
Cc: <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
3 years agofs: handle circular mappings correctly
Christian Brauner [Tue, 9 Nov 2021 14:57:12 +0000 (15:57 +0100)]
fs: handle circular mappings correctly

When calling setattr_prepare() to determine the validity of the attributes the
ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id.
When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id
matches the current {g,u}id attribute the attribute change is allowed.

The value in ia_{g,u}id does already account for idmapped mounts and will have
taken the relevant idmapping into account. So in order to verify that the
{g,u}id attribute isn't changed we simple need to compare the ia_{g,u}id value
against the inode's i_{g,u}id value.

This only has any meaning for idmapped mounts as idmapping helpers are
idempotent without them. And for idmapped mounts this really only has a meaning
when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped
to id 1001 and id 1001 is mapped to id 1000. Such ciruclar mappings can e.g. be
useful when sharing the same home directory between multiple users at the same
time.

As an example consider a directory with two files: /source/file1 owned by
{g,u}id 1000 and /source/file2 owned by {g,u}id 1001. Assume we create an
idmapped mount at /target with an idmapping that maps files owned by {g,u}id
1000 to being owned by {g,u}id 1001 and files owned by {g,u}id 1001 to being
owned by {g,u}id 1000. In effect, the idmapped mount at /target switches the
ownership of /source/file1 and source/file2, i.e. /target/file1 will be owned
by {g,u}id 1001 and /target/file2 will be owned by {g,u}id 1000.

This means that a user with fs{g,u}id 1000 must be allowed to setattr
/target/file2 from {g,u}id 1000 to {g,u}id 1000. Similar, a user with fs{g,u}id
1001 must be allowed to setattr /target/file1 from {g,u}id 1001 to {g,u}id
1001. Conversely, a user with fs{g,u}id 1000 must fail to setattr /target/file1
from {g,u}id 1001 to {g,u}id 1000. And a user with fs{g,u}id 1001 must fail to
setattr /target/file2 from {g,u}id 1000 to {g,u}id 1000. Both cases must fail
with EPERM for non-capable callers.

Before this patch we could end up denying legitimate attribute changes and
allowing invalid attribute changes when circular mappings are used. To even get
into this situation the caller must've been privileged both to create that
mapping and to create that idmapped mount.

This hasn't been seen in the wild anywhere but came up when expanding the
testsuite during work on a series of hardening patches. All idmapped fstests
pass without any regressions and we add new tests to verify the behavior of
circular mappings.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts")
Cc: Seth Forshee <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
CC: [email protected]
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Seth Forshee <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
3 years agonet: stmmac: Fix signed/unsigned wreckage
Thomas Gleixner [Mon, 15 Nov 2021 15:21:23 +0000 (16:21 +0100)]
net: stmmac: Fix signed/unsigned wreckage

The recent addition of timestamp correction to compensate the CDC error
introduced a subtle signed/unsigned bug in stmmac_get_tx_hwtstamp() while
it managed for some obscure reason to avoid that in stmmac_get_rx_hwtstamp().

The issue is:

    s64 adjust = 0;
    u64 ns;

    adjust += -(2 * (NSEC_PER_SEC / priv->plat->clk_ptp_rate));
    ns += adjust;

works by chance on 64bit, but falls apart on 32bit because the compiler
knows that adjust fits into 32bit and then treats the addition as a u64 +
u32 resulting in an off by ~2 seconds failure.

The RX variant uses an u64 for adjust and does the adjustment via

    ns -= adjust;

because consistency is obviously overrated.

Get rid of the pointless zero initialized adjust variable and do:

ns -= (2 * NSEC_PER_SEC) / priv->plat->clk_ptp_rate;

which is obviously correct and spares the adjust obfuscation. Aside of that
it yields a more accurate result because the multiplication takes place
before the integer divide truncation and not afterwards.

Stick the calculation into an inline so it can't be accidentally
disimproved. Return an u32 from that inline as the result is guaranteed
to fit which lets the compiler optimize the substraction.

Cc: [email protected]
Fixes: 3600be5f58c1 ("net: stmmac: add timestamp correction to rid CDC sync error")
Reported-by: Benedikt Spranger <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Benedikt Spranger <[email protected]>
Tested-by: Kurt Kanzenbach <[email protected]> # Intel EHL
Link: https://lore.kernel.org/r/87mtm578cs.ffs@tglx
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoMerge branch 'net-fix-the-mirred-packet-drop-due-to-the-incorrect-dst'
Jakub Kicinski [Wed, 17 Nov 2021 03:17:41 +0000 (19:17 -0800)]
Merge branch 'net-fix-the-mirred-packet-drop-due-to-the-incorrect-dst'

Xin Long says:

====================
net: fix the mirred packet drop due to the incorrect dst

This issue was found when using OVS HWOL on OVN-k8s. These packets
dropped on rx path were seen with output dst, which should've been
dropped from the skbs when redirecting them.

The 1st patch is to the fix and the 2nd is a selftest to reproduce
and verify it.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoselftests: add a test case for mirred egress to ingress
Davide Caratti [Fri, 12 Nov 2021 16:33:12 +0000 (11:33 -0500)]
selftests: add a test case for mirred egress to ingress

add a selftest that verifies the correct behavior of TC act_mirred egress
to ingress: in particular, it checks if the dst_entry is removed from skb
before redirect egress -> ingress. The correct behavior is: an ICMP 'echo
request' generated by ping will be received and generate a reply the same
way as the one generated by mausezahn.

Suggested-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Acked-by: Cong Wang <[email protected]>
Reviewed-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agonet: sched: act_mirred: drop dst for the direction from egress to ingress
Xin Long [Fri, 12 Nov 2021 16:33:11 +0000 (11:33 -0500)]
net: sched: act_mirred: drop dst for the direction from egress to ingress

Without dropping dst, the packets sent from local mirred/redirected
to ingress will may still use the old dst. ip_rcv() will drop it as
the old dst is for output and its .input is dst_discard.

This patch is to fix by also dropping dst for those packets that are
mirred or redirected from egress to ingress in act_mirred.

Note that we don't drop it for the direction change from ingress to
egress, as on which there might be a user case attaching a metadata
dst by act_tunnel_key that would be used later.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Cong Wang <[email protected]>
Reviewed-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoamt: cancel delayed_work synchronously in amt_fini()
Taehee Yoo [Tue, 16 Nov 2021 16:09:23 +0000 (16:09 +0000)]
amt: cancel delayed_work synchronously in amt_fini()

When the amt module is being removed, it calls cancel_delayed_work()
to cancel pending delayed_work. But this function doesn't wait for
canceling delayed_work.
So, workers can be still doing after module delete.

In order to avoid this, cancel_delayed_work_sync() should be used instead.

Suggested-by: Jakub Kicinski <[email protected]>
Fixes: bc54e49c140b ("amt: add multicast(IGMP) report message handler")
Signed-off-by: Taehee Yoo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoMAINTAINERS: remove [email protected]
Pavel Skripkin [Tue, 16 Nov 2021 14:13:03 +0000 (17:13 +0300)]
MAINTAINERS: remove [email protected]

I've sent a patch to [email protected] few days ago and
got a reply from [email protected]:

Delivery has failed to these recipients or groups:

[email protected]<mailto:[email protected]>
The email address you entered couldn't be found. Please check the
recipient's email address and try to resend the message. If the problem
continues, please contact your helpdesk.

As requested by Alok Prasad, replacing [email protected]
with Manish Chopra's email address. [0]

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Pavel Skripkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agobnxt_en: Fix compile error regression when CONFIG_BNXT_SRIOV is not set
Michael Chan [Tue, 16 Nov 2021 19:26:10 +0000 (14:26 -0500)]
bnxt_en: Fix compile error regression when CONFIG_BNXT_SRIOV is not set

bp->sriov_cfg is not defined when CONFIG_BNXT_SRIOV is not set.  Fix
it by adding a helper function bnxt_sriov_cfg() to handle the logic
with or without the config option.

Fixes: 46d08f55d24e ("bnxt_en: extend RTNL to VF check in devlink driver_reinit")
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Edwin Peer <[email protected]>
Reviewed-by: Andy Gospodarek <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agonet: mvmdio: fix compilation warning
Marcin Wojtas [Mon, 15 Nov 2021 15:30:24 +0000 (16:30 +0100)]
net: mvmdio: fix compilation warning

The kernel test robot reported a following issue:

>> drivers/net/ethernet/marvell/mvmdio.c:426:36: warning:
unused variable 'orion_mdio_acpi_match' [-Wunused-const-variable]
   static const struct acpi_device_id orion_mdio_acpi_match[] = {
                                      ^
   1 warning generated.

Fix that by surrounding the variable by appropriate ifdef.

Fixes: c54da4c1acb1 ("net: mvmdio: add ACPI support")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Marcin Wojtas <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoMerge tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Wed, 17 Nov 2021 00:53:58 +0000 (16:53 -0800)]
Merge tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Couple of fixes:
 * bad dont-reorder check
 * throughput LED trigger for various new(ish) paths
 * radiotap header generation
 * locking assertions in mac80211 with monitor mode
 * radio statistics
 * don't try to access IV when not present
 * call stop_ap for P2P_GO as well as we should

* tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
  mac80211: fix throughput LED trigger
  mac80211: fix monitor_sdata RCU/locking assertions
  mac80211: drop check for DONT_REORDER in __ieee80211_select_queue
  mac80211: fix radiotap header generation
  mac80211: do not access the IV when it was stripped
  nl80211: fix radio statistics in survey dump
  cfg80211: call cfg80211_stop_ap when switch from P2P_GO type
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agoMerge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Jakub Kicinski [Wed, 17 Nov 2021 00:53:47 +0000 (16:53 -0800)]
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2021-11-16

We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).

The main changes are:

1) Fix pruning regression where verifier went overly conservative rejecting
   previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.

2) Fix verifier TOCTOU bug when using read-only map's values as constant
   scalars during verification, from Daniel Borkmann.

3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.

4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.

5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
   programs due to deadlock possibilities, from Dmitrii Banshchikov.

6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.

7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.

8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  udp: Validate checksum in udp_read_sock()
  bpf: Fix toctou on read-only map's constant scalar tracking
  samples/bpf: Fix build error due to -isystem removal
  selftests/bpf: Add tests for restricted helpers
  bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
  libbpf: Perform map fd cleanup for gen_loader in case of error
  samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
  tools/runqslower: Fix cross-build
  samples/bpf: Fix summary per-sec stats in xdp_sample_user
  selftests/bpf: Check map in map pruning
  bpf: Fix inner map state pruning regression.
  xsk: Fix crash on double free in buffer pool
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
3 years agonet/mlx5: E-Switch, return error if encap isn't supported
Raed Salem [Mon, 1 Nov 2021 14:18:53 +0000 (16:18 +0200)]
net/mlx5: E-Switch, return error if encap isn't supported

On regular ConnectX HCAs getting encap mode isn't supported when the
E-Switch is in NONE mode. Current code would return no error code when
trying to get encap mode in such case which is wrong.

Fix by returning error value to indicate failure to caller in such case.

Fixes: 8e0aa4bc959c ("net/mlx5: E-switch, Protect eswitch mode changes")
Signed-off-by: Raed Salem <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Reviewed-by: Maor Dickman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: Lag, update tracker when state change event received
Maher Sanalla [Fri, 5 Nov 2021 09:19:48 +0000 (11:19 +0200)]
net/mlx5: Lag, update tracker when state change event received

Currently, In NETDEV_CHANGELOWERSTATE/NETDEV_CHANGEUPPERSTATE events
handling, tracking is not fully completed if the LAG device is not ready
at the time the events occur. But, we must keep track of the upper and
lower states after receiving the events because RoCE needs this info in
mlx5_lag_get_roce_netdev() - in order to return the corresponding port
that its running on. Returning the wrong (not most recent) port will lead
to gids table being incorrect.

For example: If during the attachment of a slave to the bond, the other
non-attached port performs pci_reload, then the LAG device is not ready,
but that should not result in dismissing attached slave tracker update
automatically (which is performed in mlx5_handle_changelowerstate()), Since
these events might not come later, which can lead to both bond ports
having tx_enabled=0 - which is not a valid state of LAG bond.

Fixes: 9b412cc35f00 ("net/mlx5e: Add LAG warning if bond slave is not lag master")
Signed-off-by: Maher Sanalla <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Reviewed-by: Jianbo Liu <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5e: CT, Fix multiple allocations and memleak of mod acts
Roi Dayan [Mon, 8 Nov 2021 14:41:05 +0000 (16:41 +0200)]
net/mlx5e: CT, Fix multiple allocations and memleak of mod acts

CT clear action offload adds additional mod hdr actions to the
flow's original mod actions in order to clear the registers which
hold ct_state.
When such flow also includes encap action, a neigh update event
can cause the driver to unoffload the flow and then reoffload it.

Each time this happens, the ct clear handling adds that same set
of mod hdr actions to reset ct_state until the max of mod hdr
actions is reached.

Also the driver never releases the allocated mod hdr actions and
causing a memleak.

Fix above two issues by moving CT clear mod acts allocation
into the parsing actions phase and only use it when offloading the rule.
The release of mod acts will be done in the normal flow_put().

 backtrace:
    [<000000007316e2f3>] krealloc+0x83/0xd0
    [<00000000ef157de1>] mlx5e_mod_hdr_alloc+0x147/0x300 [mlx5_core]
    [<00000000970ce4ae>] mlx5e_tc_match_to_reg_set_and_get_id+0xd7/0x240 [mlx5_core]
    [<0000000067c5fa17>] mlx5e_tc_match_to_reg_set+0xa/0x20 [mlx5_core]
    [<00000000d032eb98>] mlx5_tc_ct_entry_set_registers.isra.0+0x36/0xc0 [mlx5_core]
    [<00000000fd23b869>] mlx5_tc_ct_flow_offload+0x272/0x1f10 [mlx5_core]
    [<000000004fc24acc>] mlx5e_tc_offload_fdb_rules.part.0+0x150/0x620 [mlx5_core]
    [<00000000dc741c17>] mlx5e_tc_encap_flows_add+0x489/0x690 [mlx5_core]
    [<00000000e92e49d7>] mlx5e_rep_update_flows+0x6e4/0x9b0 [mlx5_core]
    [<00000000f60f5602>] mlx5e_rep_neigh_update+0x39a/0x5d0 [mlx5_core]

Fixes: 1ef3018f5af3 ("net/mlx5e: CT: Support clear action")
Signed-off-by: Roi Dayan <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Reviewed-by: Maor Dickman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: Fix flow counters SF bulk query len
Avihai Horon [Wed, 3 Nov 2021 11:04:23 +0000 (13:04 +0200)]
net/mlx5: Fix flow counters SF bulk query len

When doing a flow counters bulk query, the number of counters to query
must be aligned to 4. Current SF bulk query len is not aligned to 4,
which leads to an error when trying to query more than 4 counters.

Fix it by aligning SF bulk query len to 4.

Fixes: 2fdeb4f4c2ae ("net/mlx5: Reduce flow counters bulk query buffer size for SFs")
Signed-off-by: Avihai Horon <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: E-Switch, rebuild lag only when needed
Mark Bloch [Wed, 10 Nov 2021 15:19:12 +0000 (15:19 +0000)]
net/mlx5: E-Switch, rebuild lag only when needed

A user can enable VFs without changing E-Switch mode, this can happen
when a user moves straight to switchdev mode and only once in switchdev
VFs are enabled via the sysfs interface.

The cited commit assumed this isn't possible and exposed a single
API function where the E-switch calls into the lag code, breaks the lag
and prevents any other lag operations to take place until the
E-switch update has ended.

Breaking the hardware lag when it isn't needed can make it such that
hardware lag can't be enabled again.

In the sysfs call path check if the current E-Switch mode is NONE,
in the context of the function it can only mean the E-Switch is moving
out of NONE mode and the hardware lag should be disabled and enabled
once the mode change has ended. If the mode isn't NONE it means
VFs are about to be enabled and such operation doesn't require
toggling the hardware lag.

Fixes: cac1eb2cf2e3 ("net/mlx5: Lag, properly lock eswitch if needed")
Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: Update error handler for UCTX and UMEM
Neta Ostrovsky [Wed, 27 Oct 2021 12:16:14 +0000 (15:16 +0300)]
net/mlx5: Update error handler for UCTX and UMEM

In the fast unload flow, the device state is set to internal error,
which indicates that the driver started the destroy process.
In this case, when a destroy command is being executed, it should return
MLX5_CMD_STAT_OK.
Fix MLX5_CMD_OP_DESTROY_UCTX and MLX5_CMD_OP_DESTROY_UMEM to return OK
instead of EIO.

This fixes a call trace in the umem release process -
[ 2633.536695] Call Trace:
[ 2633.537518]  ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs]
[ 2633.538596]  remove_client_context+0x8b/0xd0 [ib_core]
[ 2633.539641]  disable_device+0x8c/0x130 [ib_core]
[ 2633.540615]  __ib_unregister_device+0x35/0xa0 [ib_core]
[ 2633.541640]  ib_unregister_device+0x21/0x30 [ib_core]
[ 2633.542663]  __mlx5_ib_remove+0x38/0x90 [mlx5_ib]
[ 2633.543640]  auxiliary_bus_remove+0x1e/0x30 [auxiliary]
[ 2633.544661]  device_release_driver_internal+0x103/0x1f0
[ 2633.545679]  bus_remove_device+0xf7/0x170
[ 2633.546640]  device_del+0x181/0x410
[ 2633.547606]  mlx5_rescan_drivers_locked.part.10+0x63/0x160 [mlx5_core]
[ 2633.548777]  mlx5_unregister_device+0x27/0x40 [mlx5_core]
[ 2633.549841]  mlx5_uninit_one+0x21/0xc0 [mlx5_core]
[ 2633.550864]  remove_one+0x69/0xe0 [mlx5_core]
[ 2633.551819]  pci_device_remove+0x3b/0xc0
[ 2633.552731]  device_release_driver_internal+0x103/0x1f0
[ 2633.553746]  unbind_store+0xf6/0x130
[ 2633.554657]  kernfs_fop_write+0x116/0x190
[ 2633.555567]  vfs_write+0xa5/0x1a0
[ 2633.556407]  ksys_write+0x4f/0xb0
[ 2633.557233]  do_syscall_64+0x5b/0x1a0
[ 2633.558071]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2633.559018] RIP: 0033:0x7f9977132648
[ 2633.559821] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 6f 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 2633.562332] RSP: 002b:00007fffb1a83888 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2633.563472] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f9977132648
[ 2633.564541] RDX: 000000000000000c RSI: 000055b90546e230 RDI: 0000000000000001
[ 2633.565596] RBP: 000055b90546e230 R08: 00007f9977406860 R09: 00007f9977a54740
[ 2633.566653] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f99774056e0
[ 2633.567692] R13: 000000000000000c R14: 00007f9977400880 R15: 000000000000000c
[ 2633.568725] ---[ end trace 10b4fe52945e544d ]---

Fixes: 6a6fabbfa3e8 ("net/mlx5: Update pci error handler entries and command translation")
Signed-off-by: Neta Ostrovsky <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: DR, Fix check for unsupported fields in match param
Yevgeny Kliteynik [Wed, 3 Nov 2021 15:51:03 +0000 (17:51 +0200)]
net/mlx5: DR, Fix check for unsupported fields in match param

The existing loop doesn't cast the buffer while scanning it, which
results in out-of-bounds read and failure to create the matcher.

Fixes: 941f19798a11 ("net/mlx5: DR, Add check for unsupported fields in match param")
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: DR, Handle eswitch manager and uplink vports separately
Yevgeny Kliteynik [Tue, 2 Nov 2021 23:09:04 +0000 (01:09 +0200)]
net/mlx5: DR, Handle eswitch manager and uplink vports separately

When querying eswitch manager vport capabilities as "other = 1",
we encounter a FW compatibility issue with older FW versions.
To maintain backward compatibility, eswitch manager vport should
be queried as "other = 0" vport both for ECPF and non-ECPF cases.

This patch fixes these queries and improves the code readability
by handling eswitch manager and uplink vports separately, avoiding
the excessive 'if' conditions. Also, uplink caps are stored similar
to esw manager and not as part of xarray.

Fixes: dd4acb2a0954 ("net/mlx5: DR, Add missing query for vport 0")
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5e: nullify cq->dbg pointer in mlx5_debug_cq_remove()
Valentine Fatiev [Tue, 26 Oct 2021 08:42:41 +0000 (11:42 +0300)]
net/mlx5e: nullify cq->dbg pointer in mlx5_debug_cq_remove()

Prior to this patch in case mlx5_core_destroy_cq() failed it proceeds
to rest of destroy operations. mlx5_core_destroy_cq() could be called again
by user and cause additional call of mlx5_debug_cq_remove().
cq->dbg was not nullify in previous call and cause the crash.

Fix it by nullify cq->dbg pointer after removal.

Also proceed to destroy operations only if FW return 0
for MLX5_CMD_OP_DESTROY_CQ command.

general protection fault, probably for non-canonical address 0x2000300004058: 0000 [#1] SMP PTI
CPU: 5 PID: 1228 Comm: python Not tainted 5.15.0-rc5_for_upstream_min_debug_2021_10_14_11_06 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:lockref_get+0x1/0x60
Code: 5d e9 53 ff ff ff 48 8d 7f 70 e8 0a 2e 48 00 c7 85 d0 00 00 00 02
00 00 00 c6 45 70 00 fb 5d c3 c3 cc cc cc cc cc cc cc cc 53 <48> 8b 17
48 89 fb 85 d2 75 3d 48 89 d0 bf 64 00 00 00 48 89 c1 48
RSP: 0018:ffff888137dd7a38 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff888107d5f458 RCX: 00000000fffffffe
RDX: 000000000002c2b0 RSI: ffffffff8155e2e0 RDI: 0002000300004058
RBP: ffff888137dd7a88 R08: 0002000300004058 R09: ffff8881144a9f88
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881141d4000
R13: ffff888137dd7c68 R14: ffff888137dd7d58 R15: ffff888137dd7cc0
FS:  00007f4644f2a4c0(0000) GS:ffff8887a2d40000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b4500f4380 CR3: 0000000114f7a003 CR4: 0000000000170ea0
Call Trace:
  simple_recursive_removal+0x33/0x2e0
  ? debugfs_remove+0x60/0x60
  debugfs_remove+0x40/0x60
  mlx5_debug_cq_remove+0x32/0x70 [mlx5_core]
  mlx5_core_destroy_cq+0x41/0x1d0 [mlx5_core]
  devx_obj_cleanup+0x151/0x330 [mlx5_ib]
  ? __pollwait+0xd0/0xd0
  ? xas_load+0x5/0x70
  ? xa_load+0x62/0xa0
  destroy_hw_idr_uobject+0x20/0x80 [ib_uverbs]
  uverbs_destroy_uobject+0x3b/0x360 [ib_uverbs]
  uobj_destroy+0x54/0xa0 [ib_uverbs]
  ib_uverbs_cmd_verbs+0xaf2/0x1160 [ib_uverbs]
  ? uverbs_finalize_object+0xd0/0xd0 [ib_uverbs]
  ib_uverbs_ioctl+0xc4/0x1b0 [ib_uverbs]
  __x64_sys_ioctl+0x3e4/0x8e0

Fixes: 94b960b9deff ("net/mlx5e: Fix memory leak in mlx5_core_destroy_cq() error path")
Signed-off-by: Valentine Fatiev <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5: E-Switch, Fix resetting of encap mode when entering switchdev
Paul Blakey [Thu, 20 May 2021 14:09:58 +0000 (17:09 +0300)]
net/mlx5: E-Switch, Fix resetting of encap mode when entering switchdev

E-Switch encap mode is relevant only when in switchdev mode.
The RDMA driver can query the encap configuration via
mlx5_eswitch_get_encap_mode(). Make sure it returns the currently
used mode and not the set one.

This reverts the cited commit which reset the encap mode
on entering switchdev and fixes the original issue properly.

Fixes: 9a64144d683a ("net/mlx5: E-Switch, Fix default encap mode")
Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Reviewed-by: Maor Dickman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5e: Wait for concurrent flow deletion during neigh/fib events
Vlad Buslov [Thu, 21 Oct 2021 15:15:10 +0000 (18:15 +0300)]
net/mlx5e: Wait for concurrent flow deletion during neigh/fib events

Function mlx5e_take_tmp_flow() skips flows with zero reference count. This
can cause syndrome 0x179e84 when the called from neigh or route update code
and the skipped flow is not removed from the hardware by the time
underlying encap/decap resource is deleted. Add new completion
'del_hw_done' that is completed when flow is unoffloaded. This is safe to
do because flow with reference count zero needs to be detached from
encap/decap entry before its memory is deallocated, which requires taking
the encap_tbl_lock mutex that is held by the event handlers code.

Fixes: 8914add2c9e5 ("net/mlx5e: Handle FIB events to update tunnel endpoint device")
Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agonet/mlx5e: kTLS, Fix crash in RX resync flow
Tariq Toukan [Wed, 15 Sep 2021 10:25:31 +0000 (13:25 +0300)]
net/mlx5e: kTLS, Fix crash in RX resync flow

For the TLS RX resync flow, we maintain a list of TLS contexts
that require some attention, to communicate their resync information
to the HW.
Here we fix list corruptions, by protecting the entries against
movements coming from resync_handle_seq_match(), until their resync
handling in napi is fully completed.

Fixes: e9ce991bce5b ("net/mlx5e: kTLS, Add resiliency to RX resync failures")
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Maxim Mikityanskiy <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
3 years agoMerge branch 'kvm-selftest' into kvm-master
Paolo Bonzini [Tue, 16 Nov 2021 12:44:13 +0000 (07:44 -0500)]
Merge branch 'kvm-selftest' into kvm-master

- Cleanups for the perf test infrastructure and mapping hugepages

- Avoid contention on mmap_sem when the guests start to run

- Add event channel upcall support to xen_shinfo_test

3 years agonet: ieee802154: handle iftypes as u32
Alexander Aring [Fri, 12 Nov 2021 03:09:16 +0000 (22:09 -0500)]
net: ieee802154: handle iftypes as u32

This patch fixes an issue that an u32 netlink value is handled as a
signed enum value which doesn't fit into the range of u32 netlink type.
If it's handled as -1 value some BIT() evaluation ends in a
shift-out-of-bounds issue. To solve the issue we set the to u32 max which
is s32 "-1" value to keep backwards compatibility and let the followed enum
values start counting at 0. This brings the compiler to never handle the
enum as signed and a check if the value is above NL802154_IFTYPE_MAX should
filter -1 out.

Fixes: f3ea5e44231a ("ieee802154: add new interface command")
Signed-off-by: Alexander Aring <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>
3 years agobtrfs: deprecate BTRFS_IOC_BALANCE ioctl
Nikolay Borisov [Wed, 10 Nov 2021 11:41:04 +0000 (13:41 +0200)]
btrfs: deprecate BTRFS_IOC_BALANCE ioctl

The v2 balance ioctl has been introduced more than 9 years ago. Users of
the old v1 ioctl should have long been migrated to it. It's time we
deprecate it and eventually remove it.

The only known user is in btrfs-progs that tries v1 as a fallback in
case v2 is not supported. This is not necessary anymore.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
3 years agobtrfs: make 1-bit bit-fields of scrub_page unsigned int
Colin Ian King [Wed, 10 Nov 2021 19:20:08 +0000 (19:20 +0000)]
btrfs: make 1-bit bit-fields of scrub_page unsigned int

The bitfields have_csum and io_error are currently signed which is not
recommended as the representation is an implementation defined
behaviour. Fix this by making the bit-fields unsigned ints.

Fixes: 2c36395430b0 ("btrfs: scrub: remove the anonymous structure from scrub_page")
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
3 years agobtrfs: check-integrity: fix a warning on write caching disabled disk
Wang Yugui [Wed, 27 Oct 2021 22:32:54 +0000 (06:32 +0800)]
btrfs: check-integrity: fix a warning on write caching disabled disk

When a disk has write caching disabled, we skip submission of a bio with
flush and sync requests before writing the superblock, since it's not
needed. However when the integrity checker is enabled, this results in
reports that there are metadata blocks referred by a superblock that
were not properly flushed. So don't skip the bio submission only when
the integrity checker is enabled for the sake of simplicity, since this
is a debug tool and not meant for use in non-debug builds.

fstests/btrfs/220 trigger a check-integrity warning like the following
when CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and the disk with WCE=0.

  btrfs: attempt to write superblock which references block M @5242880 (sdb2/5242880/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
  ------------[ cut here ]------------
  WARNING: CPU: 28 PID: 843680 at fs/btrfs/check-integrity.c:2196 btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
  CPU: 28 PID: 843680 Comm: umount Not tainted 5.15.0-0.rc5.39.el8.x86_64 #1
  Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019
  RIP: 0010:btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
  RSP: 0018:ffffb642afb47940 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
  RDX: 00000000ffffffff RSI: ffff8b722fc97d00 RDI: ffff8b722fc97d00
  RBP: ffff8b5601c00000 R08: 0000000000000000 R09: c0000000ffff7fff
  R10: 0000000000000001 R11: ffffb642afb476f8 R12: ffffffffffffffff
  R13: ffffb642afb47974 R14: ffff8b5499254c00 R15: 0000000000000003
  FS:  00007f00a06d4080(0000) GS:ffff8b722fc80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fff5cff5ff0 CR3: 00000001c0c2a006 CR4: 00000000001706e0
  Call Trace:
   btrfsic_process_written_block+0x2f7/0x850 [btrfs]
   __btrfsic_submit_bio.part.19+0x310/0x330 [btrfs]
   ? bio_associate_blkg_from_css+0xa4/0x2c0
   btrfsic_submit_bio+0x18/0x30 [btrfs]
   write_dev_supers+0x81/0x2a0 [btrfs]
   ? find_get_pages_range_tag+0x219/0x280
   ? pagevec_lookup_range_tag+0x24/0x30
   ? __filemap_fdatawait_range+0x6d/0xf0
   ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
   ? find_first_extent_bit+0x9b/0x160 [btrfs]
   ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
   write_all_supers+0x1b3/0xa70 [btrfs]
   ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
   btrfs_commit_transaction+0x59d/0xac0 [btrfs]
   close_ctree+0x11d/0x339 [btrfs]
   generic_shutdown_super+0x71/0x110
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0xb8/0x140
   task_work_run+0x6d/0xb0
   exit_to_user_mode_prepare+0x1f0/0x200
   syscall_exit_to_user_mode+0x12/0x30
   do_syscall_64+0x46/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7f009f711dfb
  RSP: 002b:00007fff5cff7928 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 000055b68c6c9970 RCX: 00007f009f711dfb
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b68c6c9b50
  RBP: 0000000000000000 R08: 000055b68c6ca900 R09: 00007f009f795580
  R10: 0000000000000000 R11: 0000000000000246 R12: 000055b68c6c9b50
  R13: 00007f00a04bf184 R14: 0000000000000000 R15: 00000000ffffffff
  ---[ end trace 2c4b82abcef9eec4 ]---
  S-65536(sdb2/65536/1)
   -->
  M-1064960(sdb2/1064960/1)

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Wang Yugui <[email protected]>
Signed-off-by: David Sterba <[email protected]>
3 years agobtrfs: silence lockdep when reading chunk tree during mount
Filipe Manana [Thu, 4 Nov 2021 12:43:08 +0000 (12:43 +0000)]
btrfs: silence lockdep when reading chunk tree during mount

Often some test cases like btrfs/161 trigger lockdep splats that complain
about possible unsafe lock scenario due to the fact that during mount,
when reading the chunk tree we end up calling blkdev_get_by_path() while
holding a read lock on a leaf of the chunk tree. That produces a lockdep
splat like the following:

[ 3653.683975] ======================================================
[ 3653.685148] WARNING: possible circular locking dependency detected
[ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted
[ 3653.687239] ------------------------------------------------------
[ 3653.688400] mount/447465 is trying to acquire lock:
[ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.691054]
               but task is already holding lock:
[ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.693978]
               which lock already depends on the new lock.

[ 3653.695510]
               the existing dependency chain (in reverse order) is:
[ 3653.696915]
               -> #3 (btrfs-chunk-00){++++}-{3:3}:
[ 3653.698053]        down_read_nested+0x4b/0x140
[ 3653.698893]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.699988]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 3653.701205]        btrfs_search_slot+0x537/0xc00 [btrfs]
[ 3653.702234]        btrfs_insert_empty_items+0x32/0x70 [btrfs]
[ 3653.703332]        btrfs_init_new_device+0x563/0x15b0 [btrfs]
[ 3653.704439]        btrfs_ioctl+0x2110/0x3530 [btrfs]
[ 3653.705405]        __x64_sys_ioctl+0x83/0xb0
[ 3653.706215]        do_syscall_64+0x3b/0xc0
[ 3653.706990]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.708040]
               -> #2 (sb_internal#2){.+.+}-{0:0}:
[ 3653.708994]        lock_release+0x13d/0x4a0
[ 3653.709533]        up_write+0x18/0x160
[ 3653.710017]        btrfs_sync_file+0x3f3/0x5b0 [btrfs]
[ 3653.710699]        __loop_update_dio+0xbd/0x170 [loop]
[ 3653.711360]        lo_ioctl+0x3b1/0x8a0 [loop]
[ 3653.711929]        block_ioctl+0x48/0x50
[ 3653.712442]        __x64_sys_ioctl+0x83/0xb0
[ 3653.712991]        do_syscall_64+0x3b/0xc0
[ 3653.713519]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.714233]
               -> #1 (&lo->lo_mutex){+.+.}-{3:3}:
[ 3653.715026]        __mutex_lock+0x92/0x900
[ 3653.715648]        lo_open+0x28/0x60 [loop]
[ 3653.716275]        blkdev_get_whole+0x28/0x90
[ 3653.716867]        blkdev_get_by_dev.part.0+0x142/0x320
[ 3653.717537]        blkdev_open+0x5e/0xa0
[ 3653.718043]        do_dentry_open+0x163/0x390
[ 3653.718604]        path_openat+0x3f0/0xa80
[ 3653.719128]        do_filp_open+0xa9/0x150
[ 3653.719652]        do_sys_openat2+0x97/0x160
[ 3653.720197]        __x64_sys_openat+0x54/0x90
[ 3653.720766]        do_syscall_64+0x3b/0xc0
[ 3653.721285]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.721986]
               -> #0 (&disk->open_mutex){+.+.}-{3:3}:
[ 3653.722775]        __lock_acquire+0x130e/0x2210
[ 3653.723348]        lock_acquire+0xd7/0x310
[ 3653.723867]        __mutex_lock+0x92/0x900
[ 3653.724394]        blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.725041]        blkdev_get_by_path+0xb8/0xd0
[ 3653.725614]        btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
[ 3653.726332]        open_fs_devices+0xd7/0x2c0 [btrfs]
[ 3653.726999]        btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
[ 3653.727739]        open_ctree+0xb8e/0x17bf [btrfs]
[ 3653.728384]        btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 3653.729130]        legacy_get_tree+0x30/0x50
[ 3653.729676]        vfs_get_tree+0x28/0xc0
[ 3653.730192]        vfs_kern_mount.part.0+0x71/0xb0
[ 3653.730800]        btrfs_mount+0x11d/0x3a0 [btrfs]
[ 3653.731427]        legacy_get_tree+0x30/0x50
[ 3653.731970]        vfs_get_tree+0x28/0xc0
[ 3653.732486]        path_mount+0x2d4/0xbe0
[ 3653.732997]        __x64_sys_mount+0x103/0x140
[ 3653.733560]        do_syscall_64+0x3b/0xc0
[ 3653.734080]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.734782]
               other info that might help us debug this:

[ 3653.735784] Chain exists of:
                 &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00

[ 3653.737123]  Possible unsafe locking scenario:

[ 3653.737865]        CPU0                    CPU1
[ 3653.738435]        ----                    ----
[ 3653.739007]   lock(btrfs-chunk-00);
[ 3653.739449]                                lock(sb_internal#2);
[ 3653.740193]                                lock(btrfs-chunk-00);
[ 3653.740955]   lock(&disk->open_mutex);
[ 3653.741431]
                *** DEADLOCK ***

[ 3653.742176] 3 locks held by mount/447465:
[ 3653.742739]  #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0
[ 3653.744114]  #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs]
[ 3653.745563]  #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.747066]
               stack backtrace:
[ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1
[ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 3653.750592] Call Trace:
[ 3653.750967]  dump_stack_lvl+0x57/0x72
[ 3653.751526]  check_noncircular+0xf3/0x110
[ 3653.752136]  ? stack_trace_save+0x4b/0x70
[ 3653.752748]  __lock_acquire+0x130e/0x2210
[ 3653.753356]  lock_acquire+0xd7/0x310
[ 3653.753898]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.754596]  ? lock_is_held_type+0xe8/0x140
[ 3653.755125]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.755729]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.756338]  __mutex_lock+0x92/0x900
[ 3653.756794]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.757400]  ? do_raw_spin_unlock+0x4b/0xa0
[ 3653.757930]  ? _raw_spin_unlock+0x29/0x40
[ 3653.758437]  ? bd_prepare_to_claim+0x129/0x150
[ 3653.758999]  ? trace_module_get+0x2b/0xd0
[ 3653.759508]  ? try_module_get.part.0+0x50/0x80
[ 3653.760072]  blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.760661]  ? devcgroup_check_permission+0xc1/0x1f0
[ 3653.761288]  blkdev_get_by_path+0xb8/0xd0
[ 3653.761797]  btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
[ 3653.762454]  open_fs_devices+0xd7/0x2c0 [btrfs]
[ 3653.763055]  ? clone_fs_devices+0x8f/0x170 [btrfs]
[ 3653.763689]  btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
[ 3653.764370]  ? kvm_sched_clock_read+0x14/0x40
[ 3653.764922]  open_ctree+0xb8e/0x17bf [btrfs]
[ 3653.765493]  ? super_setup_bdi_name+0x79/0xd0
[ 3653.766043]  btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 3653.766780]  ? rcu_read_lock_sched_held+0x3f/0x80
[ 3653.767488]  ? kfree+0x1f2/0x3c0
[ 3653.767979]  legacy_get_tree+0x30/0x50
[ 3653.768548]  vfs_get_tree+0x28/0xc0
[ 3653.769076]  vfs_kern_mount.part.0+0x71/0xb0
[ 3653.769718]  btrfs_mount+0x11d/0x3a0 [btrfs]
[ 3653.770381]  ? rcu_read_lock_sched_held+0x3f/0x80
[ 3653.771086]  ? kfree+0x1f2/0x3c0
[ 3653.771574]  legacy_get_tree+0x30/0x50
[ 3653.772136]  vfs_get_tree+0x28/0xc0
[ 3653.772673]  path_mount+0x2d4/0xbe0
[ 3653.773201]  __x64_sys_mount+0x103/0x140
[ 3653.773793]  do_syscall_64+0x3b/0xc0
[ 3653.774333]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.775094] RIP: 0033:0x7f648bc45aaa

This happens because through btrfs_read_chunk_tree(), which is called only
during mount, ends up acquiring the mutex open_mutex of a block device
while holding a read lock on a leaf of the chunk tree while other paths
need to acquire other locks before locking extent buffers of the chunk
tree.

Since at mount time when we call btrfs_read_chunk_tree() we know that
we don't have other tasks running in parallel and modifying the chunk
tree, we can simply skip locking of chunk tree extent buffers. So do
that and move the assertion that checks the fs is not yet mounted to the
top block of btrfs_read_chunk_tree(), with a comment before doing it.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
3 years agobtrfs: fix memory ordering between normal and ordered work functions
Nikolay Borisov [Tue, 2 Nov 2021 12:49:16 +0000 (14:49 +0200)]
btrfs: fix memory ordering between normal and ordered work functions

Ordered work functions aren't guaranteed to be handled by the same thread
which executed the normal work functions. The only way execution between
normal/ordered functions is synchronized is via the WORK_DONE_BIT,
unfortunately the used bitops don't guarantee any ordering whatsoever.

This manifested as seemingly inexplicable crashes on ARM64, where
async_chunk::inode is seen as non-null in async_cow_submit which causes
submit_compressed_extents to be called and crash occurs because
async_chunk::inode suddenly became NULL. The call trace was similar to:

    pc : submit_compressed_extents+0x38/0x3d0
    lr : async_cow_submit+0x50/0xd0
    sp : ffff800015d4bc20

    <registers omitted for brevity>

    Call trace:
     submit_compressed_extents+0x38/0x3d0
     async_cow_submit+0x50/0xd0
     run_ordered_work+0xc8/0x280
     btrfs_work_helper+0x98/0x250
     process_one_work+0x1f0/0x4ac
     worker_thread+0x188/0x504
     kthread+0x110/0x114
     ret_from_fork+0x10/0x18

Fix this by adding respective barrier calls which ensure that all
accesses preceding setting of WORK_DONE_BIT are strictly ordered before
setting the flag. At the same time add a read barrier after reading of
WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads
would be strictly ordered after reading the bit. This in turn ensures
are all accesses before WORK_DONE_BIT are going to be strictly ordered
before any access that can occur in ordered_func.

Reported-by: Chris Murphy <[email protected]>
Fixes: 08a9ff326418 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue")
CC: [email protected] # 4.4+
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928
Reviewed-by: Josef Bacik <[email protected]>
Tested-by: Chris Murphy <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
3 years agobtrfs: fix a out-of-bound access in copy_compressed_data_to_page()
Qu Wenruo [Fri, 12 Nov 2021 04:47:30 +0000 (12:47 +0800)]
btrfs: fix a out-of-bound access in copy_compressed_data_to_page()

[BUG]
The following script can cause btrfs to crash:

  $ mount -o compress-force=lzo $DEV /mnt
  $ dd if=/dev/urandom of=/mnt/foo bs=4k count=1
  $ sync

The call trace looks like this:

  general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:__memcpy+0x12/0x20
  Call Trace:
   lzo_compress_pages+0x236/0x540 [btrfs]
   btrfs_compress_pages+0xaa/0xf0 [btrfs]
   compress_file_range+0x431/0x8e0 [btrfs]
   async_cow_start+0x12/0x30 [btrfs]
   btrfs_work_helper+0xf6/0x3e0 [btrfs]
   process_one_work+0x294/0x5d0
   worker_thread+0x55/0x3c0
   kthread+0x140/0x170
   ret_from_fork+0x22/0x30
  ---[ end trace 63c3c0f131e61982 ]---

[CAUSE]
In lzo_compress_pages(), parameter @out_pages is not only an output
parameter (for the number of compressed pages), but also an input
parameter, as the upper limit of compressed pages we can utilize.

In commit d4088803f511 ("btrfs: subpage: make lzo_compress_pages()
compatible"), the refactoring doesn't take @out_pages as an input, thus
completely ignoring the limit.

And for compress-force case, we could hit incompressible data that
compressed size would go beyond the page limit, and cause the above
crash.

[FIX]
Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(),
and check if we're beyond the limit before accessing the pages.

Note: this also fixes crash on 32bit architectures that was suspected to
be caused by merge of btrfs patches to 5.16-rc1. Reported in
https://lore.kernel.org/all/20211104115001[email protected]/ .

Reported-by: Omar Sandoval <[email protected]>
Fixes: d4088803f511 ("btrfs: subpage: make lzo_compress_pages() compatible")
Reviewed-by: Omar Sandoval <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ add note ]
Signed-off-by: David Sterba <[email protected]>
3 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
David S. Miller [Tue, 16 Nov 2021 13:27:32 +0000 (13:27 +0000)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-11-15

This series contains updates to iavf driver only.

Mateusz adds a wait for reset completion when changing queue count which
could otherwise cause issues with VF reset.

Nick adds a null check for vf_res in iavf_fix_features(), corrects
ordering of function calls to resolve dependency issues, and prevents
possible freeing of a lock which isn't being held.

Piotr fixes logic that did not allow setting all multicast mode without
promiscuous mode.

Jake prevents possible accidental freeing of filter structure.

Mitch adds null checks for key and indir parameters in iavf_get_rxfh().

Surabhi adds an additional check that would, previously, cause the driver
to print a false error due to values obtained while the VF is in reset.

Grzegorz prevents a queue request of 0 which would cause queue count to
reset to default values.

Akeem restores VLAN filters when bringing the interface back up.
====================

Signed-off-by: David S. Miller <[email protected]>
3 years agoKVM: x86: Fix uninitialized eoi_exit_bitmap usage in vcpu_load_eoi_exitmap()
黄乐 [Mon, 15 Nov 2021 14:08:29 +0000 (14:08 +0000)]
KVM: x86: Fix uninitialized eoi_exit_bitmap usage in vcpu_load_eoi_exitmap()

In vcpu_load_eoi_exitmap(), currently the eoi_exit_bitmap[4] array is
initialized only when Hyper-V context is available, in other path it is
just passed to kvm_x86_ops.load_eoi_exitmap() directly from on the stack,
which would cause unexpected interrupt delivery/handling issues, e.g. an
*old* linux kernel that relies on PIT to do clock calibration on KVM might
randomly fail to boot.

Fix it by passing ioapic_handled_vectors to load_eoi_exitmap() when Hyper-V
context is not available.

Fixes: f2bc14b69c38 ("KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context")
Cc: [email protected]
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Huang Le <[email protected]>
Message-Id: <62115b277dab49ea97da5633f8522daf@jd.com>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test
David Matlack [Thu, 11 Nov 2021 00:12:57 +0000 (00:12 +0000)]
KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test

Change memslot_modification_stress_test to use perf_test_destroy_vm
instead of manually calling ucall_uninit and kvm_vm_free.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
Message-Id: <20211111001257.1446428[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
3 years agoKVM: selftests: Wait for all vCPU to be created before entering guest mode
David Matlack [Thu, 11 Nov 2021 00:12:56 +0000 (00:12 +0000)]
KVM: selftests: Wait for all vCPU to be created before entering guest mode

Thread creation requires taking the mmap_sem in write mode, which causes
vCPU threads running in guest mode to block while they are populating
memory. Fix this by waiting for all vCPU threads to be created and start
running before entering guest mode on any one vCPU thread.

This substantially improves the "Populate memory time" when using 1GiB
pages since it allows all vCPUs to zero pages in parallel rather than
blocking because a writer is waiting (which is waiting for another vCPU
that is busy zeroing a 1GiB page).

Before:

  $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
  ...
  Populate memory time: 52.811184013s

After:

  $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
  ...
  Populate memory time: 10.204573342s

Signed-off-by: David Matlack <[email protected]>
Message-Id: <20211111001257.1446428[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
This page took 0.143257 seconds and 4 git commands to generate.