Git Repo - linux.git/log

octeontx2-af: Add new CGX_CMD to get PHY FEC statistics

This patch adds support to fetch fec stats from PHY. The stats are
put in the shared data struct fwdata. A PHY driver indicates
that it has FEC stats by setting the flag fwdata.phy.misc.has_fec_stats

Besides CGX_CMD_GET_PHY_FEC_STATS, also add CGX_CMD_PRBS and
CGX_CMD_DISPLAY_EYE to enum cgx_cmd_id so that Linux's enum list is in sync
with firmware's enum list.

Signed-off-by: Felix Manlunas <[email protected]>
Signed-off-by: Christina Jacob <[email protected]>
Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: Hariprasad Kelam <[email protected]>
Reviewed-by: Jesse Brandeburg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

octeontx2-af: forward error correction configuration

CGX block supports forward error correction modes baseR
and RS. This patch adds support to set encoding mode
and to read corrected/uncorrected block counters

Adds new mailbox handlers set_fec to configure encoding modes
and fec_stats to read counters and also increase mbox timeout
to accomdate firmware command response timeout.

Along with new CGX_CMD_SET_FEC command add other commands to
sync with kernel enum list with firmware.

Signed-off-by: Christina Jacob <[email protected]>
Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: Hariprasad Kelam <[email protected]>
Reviewed-by: Jesse Brandeburg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: octeontx2: Fix the confusion in buffer alloc failure path

Pavel pointed that the return of dma_addr_t in
otx2_alloc_rbuf/__otx2_alloc_rbuf() seem suspicious because a negative
error code may be returned in some cases. For a dma_addr_t, the error
code such as -ENOMEM does seem a valid value, so we can't judge if the
buffer allocation fail or not based on that value. Add a parameter for
otx2_alloc_rbuf/__otx2_alloc_rbuf() to store the dma address and make
the return value to indicate if the buffer allocation really fail or
not.

Reported-by: Pavel Machek <[email protected]>
Signed-off-by: Kevin Hao <[email protected]>
Tested-by: Subbaraya Sundeep <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Add-MBIM-over-MHI-support'

Signed-off-by: David S. Miller <[email protected]>

net: mhi: Add mbim proto

MBIM has initially been specified by USB-IF for transporting data (IP)
between a modem and a host over USB. However some modern modems also
support MBIM over PCIe (via MHI). In the same way as QMAP(rmnet), it
allows to aggregate IP packets and to perform context multiplexing.

This change adds minimal MBIM data transport support to MHI, allowing
to support MBIM only modems. MBIM being based on USB NCM, it reuses
and copy some helpers/functions from the USB stack (cdc-ncm, cdc-mbim).

Note that is a subset of the CDC-MBIM specification, supporting only
transport of network data (IP), there is no support for DSS. Moreover
the multi-session (for multi-pdn) is not supported in this initial
version, but will be added latter, and aligned with the cdc-mbim
solution (VLAN tags).

This code has been inspired from the mhi_mbim downstream implementation
(Carl Yin <[email protected]>).

Signed-off-by: Loic Poulain <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mhi: Add rx_length_errors stat

This can be used by proto when packet len is incorrect.

Signed-off-by: Loic Poulain <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mhi: Create mhi.h

Move mhi-net shared structures to mhi header, that will be used by
upcoming proto(s).

Signed-off-by: Loic Poulain <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mhi: Add dedicated folder

Create a dedicated mhi directory for mhi-net, mhi-net is going to
be split into differente files (for additional protocol support).

Signed-off-by: Loic Poulain <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mhi: Add protocol support

MHI can transport different protocols, some are handled at upper level,
like IP and QMAP(rmnet/netlink), but others will need to be inside MHI
net driver, like mbim. This change adds support for protocol rx and
tx_fixup callbacks registration, that can be used to encode/decode the
targeted protocol.

Signed-off-by: Loic Poulain <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: collect serial config version from register

Collect serial config version information directly from an internal
register, instead of explicitly resizing VPD.

v2:
- Add comments on info stored in PCIE_STATIC_SPARE2 register.

Signed-off-by: Rahul Lakkireddy <[email protected]>
Reviewed-by: Heiner Kallweit <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

i40e: remove the useless value assignment in i40e_clean_adminq_subtask

The variable ret is overwritten by the following call
i40e_clean_arq_element() and the assignment is useless, so remove it.

Reported-by: Tosk Robot <[email protected]>
Signed-off-by: Kaixu Xia <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

i40e: VLAN field for flow director

Allow user to specify VLAN field and add it to flow director. Show VLAN
field in "ethtool -n ethx" command.
Handle VLAN type and tag field provided by ethtool command. Refactored
filter addition, by replacing static arrays with runtime dummy packet
creation, which allows specifying VLAN field.
Previously, VLAN field was omitted.

Signed-off-by: Przemyslaw Patynowski <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

i40e: Add flow director support for IPv6

Flow director for IPv6 is not supported.
1) Implementation of support for IPv6 flow director.
2) Added handlers for addition of TCP6, UDP6, SCTP6, IPv6.
3) Refactored legacy code to make it more generic.
4) Added packet templates for TCP6, UDP6, SCTP6, IPv6.
5) Added handling of IPv6 source and destination address for flow director.
6) Improved argument passing for source and destination portin TCP6, UDP6
and SCTP6.
7) Added handling of ethtool -n for IPv6, TCP6,UDP6, SCTP6.
8) Used correct bit flag regarding FLEXOFF field of flow director data
descriptor.

Without this patch, there would be no support for flow director on IPv6,
TCP6, UDP6, SCTP6.
Tested based on x710 datasheet by using:
ethtool -N enp133s0f0 flow-type tcp4 src-port 13 dst-port 37 user-def 0x44142 action 1
ethtool -N enp133s0f0 flow-type tcp6 src-port 13 dst-port 40 user-def 0x44142 action 2
ethtool -N enp133s0f0 flow-type udp4 src-port 20 dst-port 40 user-def 0x44142 action 3
ethtool -N enp133s0f0 flow-type udp6 src-port 25 dst-port 40 user-def 0x44142 action 4
ethtool -N enp133s0f0 flow-type sctp4 src-port 55 dst-port 65 user-def 0x44142 action 5
ethtool -N enp133s0f0 flow-type sctp6 src-port 60 dst-port 40 user-def 0x44142 action 6
ethtool -N enp133s0f0 flow-type ip4 src-ip 1.1.1.1 dst-ip 1.1.1.4 user-def 0x44142 action 7
ethtool -N enp133s0f0 flow-type ip6 src-ip fe80::3efd:feff:fe6f:bbbb dst-ip fe80::3efd:feff:fe6f:aaaa user-def 0x44142 action 8
Then send traffic from client which matches the criteria provided to ethtool.
Observe that packets are redirected to user set queues with ethtool -S <interface>

Signed-off-by: Przemyslaw Patynowski <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

i40e: Add EEE status getting & setting implementation

Implement Energy Efficient Ethernet (EEE) status getting & setting.
The i40e_get_eee() requesting PHY EEE capabilities from firmware.
The i40e_set_eee() function requests PHY EEE capabilities
from firmware and sets PHY EEE advertising to full abilities or 0
depending whether EEE is to be enabled or disabled.

Signed-off-by: Aleksandr Loktionov <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

i40e: Add netlink callbacks support for software based DCB

Add callbacks used by software based LLDP agent, which allows to
configure DCB feature from userspace.

Update copyright dates as appropriate.

If LLDP agent is turned off in BIOS, or after setting private flag
("disable-fw-lldp on"). The driver initialized DCB functionality with
default values, one traffic class with 100% bandwidth allocated.

The new netlink callbacks are required for software LLDP agent, it
must be able to acquire current DCB configuration of a network port
and apply DCB configuration changes, if required.

Signed-off-by: Aleksandr Loktionov <[email protected]>
Signed-off-by: Arkadiusz Kubalewski <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

i40e: Add init and default config of software based DCB

Add extra handling on changing the "disable-fw-lldp" private
flag to properly initialize software based DCB feature.

Add default configuration of DCB functionality when Firmware
LLDP agent is turned off, in case of driver probe and device
reset on reconfiguration.

Update copyright dates as appropriate.

Software based DCB is a brand-new feature in i40e driver.
Before, DCB was implemented by Firmware LLDP agent only. The agent was
responsible for handling incoming DCB-related LLDP frames and
applying received DCB configuration to hardware.

Default configuration and new initialization flow for software based
DCB is required. If LLDP agent is turned off in BIOS, or after
setting private flag ("disable-fw-lldp on"). The driver initializes
DCB functionality with default values, one traffic class with 100%
bandwidth allocated.

Signed-off-by: Aleksandr Loktionov <[email protected]>
Signed-off-by: Arkadiusz Kubalewski <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

i40e: Add hardware configuration for software based DCB

Add registers and definitions required for applying
DCB related hardware configuration.

Add functions responsible for calculating and setting proper
hardware configuration values for software based DCB functionality.

Add function responsible for invoking Admin Queue command, which
results in applying new DCB configuration to the hardware.

Update copyright dates as appropriate.

Software based DCB is a brand-new feature in i40e driver.
Before, DCB was implemented by Firmware LLDP agent only. The agent was
responsible for handling incoming DCB-related LLDP frames and
applying received DCB configuration to hardware.

New communication channel between software and hardware is required
for software driver. It must be able to calculate and configure all
the registers related for DCB feature.

Signed-off-by: Aleksandr Loktionov <[email protected]>
Signed-off-by: Arkadiusz Kubalewski <[email protected]>
Tested-by: Tony Brelinski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Merge tag 'pm-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"Address a performance regression related to scale-invariance on x86
  that may prevent turbo CPU frequencies from being used in certain
  workloads on systems using acpi-cpufreq as the CPU performance scaling
  driver and schedutil as the scaling governor"

* tag 'pm-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
  cpufreq: ACPI: Extend frequency tables to cover boost frequencies

Merge tag 'acpi-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"Revert a problematic ACPICA commit that changed the code to attempt to
  update memory regions which may be read-only on some systems (Ard
  Biesheuvel)"

* tag 'acpi-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "ACPICA: Interpreter: fix memory leak by using existing buffer"

Merge tag 'dmaengine-fix2-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine

Pull dmaengine fixes from Vinod Koul:
"Some late fixes for dmaengine:

  Core:
   - fix channel device_node deletion

  Driver fixes:
   - dw: revert of runtime pm enabling
   - idxd: device state fix, interrupt completion and list corruption
   - ti: resource leak

* tag 'dmaengine-fix2-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
  dmaengine dw: Revert "dmaengine: dw: Enable runtime PM"
  dmaengine: idxd: check device state before issue command
  dmaengine: ti: k3-udma: Fix a resource leak in an error handling path
  dmaengine: move channel device_node deletion to driver
  dmaengine: idxd: fix misc interrupt completion
  dmaengine: idxd: Fix list corruption in description completion

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from David Miller:
"Another pile of networing fixes:

   1) ath9k build error fix from Arnd Bergmann

   2) dma memory leak fix in mediatec driver from Lorenzo Bianconi.

   3) bpf int3 kprobe fix from Alexei Starovoitov.

   4) bpf stackmap integer overflow fix from Bui Quang Minh.

   5) Add usb device ids for Cinterion MV31 to qmi_qwwan driver, from
      Christoph Schemmel.

   6) Don't update deleted entry in xt_recent netfilter module, from
      Jazsef Kadlecsik.

   7) Use after free in nftables, fix from Pablo Neira Ayuso.

   8) Header checksum fix in flowtable from Sven Auhagen.

   9) Validate user controlled length in qrtr code, from Sabyrzhan
      Tasbolatov.

  10) Fix race in xen/netback, from Juergen Gross,

  11) New device ID in cxgb4, from Raju Rangoju.

  12) Fix ring locking in rxrpc release call, from David Howells.

  13) Don't return LAPB error codes from x25_open(), from Xie He.

  14) Missing error returns in gsi_channel_setup() from Alex Elder.

  15) Get skb_copy_and_csum_datagram working properly with odd segment
      sizes, from Willem de Bruijn.

  16) Missing RFS/RSS table init in enetc driver, from Vladimir Oltean.

  17) Do teardown on probe failure in DSA, from Vladimir Oltean.

  18) Fix compilation failures of txtimestamp selftest, from Vadim
      Fedorenko.

  19) Limit rx per-napi gro queue size to fix latency regression, from
      Eric Dumazet.

  20) dpaa_eth xdp fixes from Camelia Groza.

  21) Missing txq mode update when switching CBS off, in stmmac driver,
      from Mohammad Athari Bin Ismail.

  22) Failover pending logic fix in ibmvnic driver, from Sukadev
      Bhattiprolu.

  23) Null deref fix in vmw_vsock, from Norbert Slusarek.

  24) Missing verdict update in xdp paths of ena driver, from Shay
      Agroskin.

  25) seq_file iteration fix in sctp from Neil Brown.

  26) bpf 32-bit src register truncation fix on div/mod, from Daniel
      Borkmann.

  27) Fix jmp32 pruning in bpf verifier, from Daniel Borkmann.

  28) Fix locking in vsock_shutdown(), from Stefano Garzarella.

  29) Various missing index bound checks in hns3 driver, from Yufeng Mo.

  30) Flush ports on .phylink_mac_link_down() in dsa felix driver, from
      Vladimir Oltean.

  31) Don't mix up stp and mrp port states in bridge layer, from Horatiu
      Vultur.

  32) Fix locking during netif_tx_disable(), from Edwin Peer"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  bpf: Fix 32 bit src register truncation on div/mod
  bpf: Fix verifier jmp32 pruning decision logic
  bpf: Fix verifier jsgt branch analysis on max bound
  vsock: fix locking in vsock_shutdown()
  net: hns3: add a check for index in hclge_get_rss_key()
  net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
  net: hns3: add a check for queue_id in hclge_reset_vf_queue()
  net: dsa: felix: implement port flushing on .phylink_mac_link_down
  switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
  bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
  net: watchdog: hold device global xmit lock during tx disable
  netfilter: nftables: relax check for stateful expressions in set definition
  netfilter: conntrack: skip identical origin tuple in same zone only
  vsock/virtio: update credit only if socket is not closed
  net: fix iteration for sctp transport seq_files
  net: ena: Update XDP verdict upon failure
  net/vmw_vsock: improve locking in vsock_connect_timeout()
  net/vmw_vsock: fix NULL pointer dereference
  ibmvnic: Clear failover_pending if unable to schedule
  net: stmmac: set TxQ mode back to DCB after disabling CBS
  ...

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"14 patches.

  Subsystems affected by this patch series: mm (kasan, mremap, tmpfs,
  selftests, memcg, and slub), MAINTAINERS, squashfs, nilfs2, and
  firmware"

* emailed patches from Andrew Morton <[email protected]>:
  nilfs2: make splice write available again
  mm, slub: better heuristic for number of cpus when calculating slab order
  Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
  MAINTAINERS: update Andrey Ryabinin's email address
  selftests/vm: rename file run_vmtests to run_vmtests.sh
  tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
  tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
  mm/mremap: fix BUILD_BUG_ON() error in get_extent
  firmware_loader: align .builtin_fw to 8
  kasan: fix stack traces dependency for HW_TAGS
  squashfs: add more sanity checks in xattr id lookup
  squashfs: add more sanity checks in inode lookup
  squashfs: add more sanity checks in id lookup
  squashfs: avoid out of bounds writes in decompressors

nilfs2: make splice write available again

Since 5.10, splice() or sendfile() to NILFS2 return EINVAL. This was
caused by commit 36e2c7421f02 ("fs: don't allow splice read/write
without explicit ops").

This patch initializes the splice_write field in file_operations, like
most file systems do, to restore the functionality.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joachim Henke <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Tested-by: Ryusuke Konishi <[email protected]>
Cc: <[email protected]> [5.10+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slub: better heuristic for number of cpus when calculating slab order

When creating a new kmem cache, SLUB determines how large the slab pages
will based on number of inputs, including the number of CPUs in the
system.  Larger slab pages mean that more objects can be allocated/free
from per-cpu slabs before accessing shared structures, but also
potentially more memory can be wasted due to low slab usage and
fragmentation.  The rough idea of using number of CPUs is that larger
systems will be more likely to benefit from reduced contention, and also
should have enough memory to spare.

Number of CPUs used to be determined as nr_cpu_ids, which is number of
possible cpus, but on some systems many will never be onlined, thus
commit 045ab8c9487b ("mm/slub: let number of online CPUs determine the
slub page order") changed it to nr_online_cpus().  However, for kmem
caches created early before CPUs are onlined, this may lead to
permamently low slab page sizes.

Vincent reports a regression [1] of hackbench on arm64 systems:

  "I'm facing significant performances regression on a large arm64
   server system (224 CPUs). Regressions is also present on small arm64
   system (8 CPUs) but in a far smaller order of magnitude

   On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
   v5.11-rc4 : 9.135sec (+/- 0.45%)
   v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
   v5.10: 3.136sec (+/- 0.40%)"

Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
page allocator contention:

  "i.e. the patch incurs a 7% to 32% performance penalty. This bisected
   cleanly yesterday when I was looking for the regression and then
   found the thread.

   Numerous caches change size. For example, kmalloc-512 goes from
   order-0 (vanilla) to order-2 with the revert.

   So mostly this is down to the number of times SLUB calls into the
   page allocator which only caches order-0 pages on a per-cpu basis"

Clearly num_online_cpus() doesn't work too early in bootup.  We could
change the order dynamically in a memory hotplug callback, but runtime
order changing for existing kmem caches has been already shown as
dangerous, and removed in 32a6f409b693 ("mm, slub: remove runtime
allocation order changes").

It could be resurrected in a safe manner with some effort, but to fix
the regression we need something simpler.

We could use num_present_cpus() that should be the number of physically
present CPUs even before they are onlined.  That would work for PowerPC
[3], which triggered the original commit, but that still doesn't work on
arm64 [4] as explained in [5].

So this patch tries to determine the best available value without
specific arch knowledge.

- num_present_cpus() if the number is larger than 1, as that means the
   arch is likely setting it properly

- nr_cpu_ids otherwise

This should fix the reported regressions while also keeping the effect
of 045ab8c9487b for PowerPC systems.  It's possible there are
configurations where num_present_cpus() is 1 during boot while
nr_cpu_ids is at the same time bloated, so these (if they exist) would
keep the large orders based on nr_cpu_ids as was before 045ab8c9487b.

[1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20210128134512 [email protected]/
[3] https://lore.kernel.org/linux-mm/20210123051607 [email protected]/
[4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
[5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Vincent Guittot <[email protected]>
Reported-by: Mel Gorman <[email protected]>
Tested-by: Mel Gorman <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iwlwifi:mvm: Add support for version 2 of the LARI_CONFIG_CHANGE command.

Add support for version 2 of the LARI_CONFIG_CHANGE command.
this is needed to support UHB enable/disable from BIOS

Signed-off-by: Miri Korenblit <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.8a0c951bfdea.I850f29d3ff3931388447bda635dfbc742ea1df61@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: don't crash when rx queues aren't allocated in interrupt

WARNING is better than crashing. Since this happened to me,
be on the safe side.

Signed-off-by: Emmanuel Grumbach <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.d4651427fcda.I1bcecb73676d039e2521309c07fc6b6314a90546@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: correction of group-id once sending REPLY_ERROR

Once sending the REPLY_ERROR group ID is not set and this lead to
get it set to wrong value LONG_GROUP later in default handling

Fix this by checking the REPLY_ERROR and avoid changing the Group ID

Signed-off-by: Mukesh Sisodiya <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.82578caaea84.I0ca9cfdd4e656d2e88ee7696dd6baf4267e7cb52@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: add AX201 and AX211 radio modules for Ma devices

Add support for AX201 and AX211 radio modules, which we call HR2 and
GF, respectively. These modules can be used with the Ma family of
devices and above.

Signed-off-by: Matti Gottlieb <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.f8e3080ce633.I7377b421b031796730daf809c4024a3c3ef95fa8@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: add CDB bit to the device configuration parsing

Some new devices contain an extra bit in the CRF ID register to denote
that they support CDB. Add definitions and macros to be able to
support it and add the "NO_CDB" to all existing entired.

Signed-off-by: Matti Gottlieb <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.7b40184d9899.I3bb2cf9b9afb0457583f786dc52d4d1b1ad75ffc@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: acpi: don't return valid pointer as an ERR_PTR

iwl_acpi_get_wifi_pkg() may return a valid pointer (meaning success),
while `tbl_rev` is invalid (equel to 1).
In this case, we will treat that as an error.
Subsequent "users" of this "error code" may either check for nonzero
(good; pointers are never zero) or negative
(bad; pointers may be "positive") fix that by splitting the if statement.
First check if IS_ERR(wifi_pkg) and then if tbl_rev != 0.

Signed-off-by: Haim Dreyfuss <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.1c8c4b58c932.I147373f6fd364606b0282af8d402c722eb917225@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: queue: add fake tx time point

In case we get TX sequence number out of range, trigger fake tx time
point to collect FW debug data.

Signed-off-by: Mordechay Goodstein <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.e098026e83ad.I8870fcbc504a74cab6a50134b3df1131d6da946d@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: remove flags argument for nic_access

Since we no longer save interrupts, we no longer need the flags
argument here, remove it throughout.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.8de8fe6f9fff.If040b056d0e8c771c65ac5c29230f939354a142b@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: declare support for triggered SU/MU beamforming feedback

The NIC supports this, so set the relevant bits in the HE PHY
capabilities.

Signed-off-by: Naftali Goldstein <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.24212c1aac90.I82f6c1bdb9fe351ce46e8cc8ec6da221908dec45@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: dbg: add op_mode callback for collecting debug data.

The first use is collecting debug data when transport stops the device.

Signed-off-by: Mordechay Goodstein <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.d282d0a9ee7b.I9a0ad29f80daba8956a6aa077ba865e19b2150be@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: api: clean up some documentation/bits

Clean up some documentation references and some bits in the enums
to make the documentation more useful.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.941d963ceb88.I72a89c0161d7beab99bc3a90707796c2a63e4197@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: dbg: remove unsupported regions

In case user requested to register an unsupported regions,
remove it from active list and trigger list, this saves operational
driver memory and run time at collecting debug data.

Signed-off-by: Mordechay Goodstein <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210142629.a0cc944040e8.I3ae37547452b39f8040428c21ed47bdc67ae8f71@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: Change Ma device ID

The Ma device ID needs to be 0x7E40 instead of 0x7E80.

Signed-off-by: Matti Gottlieb <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.a97272169e3f.Ic4acfb3f7b4e9d7b49c9c0b9a31c9a305d4d9fcc@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: when HW has rate offload don't look at control field

Control field is set by mac80211 only if case rate is not offloaded to
hw.

Signed-off-by: Mordechay Goodstein <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.f845c4387eed.I30c4d26698bae1f5f8c396da80a545baa145e2ad@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: NULLify pointers after free

Remember that those pointers have been freed by setting them
to NULL. Otherwise, we'd keep rxq pointing to random memory
which would prevent us from trying to re-allocate the Rx
resources if we call rx_alloc again.

Also, propagate the allocation failure to the caller of
iwl_pcie_nic_init so that we won't go further in the
start flow.

Signed-off-by: Emmanuel Grumbach <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.996b400d2f1c.I630379c504644700322f57b259383ae0af8d1975@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: assign SAR table revision to the command later

The call to iwl_sar_geo_init() was moved to the end of the
iwl_mvm_sar_geo_init() function, after the table revision is assigned
to the FW command. But the revision is only known after
iwl_sar_geo_init() is called, so we were always assigning zero to it.

Fix that by moving the assignment code after the iwl_sar_geo_init()
function is called.

Signed-off-by: Luca Coelho <[email protected]>
Fixes: 45acebf8d6a6 ("iwlwifi: fix sar geo table initialization")
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.cef55ef3a065.If96c60f08d24c2262c287168a6f0dbd7cf0f8f5c@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: remove useless iwl_mvm_resume_d3() function

This is called exactly once, a few lines down, so there's
no point in having the extra function.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.1ef80bf3008c.I0b5349530182b5616a4149dd596f95aa54ea724c@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: enhance a print in CSA flows

Add the count and the mode to the modify CSA flow.

Signed-off-by: Emmanuel Grumbach <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.361bc0f024ef.I904f269858b3123b7d6532f049c7f92b63fb8807@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: send stored PPAG command instead of local

Some change conflicts apparently cause a confusion between a local
variable being used to send the PPAG command and the introduction of a
union for this command. Most parts of the local command were never
copied from the stored data, so the FW was getting garbage in the
tables instead of getting valid values.

Fix this by completely removing the local and using only the union
that we have stored in fwrt.

Signed-off-by: Luca Coelho <[email protected]>
Fixes: f2134f66f40e ("iwlwifi: acpi: support ppag table command v2")
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.d090e0301023.I7d57f4d7da9a3297734c51cf988199323c76916d@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: store PPAG enabled/disabled flag properly

When reading the PPAG table from ACPI, we should store everything in
our fwrt structure, so it can be accessed later. But we had a local
ppag_table variable in the function and were erroneously storing the
enabled/disabled flag in it instead of storing it in the fwrt. Fix
this by removing the local variable and storing everything directly in
fwrt.

Signed-off-by: Luca Coelho <[email protected]>
Fixes: f2134f66f40e ("iwlwifi: acpi: support ppag table command v2")
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.889862e6d393.I8b894c1b2b3fe0ad2fb39bf438273ea47eb5afa4@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: fix the type we use in the PPAG table validity checks

The value we receive from ACPI is a long long unsigned integer but the
values should be treated as signed char. When comparing the received
value with ACPI_PPAG_MIN_LB/HB, we were doing an unsigned comparison,
so the negative value would actually be treated as a very high number.

To solve this issue, assign the value to our table of s8's before
making the comparison, so the value is already converted when we do
so.

Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.b0ec69f312bc.If77fd9c61a96aa7ef2ac96d935b7efd7df502399@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: acpi: fix PPAG table sizes

We were erroneously adding 3 extra values to the table size
calculation, when we should actually add only a 2 (one for the domain
type and one for the enabled/disabled flag). Fix this for both
revisions we support.

Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.9d037b8f5098.I3c88af130d9e270517c8bac8eb02e11f817fe959@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: don't disable interrupts for reg_lock

The only thing we do touching the device in hard interrupt context
is, at most, writing an interrupt ACK register, which isn't racing
in with anything protected by the reg_lock.

Thus, avoid disabling interrupts here for potentially long periods
of time, particularly long periods have been observed with dumping
of firmware memory (leading to lockup warnings on some devices.)

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.da916ab91298.I064c3e7823b616647293ed97da98edefb9ce9435@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: add a few missing entries for So with Hr

Some devices were missing from the So with Hr section.

Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210210135352.71da7ce27261.I0d96fe7b799527c49f1270ddf9acdb152bdd4841@changeid
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: dbg: Mark ucode tlv data as const

The ucode TLV data may be read-only and should be treated as const
pointers, but currently a few code forcibly cast to the writable
pointer unnecessarily. This gave developers a wrong impression as if
it can be modified, resulting in crashing regressions already a couple
of times.

This patch adds the const prefix to those cast pointers, so that such
attempt can be caught more easily in future.

Signed-off-by: Takashi Iwai <[email protected]>
Acked-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: add new cards for So and Qu family

add few PCI ID'S for So with Hr and Qu with Hr in AX family.

Cc: [email protected]
Signed-off-by: Ihab Zhaika <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Link: https://lore.kernel.org/r/iwlwifi.20210206130110.6f0c1849f7dc.I647b4d22f9468c2f34b777a4bfa445912c6f04f0@changeid

ath11k: fix a locking bug in ath11k_mac_op_start()

This error path leads to a Smatch warning:

drivers/net/wireless/ath/ath11k/mac.c:4269 ath11k_mac_op_start()
error: double unlocked '&ar->conf_mutex' (orig line 4251)

We're not holding the lock when we do the "goto err;" so it leads to a
double unlock. The fix is to hold the lock for a little longer.

Fixes: c83c500b55b6 ("ath11k: enable idle power save mode")
Signed-off-by: Dan Carpenter <[email protected]>
[[email protected]: move also rcu_assign_pointer() call]
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/YBk4GoeE+yc0wlJH@mwanda

rtlwifi: rtl8821ae: phy: Simplify bool comparison

Fix the following coccicheck warning:

./drivers/net/wireless/realtek/rtlwifi/rtl8821ae/phy.c:3853:7-17:
WARNING: Comparison of 0/1 to bool variable.

Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Jiapeng Chong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/1612840381-109714-1-git-send-email-jiapeng.chong@linux.alibaba.com

rtlwifi: rtl8192se: Simplify bool comparison

Fix the follow coccicheck warnings:

./drivers/net/wireless/realtek/rtlwifi/rtl8192se/hw.c:2305:6-27:
WARNING: Comparison of 0/1 to bool variable.

./drivers/net/wireless/realtek/rtlwifi/rtl8192se/hw.c:1376:5-26:
WARNING: Comparison of 0/1 to bool variable.

Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Jiapeng Chong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/1612839264-85773-1-git-send-email-jiapeng.chong@linux.alibaba.com

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2021-02-10

The following pull-request contains BPF updates for your *net* tree.

We've added 5 non-merge commits during the last 8 day(s) which contain
a total of 3 files changed, 22 insertions(+), 21 deletions(-).

The main changes are:

1) Fix missed execution of kprobes BPF progs when kprobe is firing via
   int3, from Alexei Starovoitov.

2) Fix potential integer overflow in map max_entries for stackmap on
   32 bit archs, from Bui Quang Minh.

3) Fix a verifier pruning and a insn rewrite issue related to 32 bit ops,
   from Daniel Borkmann.
====================

Signed-off-by: David S. Miller <[email protected]>
c# Please enter a commit message to explain why this merge is necessary,

Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"

This reverts commit 536d3bf261a2fc3b05b3e91e7eef7383443015cf, as it can
cause writers to memory.high to get stuck in the kernel forever,
performing page reclaim and consuming excessive amounts of CPU cycles.

Before the patch, a write to memory.high would first put the new limit
in place for the workload, and then reclaim the requested delta.  After
the patch, the kernel tries to reclaim the delta before putting the new
limit into place, in order to not overwhelm the workload with a sudden,
large excess over the limit.  However, if reclaim is actively racing
with new allocations from the uncurbed workload, it can keep the write()
working inside the kernel indefinitely.

This is causing problems in Facebook production.  A privileged
system-level daemon that adjusts memory.high for various workloads
running on a host can get unexpectedly stuck in the kernel and
essentially turn into a sort of involuntary kswapd for one of the
workloads.  We've observed that daemon busy-spin in a write() for
minutes at a time, neglecting its other duties on the system, and
expending privileged system resources on behalf of a workload.

To remedy this, we have first considered changing the reclaim logic to
break out after a couple of loops - whether the workload has converged
to the new limit or not - and bound the write() call this way.  However,
the root cause that inspired the sequence change in the first place has
been fixed through other means, and so a revert back to the proven
limit-setting sequence, also used by memory.max, is preferable.

The sequence was changed to avoid extreme latencies in the workload when
the limit was lowered: the sudden, large excess created by the limit
lowering would erroneously trigger the penalty sleeping code that is
meant to throttle excessive growth from below.  Allocating threads could
end up sleeping long after the write() had already reclaimed the delta
for which they were being punished.

However, erroneous throttling also caused problems in other scenarios at
around the same time.  This resulted in commit b3ff92916af3 ("mm, memcg:
reclaim more aggressively before high allocator throttling"), included
in the same release as the offending commit.  When allocating threads
now encounter large excess caused by a racing write() to memory.high,
instead of entering punitive sleeps, they will simply be tasked with
helping reclaim down the excess, and will be held no longer than it
takes to accomplish that.  This is in line with regular limit
enforcement - i.e.  if the workload allocates up against or over an
otherwise unchanged limit from below.

With the patch breaking userspace, and the root cause addressed by other
means already, revert it again.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Tejun Heo <[email protected]>
Acked-by: Chris Down <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: <[email protected]> [5.8+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: update Andrey Ryabinin's email address

Update my email, @virtuozzo.com will stop working shortly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests/vm: rename file run_vmtests to run_vmtests.sh

Commit c2aa8afc36fa has renamed run_vmtests in Makefile, but the file
still uses the old name.

The kernel test robot reported the following issue:

  # selftests: vm: run_vmtests.sh
  # Warning: file run_vmtests.sh is missing!
  not ok 1 selftests: vm: run_vmtests.sh

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c2aa8afc36fa (selftests/vm: rename run_vmtests --> run_vmtests.sh)
Signed-off-by: Rong Chen <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha

As with s390, alpha is a 64-bit architecture with a 32-bit ino_t.  With
CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and
display "inode64" in the mount options, whereas passing "inode64" in the
mount options will fail.  This leads to erroneous behaviours such as
this:

  # mkdir mnt
  # mount -t tmpfs nodev mnt
  # mount -o remount,rw mnt
  mount: /home/ubuntu/mnt: mount point not mounted or bad option.

Prevent CONFIG_TMPFS_INODE64 from being selected on alpha.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: <[email protected]> [5.9+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tmpfs: disallow CONFIG_TMPFS_INODE64 on s390

Currently there is an assumption in tmpfs that 64-bit architectures also
have a 64-bit ino_t.  This is not true on s390 which has a 32-bit ino_t.
With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers
and display "inode64" in the mount options, but passing the "inode64"
mount option will fail.  This leads to the following behavior:

  # mkdir mnt
  # mount -t tmpfs nodev mnt
  # mount -o remount,rw mnt
  mount: /home/ubuntu/mnt: mount point not mounted or bad option.

As mount sees "inode64" in the mount options and thus passes it in the
options for the remount.

So prevent CONFIG_TMPFS_INODE64 from being selected on s390.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: <[email protected]> [5.9+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mremap: fix BUILD_BUG_ON() error in get_extent

clang can't evaluate this function argument at compile time when the
function is not inlined, which leads to a link time failure:

  ld.lld: error: undefined symbol: __compiletime_assert_414
  >>> referenced by mremap.c
  >>>               mremap.o:(get_extent) in archive mm/built-in.a

Mark the function as __always_inline to avoid it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9ad9718bfa41 ("mm/mremap: calculate extent in one place")
Signed-off-by: Arnd Bergmann <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Cc: Kirill A. Shutemov" <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Brian Geffon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

firmware_loader: align .builtin_fw to 8

arm64 references the start address of .builtin_fw (__start_builtin_fw)
with a pair of R_AARCH64_ADR_PREL_PG_HI21/R_AARCH64_LDST64_ABS_LO12_NC
relocations. The compiler is allowed to emit the
R_AARCH64_LDST64_ABS_LO12_NC relocation because struct builtin_fw in
include/linux/firmware.h is 8-byte aligned.

The R_AARCH64_LDST64_ABS_LO12_NC relocation requires the address to be a
multiple of 8, which may not be the case if .builtin_fw is empty.
Unconditionally align .builtin_fw to fix the linker error. 32-bit
architectures could use ALIGN(4) but that would add unnecessary
complexity, so just use ALIGN(8).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://github.com/ClangBuiltLinux/linux/issues/1204
Fixes: 5658c76 ("firmware: allow firmware files to be built into kernel image")
Signed-off-by: Fangrui Song <[email protected]>
Reported-by: kernel test robot <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Douglas Anderson <[email protected]>
Acked-by: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: fix stack traces dependency for HW_TAGS

Currently, whether the alloc/free stack traces collection is enabled by
default for hardware tag-based KASAN depends on CONFIG_DEBUG_KERNEL.
The intention for this dependency was to only enable collection on slow
debug kernels due to a significant perf and memory impact.

As it turns out, CONFIG_DEBUG_KERNEL is not considered a debug option
and is enabled on many productions kernels including Android and Ubuntu.
As the result, this dependency is pointless and only complicates the
code and documentation.

Having stack traces collection disabled by default would make the
hardware mode work differently to to the software ones, which is
confusing.

This change removes the dependency and enables stack traces collection
by default.

Looking into the future, this default might makes sense for production
kernels, assuming we implement a fast stack trace collection approach.

Link: https://lkml.kernel.org/r/6678d77ceffb71f1cff2cf61560e2ffe7bb6bfe9.1612808820.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

squashfs: add more sanity checks in xattr id lookup

Sysbot has reported a warning where a kmalloc() attempt exceeds the
maximum limit.  This has been identified as corruption of the xattr_ids
count when reading the xattr id lookup table.

This patch adds a number of additional sanity checks to detect this
corruption and others.

1. It checks for a corrupted xattr index read from the inode.  This could
   be because the metadata block is uncompressed, or because the
   "compression" bit has been corrupted (turning a compressed block
   into an uncompressed block).  This would cause an out of bounds read.

2. It checks against corruption of the xattr_ids count.  This can either
   lead to the above kmalloc failure, or a smaller than expected
   table to be read.

3. It checks the contents of the index table for corruption.

[[email protected]: fix checkpatch issue]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Phillip Lougher <[email protected]>
Reported-by: [email protected]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

squashfs: add more sanity checks in inode lookup

Sysbot has reported an "slab-out-of-bounds read" error which has been
identified as being caused by a corrupted "ino_num" value read from the
inode.  This could be because the metadata block is uncompressed, or
because the "compression" bit has been corrupted (turning a compressed
block into an uncompressed block).

This patch adds additional sanity checks to detect this, and the
following corruption.

1. It checks against corruption of the inodes count.  This can either
   lead to a larger table to be read, or a smaller than expected
   table to be read.

   In the case of a too large inodes count, this would often have been
   trapped by the existing sanity checks, but this patch introduces
   a more exact check, which can identify too small values.

2. It checks the contents of the index table for corruption.

[[email protected]: fix checkpatch issue]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Phillip Lougher <[email protected]>
Reported-by: [email protected]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

squashfs: add more sanity checks in id lookup

Sysbot has reported a number of "slab-out-of-bounds reads" and
"use-after-free read" errors which has been identified as being caused
by a corrupted index value read from the inode.  This could be because
the metadata block is uncompressed, or because the "compression" bit has
been corrupted (turning a compressed block into an uncompressed block).

This patch adds additional sanity checks to detect this, and the
following corruption.

1. It checks against corruption of the ids count.  This can either
   lead to a larger table to be read, or a smaller than expected
   table to be read.

   In the case of a too large ids count, this would often have been
   trapped by the existing sanity checks, but this patch introduces
   a more exact check, which can identify too small values.

2. It checks the contents of the index table for corruption.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Phillip Lougher <[email protected]>
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

squashfs: avoid out of bounds writes in decompressors

Patch series "Squashfs: fix BIO migration regression and add sanity checks".

Patch [1/4] fixes a regression introduced by the "migrate from
ll_rw_block usage to BIO" patch, which has produced a number of
Sysbot/Syzkaller reports.

Patches [2/4], [3/4], and [4/4] fix a number of filesystem corruption
issues which have produced Sysbot reports in the id, inode and xattr
lookup code.

Each patch has been tested against the Sysbot reproducers using the
given kernel configuration.  They have the appropriate "Reported-by:"
lines added.

Additionally, all of the reproducer filesystems are indirectly fixed by
patch [4/4] due to the fact they all have xattr corruption which is now
detected there.

Additional testing with other configurations and architectures (32bit,
big endian), and normal filesystems has also been done to trap any
inadvertent regressions caused by the additional sanity checks.

This patch (of 4):

This is a regression introduced by the patch "migrate from ll_rw_block
usage to BIO".

Sysbot/Syskaller has reported a number of "out of bounds writes" and
"unable to handle kernel paging request in squashfs_decompress" errors
which have been identified as a regression introduced by the above
patch.

Specifically, the patch removed the following sanity check

        if (length < 0 || length > output->length ||
(index + length) > msblk->bytes_used)

This check did two things:

1. It ensured any reads were not beyond the end of the filesystem

2. It ensured that the "length" field read from the filesystem
   was within the expected maximum length.  Without this any
   corrupted values can over-run allocated buffers.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 93e72b3c612adc ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: [email protected]
Signed-off-by: Phillip Lougher <[email protected]>
Cc: Philippe Liard <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'i3c/fixes-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux

Pull i3c fix from Alexandre Belloni:
"A single build warning fix"

* tag 'i3c/fixes-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
i3c/master/mipi-i3c-hci: Fix position of __maybe_unused in i3c_hci_of_match

bpf: Fix 32 bit src register truncation on div/mod

While reviewing a different fix, John and I noticed an oddity in one of the
BPF program dumps that stood out, for example:

  # bpftool p d x i 13
   0: (b7) r0 = 808464450
   1: (b4) w4 = 808464432
   2: (bc) w0 = w0
   3: (15) if r0 == 0x0 goto pc+1
   4: (9c) w4 %= w0
  [...]

In line 2 we noticed that the mov32 would 32 bit truncate the original src
register for the div/mod operation. While for the two operations the dst
register is typically marked unknown e.g. from adjust_scalar_min_max_vals()
the src register is not, and thus verifier keeps tracking original bounds,
simplified:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = -1
  1: R0_w=invP-1 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (b7) r1 = -1
  2: R0_w=invP-1 R1_w=invP-1 R10=fp0
  2: (3c) w0 /= w1
  3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP-1 R10=fp0
  3: (77) r1 >>= 32
  4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP4294967295 R10=fp0
  4: (bf) r0 = r1
  5: R0_w=invP4294967295 R1_w=invP4294967295 R10=fp0
  5: (95) exit
  processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

Runtime result of r0 at exit is 0 instead of expected -1. Remove the
verifier mov32 src rewrite in div/mod and replace it with a jmp32 test
instead. After the fix, we result in the following code generation when
having dividend r1 and divisor r6:

  div, 64 bit:                             div, 32 bit:

   0: (b7) r6 = 8                           0: (b7) r6 = 8
   1: (b7) r1 = 8                           1: (b7) r1 = 8
   2: (55) if r6 != 0x0 goto pc+2           2: (56) if w6 != 0x0 goto pc+2
   3: (ac) w1 ^= w1                         3: (ac) w1 ^= w1
   4: (05) goto pc+1                        4: (05) goto pc+1
   5: (3f) r1 /= r6                         5: (3c) w1 /= w6
   6: (b7) r0 = 0                           6: (b7) r0 = 0
   7: (95) exit                             7: (95) exit

  mod, 64 bit:                             mod, 32 bit:

   0: (b7) r6 = 8                           0: (b7) r6 = 8
   1: (b7) r1 = 8                           1: (b7) r1 = 8
   2: (15) if r6 == 0x0 goto pc+1           2: (16) if w6 == 0x0 goto pc+1
   3: (9f) r1 %= r6                         3: (9c) w1 %= w6
   4: (b7) r0 = 0                           4: (b7) r0 = 0
   5: (95) exit                             5: (95) exit

x86 in particular can throw a 'divide error' exception for div
instruction not only for divisor being zero, but also for the case
when the quotient is too large for the designated register. For the
edx:eax and rdx:rax dividend pair it is not an issue in x86 BPF JIT
since we always zero edx (rdx). Hence really the only protection
needed is against divisor being zero.

Fixes: 68fda450a7df ("bpf: fix 32-bit divide by zero")
Co-developed-by: John Fastabend <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

bpf: Fix verifier jmp32 pruning decision logic

Anatoly has been fuzzing with kBdysch harness and reported a hang in
one of the outcomes:

  func#0 @0
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = 808464450
  1: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (b4) w4 = 808464432
  2: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP808464432 R10=fp0
  2: (9c) w4 %= w0
  3: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
  3: (66) if w4 s> 0x30303030 goto pc+0
   R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
  4: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
  4: (7f) r0 >>= r0
  5: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
  5: (9c) w4 %= w0
  6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  6: (66) if w0 s> 0x3030 goto pc+0
   R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  7: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  7: (d6) if w0 s<= 0x303030 goto pc+1
  9: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  9: (95) exit
  propagating r0

  from 6 to 7: safe
  4: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umin_value=808464433,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R10=fp0
  4: (7f) r0 >>= r0
  5: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umin_value=808464433,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R10=fp0
  5: (9c) w4 %= w0
  6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  6: (66) if w0 s> 0x3030 goto pc+0
   R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  propagating r0
  7: safe
  propagating r0

  from 6 to 7: safe
  processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1

The underlying program was xlated as follows:

  # bpftool p d x i 10
   0: (b7) r0 = 808464450
   1: (b4) w4 = 808464432
   2: (bc) w0 = w0
   3: (15) if r0 == 0x0 goto pc+1
   4: (9c) w4 %= w0
   5: (66) if w4 s> 0x30303030 goto pc+0
   6: (7f) r0 >>= r0
   7: (bc) w0 = w0
   8: (15) if r0 == 0x0 goto pc+1
   9: (9c) w4 %= w0
  10: (66) if w0 s> 0x3030 goto pc+0
  11: (d6) if w0 s<= 0x303030 goto pc+1
  12: (05) goto pc-1
  13: (95) exit

The verifier rewrote original instructions it recognized as dead code with
'goto pc-1', but reality differs from verifier simulation in that we are
actually able to trigger a hang due to hitting the 'goto pc-1' instructions.

Taking a closer look at the verifier analysis, the reason is that it misjudges
its pruning decision at the first 'from 6 to 7: safe' occasion. What happens
is that while both old/cur registers are marked as precise, they get misjudged
for the jmp32 case as range_within() yields true, meaning that the prior
verification path with a wider register bound could be verified successfully
and therefore the current path with a narrower register bound is deemed safe
as well whereas in reality it's not. R0 old/cur path's bounds compare as
follows:

  old: smin_value=0x8000000000000000,smax_value=0x7fffffffffffffff,umin_value=0x0,umax_value=0xffffffffffffffff,var_off=(0x0; 0xffffffffffffffff)
  cur: smin_value=0x8000000000000000,smax_value=0x7fffffff7fffffff,umin_value=0x0,umax_value=0xffffffff7fffffff,var_off=(0x0; 0xffffffff7fffffff)

  old: s32_min_value=0x80000000,s32_max_value=0x00003030,u32_min_value=0x00000000,u32_max_value=0xffffffff
  cur: s32_min_value=0x00003031,s32_max_value=0x7fffffff,u32_min_value=0x00003031,u32_max_value=0x7fffffff

The 64 bit bounds generally look okay and while the information that got
propagated from 32 to 64 bit looks correct as well, it's not precise enough
for judging a conditional jmp32. Given the latter only operates on subregisters
we also need to take these into account as well for a range_within() probe
in order to be able to prune paths. Extending the range_within() constraint
to both bounds will be able to tell us that the old signed 32 bit bounds are
not wider than the cur signed 32 bit bounds.

With the fix in place, the program will now verify the 'goto' branch case as
it should have been:

  [...]
  6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  6: (66) if w0 s> 0x3030 goto pc+0
   R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  7: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  7: (d6) if w0 s<= 0x303030 goto pc+1
  9: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  9: (95) exit

  7: R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=12337,u32_min_value=12337,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  7: (d6) if w0 s<= 0x303030 goto pc+1
   R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=3158065,u32_min_value=3158065,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  8: R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=3158065,u32_min_value=3158065,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  8: (30) r0 = *(u8 *)skb[808464432]
  BPF_LD_[ABS|IND] uses reserved fields
  processed 11 insns (limit 1000000) max_states_per_insn 1 total_states 1 peak_states 1 mark_read 1

The bug is quite subtle in the sense that when verifier would determine that
a given branch is dead code, it would (here: wrongly) remove these instructions
from the program and hard-wire the taken branch for privileged programs instead
of the 'goto pc-1' rewrites which will cause hard to debug problems.

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Anatoly Trosinenko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

bpf: Fix verifier jsgt branch analysis on max bound

Fix incorrect is_branch{32,64}_taken() analysis for the jsgt case. The return
code for both will tell the caller whether a given conditional jump is taken
or not, e.g. 1 means branch will be taken [for the involved registers] and the
goto target will be executed, 0 means branch will not be taken and instead we
fall-through to the next insn, and last but not least a -1 denotes that it is
not known at verification time whether a branch will be taken or not. Now while
the jsgt has the branch-taken case correct with reg->s32_min_value > sval, the
branch-not-taken case is off-by-one when testing for reg->s32_max_value < sval
since the branch will also be taken for reg->s32_max_value == sval. The jgt
branch analysis, for example, gets this right.

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Fixes: 4f7b3e82589e ("bpf: improve verifier branch analysis")
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
40GbE Intel Wired LAN Driver Updates 2021-02-08

This series contains updates to i40e driver only.

Cristian makes improvements to driver XDP path. Avoids writing
next-to-clean pointer on every update, removes redundant updates of
cleaned_count and buffer info, creates a helper function to consolidate
XDP actions and simplifies some of the behavior.

Eryk adds messages to inform the user when MTU is larger than supported
====================

Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) nf_conntrack_tuple_taken() needs to recheck zone for
NAT clash resolution, from Florian Westphal.

2) Restore support for stateful expressions when set definition
specifies no stateful expressions.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2021-02-08

This series contains updates to the ice driver and documentation.

Brett adds a log message when a trusted VF goes in and out of promiscuous
for consistency with i40e driver.

Dave implements a new LLDP command that allows adding VSI destinations to
existing filters and adds support for netdev bonding events, current
support is software based.

Michal refactors code to move from VSI stored xsk_buff_pools to
netdev-provided ones.

Kiran implements the creation scheduler aggregator nodes and distributing
VSIs within the nodes.

Ben modifies rate limit calculations to use clock frequency from the
hardware instead of using a hardcoded one.

Jesse adds support for user to control writeback frequency.

Chinh refactors DCB variables out of the ice_port_info struct.

Bruce removes some unnecessary casting.

Mitch fixes an error message that was reported as if_up instead of if_down.

Tony adjusts fallback allocation for MSI-X to use all given vectors instead
of using only the minimum configuration and updates documentation for
the ice driver.
====================

Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'hns3-cleanups'

Huazhong Tan says:

====================
net: hns3: some cleanups for -next

There are some cleanups for the HNS3 ethernet driver.

change log:
V2: remove previous #3 which should target net.

previous version:
V1: https://patchwork.kernel.org/project/netdevbpf/cover/1612784382 [email protected]/
====================

Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: cleanup for endian issue for VF RSS

Currently the RSS commands of VF are using host byte order.
According to the user manual, it should use little endian in
the command to firmware. For the host and firmware are both
using little endian, so it can work well in this case.

Do cleanup to make it more explicitly.

Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Huazhong tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: remove unused macro definition

Some macros are defined but unused, so remove them.

Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: remove an unused parameter in hclge_vf_rate_param_check()

Parameter vf in hclge_vf_rate_param_check() is unused now,
so remove it.

Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: remove redundant return value of hns3_uninit_all_ring()

Since hns3_uninit_all_ring() only returns 0, so remove this
redundant return value and function declaration in hns3_enet.h.

Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: change hclge_query_bd_num() param type

The type of parameter mpf_bd_num and pf_bd_num in
hclge_query_bd_num() should be u32* instead of int*,
so change them.

Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: change hclge_parse_speed() param type

The type of parameters in hclge_parse_speed() should be
unsigned type, so change them.

Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: modify some unmacthed types print parameter

Fix an issue where the formatting symbol of the formatting input and
output function does not match the actual type.

Signed-off-by: Jiaran Zhang <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: clean up unnecessary parentheses in macro definitions

In macro definitions, parentheses are unnecessary in some cases,
such as the calling parameter of a function, the left variable
of the equal sign, and so on. So remove these unnecessary
parentheses according to these rules.

Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: remove the shaper param magic number

To make the code more readable, this patch adds a definition for
the magic number 126 used for the default shaper param ir_b, and
rename macro DIVISOR_IR_B_126.

No functional change.

Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: remove redundant client_setup_tc handle

Since the real tx queue number and real rx queue number
always be updated when netdev opens, it's redundant
to call hclge_client_setup_tc to do the same thing.
So remove it.

Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: clean up some incorrect variable types in hclge_dbg_dump_tm_map()

queue_id, qset_id and other IDs are unsigned type, so modify
the corresponding local variables' type in hclge_dbg_dump_tm_map()
from signed to unsigned. kstrtouint() and the print format should
be updated as well.

Signed-off-by: Yonglong Liu <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vsock: fix locking in vsock_shutdown()

In vsock_shutdown() we touched some socket fields without holding the
socket lock, such as 'state' and 'sk_flags'.

Also, after the introduction of multi-transport, we are accessing
'vsk->transport' in vsock_send_shutdown() without holding the lock
and this call can be made while the connection is in progress, so
the transport can change in the meantime.

To avoid issues, we hold the socket lock when we enter in
vsock_shutdown() and release it when we leave.

Among the transports that implement the 'shutdown' callback, only
hyperv_transport acquired the lock. Since the caller now holds it,
we no longer take it.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Stefano Garzarella <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'implement-kthread-based-napi-poll'

Wei Wang says:

====================
implement kthread based napi polle

The idea of moving the napi poll process out of softirq context to a
kernel thread based context is not new.
Paolo Abeni and Hannes Frederic Sowa have proposed patches to move napi
poll to kthread back in 2016. And Felix Fietkau has also proposed
patches of similar ideas to use workqueue to process napi poll just a
few weeks ago.

The main reason we'd like to push forward with this idea is that the
scheduler has poor visibility into cpu cycles spent in softirq context,
and is not able to make optimal scheduling decisions of the user threads.
For example, we see in one of the application benchmark where network
load is high, the CPUs handling network softirqs has ~80% cpu util. And
user threads are still scheduled on those CPUs, despite other more idle
cpus available in the system. And we see very high tail latencies. In this
case, we have to explicitly pin away user threads from the CPUs handling
network softirqs to ensure good performance.
With napi poll moved to kthread, scheduler is in charge of scheduling both
the kthreads handling network load, and the user threads, and is able to
make better decisions. In the previous benchmark, if we do this and we
pin the kthreads processing napi poll to specific CPUs, scheduler is
able to schedule user threads away from these CPUs automatically.

And the reason we prefer 1 kthread per napi, instead of 1 workqueue
entity per host, is that kthread is more configurable than workqueue,
and we could leverage existing tuning tools for threads, like taskset,
chrt, etc to tune scheduling class and cpu set, etc. Another reason is
if we eventually want to provide busy poll feature using kernel threads
for napi poll, kthread seems to be more suitable than workqueue.
Furthermore, for large platforms with 2 NICs attached to 2 sockets,
kthread is more flexible to be pinned to different sets of CPUs.

In this patch series, I revived Paolo and Hannes's patch in 2016 and
made modifications. Then there are changes proposed by Felix, Jakub,
Paolo and myself on top of those, with suggestions from Eric Dumazet.

In terms of performance, I ran tcp_rr tests with 1000 flows with
various request/response sizes, with RFS/RPS disabled, and compared
performance between softirq vs kthread vs workqueue (patchset proposed
by Felix Fietkau).
Host has 56 hyper threads and 100Gbps nic, 8 rx queues and only 1 numa
node. All threads are unpinned.

        req/resp   QPS   50%tile    90%tile    99%tile    99.9%tile
softirq   1B/1B   2.75M   337us       376us      1.04ms     3.69ms
kthread   1B/1B   2.67M   371us       408us      455us      550us
workq     1B/1B   2.56M   384us       435us      673us      822us

softirq 5KB/5KB   1.46M   678us       750us      969us      2.78ms
kthread 5KB/5KB   1.44M   695us       789us      891us      1.06ms
workq   5KB/5KB   1.34M   720us       905us     1.06ms      1.57ms

softirq 1MB/1MB   11.0K   79ms       166ms      306ms       630ms
kthread 1MB/1MB   11.0K   75ms       177ms      303ms       596ms
workq   1MB/1MB   11.0K   79ms       180ms      303ms       587ms

When running workqueue implementation, I found the number of threads
used is usually twice as much as kthread implementation. This probably
introduces higher scheduling cost, which results in higher tail
latencies in most cases.

I also ran an application benchmark, which performs fixed qps remote SSD
read/write operations, with various sizes. Again, both with RFS/RPS
disabled.
The result is as follows:
         op_size  QPS   50%tile 95%tile 99%tile 99.9%tile
softirq   4K     572.6K   385us   1.5ms  3.16ms   6.41ms
kthread   4K     572.6K   390us   803us  2.21ms   6.83ms
workq     4k     572.6K   384us   763us  3.12ms   6.87ms

softirq   64K    157.9K   736us   1.17ms 3.40ms   13.75ms
kthread   64K    157.9K   745us   1.23ms 2.76ms    9.87ms
workq     64K    157.9K   746us   1.23ms 2.76ms    9.96ms

softirq   1M     10.98K   2.03ms  3.10ms  3.7ms   11.56ms
kthread   1M     10.98K   2.13ms  3.21ms  4.02ms  13.3ms
workq     1M     10.98K   2.13ms  3.20ms  3.99ms  14.12ms

In this set of tests, the latency is predominant by the SSD operation.
Also, the user threads are much busier compared to tcp_rr tests. We have
to pin the kthreads/workqueue threads to limit to a few CPUs, to not
disturb user threads, and provide some isolation.

Changes since v9:
Small change in napi_poll() in patch 1.
Split napi_kthread_stop() functionality to add separately in
napi_disable() and netif_napi_del() in patch 2.
Add description for napi_set_threaded() and return dev->threaded when
dev->napi_list is empty for threaded sysfs in patch 3.

Changes since v8:
Added description for threaded param in struct net_device in patch 2.

Changes since v7:
Break napi_set_threaded() into 2 parts, one to create kthread called
from netif_napi_add(), the other to set threaded bit in napi_enable(),
to get rid of inconsistency through all napi in 1 dev.
Added documentation for /sys/class/net/<dev>/threaded.

Changes since v6:
Added memory barrier in napi_set_threaded().
Changed /sys/class/net/<dev>/thread to a ternary value.
Change dev->threaded to a bit instead of bool.

Changes since v5:
Removed ASSERT_RTNL() from napi_set_threaded() and removed rtnl_lock()
operation from napi_enable().

Changes since v4:
Recorded the threaded setting in dev and restore it in napi_enable().

Changes since v3:
Merged and rearranged patches in a logical order for easier review.
Changed sysfs control to be per device.

Changes since v2:
Corrected typo in patch 1, and updated the cover letter with more
detailed and updated test results.

Changes since v1:
Replaced kthread_create() with kthread_run() in patch 5 as suggested by
Felix Fietkau.

Changes since RFC:
Renamed the kthreads to be napi/<dev>-<napi_id> in patch 5 as suggested
by Hannes Frederic Sowa.
====================

Signed-off-by: David S. Miller <[email protected]>

net: add sysfs attribute to control napi threaded mode

This patch adds a new sysfs attribute to the network device class.
Said attribute provides a per-device control to enable/disable the
threaded mode for all the napi instances of the given network device,
without the need for a device up/down.
User sets it to 1 or 0 to enable or disable threaded mode.
Note: when switching between threaded and the current softirq based mode
for a napi instance, it will not immediately take effect if the napi is
currently being polled. The mode switch will happen for the next time
napi_schedule() is called.

Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Co-developed-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Co-developed-by: Felix Fietkau <[email protected]>
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: implement threaded-able napi poll loop support

This patch allows running each napi poll loop inside its own
kernel thread.
The kthread is created during netif_napi_add() if dev->threaded
is set. And threaded mode is enabled in napi_enable(). We will
provide a way to set dev->threaded and enable threaded mode
without a device up/down in the following patch.

Once that threaded mode is enabled and the kthread is
started, napi_schedule() will wake-up such thread instead
of scheduling the softirq.

The threaded poll loop behaves quite likely the net_rx_action,
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.

Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Co-developed-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Co-developed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: extract napi poll functionality to __napi_poll()

This commit introduces a new function __napi_poll() which does the main
logic of the existing napi_poll() function, and will be called by other
functions in later commits.
This idea and implementation is done by Felix Fietkau <[email protected]> and
is proposed as part of the patch to move napi work to work_queue
context.
This commit by itself is a code restructure.

Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'hns3-fixes'

Huazhong Tan says:

====================
net: hns3: fixes for -net

The parameters sent from vf may be unreliable. If these
parameters are used directly, memory overwriting may occur.

So this series adds some checks for this case.
====================

Signed-off-by: David S. Miller <[email protected]>

net: hns3: add a check for index in hclge_get_rss_key()

The index is received from vf, if use it directly,
an out-of-bound issue may be caused, so add a check for
this index before using it in hclge_get_rss_key().

Fixes: a638b1d8cc87 ("net: hns3: fix get VF RSS issue")
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()

The tqp_index is received from vf, if use it directly,
an out-of-bound issue may be caused, so add a check for
this tqp_index before using it in hclge_get_ring_chain_from_mbx().

Fixes: 84e095d64ed9 ("net: hns3: Change PF to add ring-vect binding & resetQ to mailbox")
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: add a check for queue_id in hclge_reset_vf_queue()

The queue_id is received from vf, if use it directly,
an out-of-bound issue may be caused, so add a check for
this queue_id before using it in hclge_reset_vf_queue().

Fixes: 1a426f8b40fc ("net: hns3: fix the VF queue reset flow error")
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: felix: implement port flushing on .phylink_mac_link_down

There are several issues which may be seen when the link goes down while
forwarding traffic, all of which can be attributed to the fact that the
port flushing procedure from the reference manual was not closely
followed.

With flow control enabled on both the ingress port and the egress port,
it may happen when a link goes down that Ethernet packets are in flight.
In flow control mode, frames are held back and not dropped. When there
is enough traffic in flight (example: iperf3 TCP), then the ingress port
might enter congestion and never exit that state. This is a problem,
because it is the egress port's link that went down, and that has caused
the inability of the ingress port to send packets to any other port.
This is solved by flushing the egress port's queues when it goes down.

There is also a problem when performing stream splitting for
IEEE 802.1CB traffic (not yet upstream, but a sort of multicast,
basically). There, if one port from the destination ports mask goes
down, splitting the stream towards the other destinations will no longer
be performed. This can be traced down to this line:

ocelot_port_writel(ocelot_port, 0, DEV_MAC_ENA_CFG);

which should have been instead, as per the reference manual:

ocelot_port_rmwl(ocelot_port, 0, DEV_MAC_ENA_CFG_RX_ENA,
DEV_MAC_ENA_CFG);

Basically only DEV_MAC_ENA_CFG_RX_ENA should be disabled, but not
DEV_MAC_ENA_CFG_TX_ENA - I don't have further insight into why that is
the case, but apparently multicasting to several ports will cause issues
if at least one of them doesn't have DEV_MAC_ENA_CFG_TX_ENA set.

I am not sure what the state of the Ocelot VSC7514 driver is, but
probably not as bad as Felix/Seville, since VSC7514 uses phylib and has
the following in ocelot_adjust_link:

if (!phydev->link)
return;

therefore the port is not really put down when the link is lost, unlike
the DSA drivers which use .phylink_mac_link_down for that.

Nonetheless, I put ocelot_port_flush() in the common ocelot.c because it
needs to access some registers from drivers/net/ethernet/mscc/ocelot_rew.h
which are not exported in include/soc/mscc/ and a bugfix patch should
probably not move headers around.

Fixes: bdeced75b13f ("net: dsa: felix: Add PCS operations for PHYLINK")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: broadcom: bcm4908enet: add BCM4908 controller driver

BCM4908 SoCs family uses Ethernel controller that includes UniMAC but
uses different DMA engine (than other controllers) and requires
different programming.

Signed-off-by: Rafał Miłecki <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dt-bindings: net: document BCM4908 Ethernet controller

BCM4908 is a family of SoCs with integrated Ethernet controller.

Signed-off-by: Rafał Miłecki <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-02-09

1) Support TSO on xfrm interfaces.
   From Eyal Birger.

2) Variable calculation simplifications in esp4/esp6.
   From Jiapeng Chong / Jiapeng Zhong.

3) Fix a return code in xfrm_do_migrate.
   From Zheng Yongjun.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <[email protected]>

Documentation: networking: ip-sysctl: Document src_valid_mark sysctl

Provide documentation for src_valid_mark sysctl, which was added
in commit 28f6aeea3f12 ("net: restore ip source validation").

Signed-off-by: Jay Vosburgh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>