Mohsin Bashir [Tue, 12 Nov 2024 22:26:05 +0000 (14:26 -0800)]
eth: fbnic: Add support to dump registers
Add support for the 'ethtool -d <dev>' command to retrieve and print
a register dump for fbnic. The dump defaults to version 1 and consists
of two parts: all the register sections that can be dumped linearly, and
an RPC RAM section that is structured in an interleaved fashion and
requires special handling. For each register section, the dump also
contains the start and end boundary information which can simplify parsing.
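To show the shape such a dump implementation typically takes, here is a
minimal sketch of the relevant ethtool_ops hooks; the fbnic_* helpers and
the FBNIC_REGS_LEN constant below are assumptions for illustration, not
the actual driver code:

  #include <linux/ethtool.h>
  #include <linux/netdevice.h>

  #define FBNIC_REGS_LEN 4096	/* hypothetical dump size in words */

  /* hypothetical helpers; the first returns the advanced buffer pointer */
  u32 *fbnic_dump_linear_sections(struct net_device *netdev, u32 *buf);
  void fbnic_dump_rpc_ram(struct net_device *netdev, u32 *buf);

  static int fbnic_get_regs_len(struct net_device *netdev)
  {
          return FBNIC_REGS_LEN * sizeof(u32);
  }

  static void fbnic_get_regs(struct net_device *netdev,
                             struct ethtool_regs *regs, void *p)
  {
          u32 *buf = p;

          regs->version = 1;	/* dump defaults to version 1 */

          /* linear sections first, then the interleaved RPC RAM
           * section that requires special handling
           */
          buf = fbnic_dump_linear_sections(netdev, buf);
          fbnic_dump_rpc_ram(netdev, buf);
  }

  static const struct ethtool_ops fbnic_ethtool_ops = {
          .get_regs_len = fbnic_get_regs_len,
          .get_regs     = fbnic_get_regs,
  };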
Tristram Ha [Sat, 9 Nov 2024 01:57:05 +0000 (17:57 -0800)]
net: dsa: microchip: Add LAN9646 switch support to KSZ DSA driver
LAN9646 switch is a 6-port switch with functions like KSZ9897. It has
4 internal PHYs and 1 SGMII port. The chip id read from hardware is the
same as KSZ9477, so the software driver needs to create a new chip id
and group the allowable functions under its chip data structure to
differentiate the product.
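As a rough sketch, the new chip-data entry might look as follows; the
LAN9646_CHIP_ID value and the exact field set are assumptions, loosely
modeled on the ksz driver's chip-data table:

  /* software-defined id, since the hardware id reads back as KSZ9477 */
  #define LAN9646_CHIP_ID 0x00964600	/* hypothetical value */

  static const struct ksz_chip_data lan9646_chip_data = {
          .chip_id  = LAN9646_CHIP_ID,
          .dev_name = "LAN9646",
          .port_cnt = 6,	/* 4 internal PHYs + 1 SGMII */
          /* only the functions LAN9646 actually supports are wired
           * up here, differentiating it from the plain KSZ9477
           */
  };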
====================
Support external snapshots on dwmac1000
The main change since v3 is the move of the fifo flush wait in the
ptp_clock_info enable() function within the mutex that protects the ptp
registers. Thanks Jakub and Paolo for spotting this.
This series also aggregates Daniel's reviews, except for patch 4,
which was modified since then.
net: stmmac: dwmac_socfpga: This platform has GMAC
Indicate that dwmac_socfpga has a gmac. This will make sure that
gmac-specific interrupt processing is done, including timestamp
interrupt handling. Without this, the external snapshot interrupt is
never ack'd and we have an interrupt storm on external snapshot event.
net: stmmac: Configure only the relevant bits for timestamping setup
The PTP_TCR (Timestamp Control Register) is used to configure several
features related to packet timestamping.
On one hand, it configures the 1588 packet processing, to indicate what
types of frames should be timestamped (all, only 1588v1 or 1588v2, using
L2 or L4 timestamping, on IPv4 or IPv6, etc.). This is usually
configured through the dedicated ioctl / ndo for such setup. This
configuration is done by setting some fields in that register, that seem
to behave the same way on all dwmac variants, including DWMAC1000.
On the other hand, and only on DWMAC1000 apparently, some fields in that
register are used to configure external snapshots (bits 24/25).
On DWMAC4 and others, these fields are reserved and external
snapshots are configured through a dedicated register that simply
doesn't seem to exist on DWMAC1000.
This configuration is done in the dwmac1000-specific ptp_clock_info ops
(cf dwmac1000_ptp_enable()).
So to avoid the timestamping configuration interfering with the external
snapshots, this commit makes sure that the config_hw_tstamping only
configures the relevant bits in PTP_TCR, so that the DWMAC1000
timestamping can correctly rely on these otherwise reserved fields.
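Reduced to its core, the change amounts to a read-modify-write that
leaves the snapshot bits alone; a minimal sketch, where
PTP_TCR_CONFIG_MASK stands in for whichever 1588-processing fields are
actually owned by the timestamping setup:

  #define PTP_TCR_CONFIG_MASK GENMASK(18, 0)	/* hypothetical mask */

  static void stmmac_set_tstamp_config(void __iomem *ptpaddr, u32 value)
  {
          u32 tcr = readl(ptpaddr + PTP_TCR);

          /* clear only the 1588 packet-processing fields... */
          tcr &= ~PTP_TCR_CONFIG_MASK;
          /* ...then apply the requested config (TSENA, TSVER2ENA, ...) */
          tcr |= value;

          /* bits 24/25 (DWMAC1000 external snapshot) stay untouched */
          writel(tcr, ptpaddr + PTP_TCR);
  }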
net: stmmac: Enable timestamping interrupt on dwmac1000
The default configuration for the interrupts on dwmac1000 has the
timestamping interrupt masked. Now that the timestamping has been
adapted to dwmac1000, enable the timestamping interrupt on these
platforms.
On dwmac1000, the external snapshot interrupt is configured through a
dedicated bit that is reserved on other dwmac variants. The
timestamping interrupt is acknowledged by reading the
GMAC3_X_TIMESTAMP_STATUS register.
Make sure that this interrupt is enabled when snapshot is enabled, and
masked when disabled.
In GMAC3_X, the timestamping configuration differs from GMAC4 in the
layout of the registers accessed to grab the number of snapshots in FIFO
as well as the register offset to grab the aux snapshot timestamp.
Introduce dedicated ops to configure timestamping on dwmac100 and
dwmac1000. The latency correction doesn't seem to exist on GMAC3, so its
corresponding operation isn't populated.
net: stmmac: Introduce dwmac1000 ptp_clock_info and operations
The PTP configuration for GMAC3_X differs from the other implementations
in several ways:
- There's only one external snapshot trigger
- The snapshot configuration is done through the PTP_TCR register,
whereas the other dwmac variants have a dedicated ACR (auxiliary
control reg) for that purpose
- The layout for the PTP_TCR register also differs, as bits 24/25 are
used for the snapshot configuration. These bits are reserved on other
variants.
On GMAC3_X, we also can't discover the number of snapshot triggers
automatically.
The GMAC3_X has one PPS output; however, its configuration isn't
supported yet, so report n_per_out as 0 for now.
Introduce a dedicated set of ptp_clock_info ops and configuration
parameters to reflect these differences specific to GMAC3_X.
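A sketch of what such a dedicated clock description can look like; the
dwmac1000_ptp_enable name matches the text above, the remaining details
are assumptions:

  static int dwmac1000_ptp_enable(struct ptp_clock_info *ptp,
                                  struct ptp_clock_request *rq, int on);

  static struct ptp_clock_info dwmac1000_ptp_clock_ops = {
          .owner     = THIS_MODULE,
          .name      = "stmmac ptp",
          .n_ext_ts  = 1,	/* single external snapshot trigger */
          .n_per_out = 0,	/* one PPS output exists, not supported yet */
          .enable    = dwmac1000_ptp_enable,	/* drives PTP_TCR bits 24/25 */
          /* adjfine/adjtime/gettime64/settime64 as in the common ops */
  };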
net: stmmac: Only update the auto-discovered PTP clock features
Some DWMAC variants such as dwmac1000 don't support discovering the
number of output pps and auxiliary snapshots. Allow these parameters to
be defined in default ptp_clock_info, and let them be updated only when
the feature discovery yielded a result.
The auxiliary snapshot configuration was found to differ depending on
the dwmac version. To prepare for supporting this, allow specifying the
ptp_clock_info ops in the hwif array.
net: stmmac: Don't modify the global ptp ops directly
The stmmac_ptp_clock_ops are copied into the stmmac_priv structure
before being registered to the PTP core. Some adjustments are made prior
to that, such as the number of snapshots or max adjustment parameters.
Instead of modifying the global definition, then copying into the local
private data, let's first copy then modify the local parameters.
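In other words, the pattern becomes the following (a simplified sketch;
which fields get adjusted differs per platform):

  /* copy the global template first... */
  priv->ptp_clock_ops = stmmac_ptp_clock_ops;

  /* ...then adjust only the local copy before registration */
  if (priv->plat->ptp_max_adj)
          priv->ptp_clock_ops.max_adj = priv->plat->ptp_max_adj;
  priv->ptp_clock_ops.n_ext_ts = priv->dma_cap.aux_snapshot_n;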
Luo Yifan [Wed, 13 Nov 2024 01:11:42 +0000 (09:11 +0800)]
ynl: samples: Fix the wrong format specifier
Make a minor change to eliminate a static checker warning. The type
of s->ifc is unsigned int, so the correct format specifier should be
%u instead of %d.
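The fix boils down to the following (minimal illustration):

  #include <stdio.h>

  int main(void)
  {
          unsigned int ifc = 3;	/* s->ifc is unsigned int */

          printf("%u\n", ifc);	/* correct; %d would be for signed int */
          return 0;
  }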
====================
tools: ynl: two patches to ease building with rpmbuild
I'm looking to build and package ynl for Fedora and Centos Stream users.
Default rpmbuild has a couple of hardening options enabled by default [1][2],
which currently prevent ynl from building.
This series contains 2 small patches to address it.
Jan Stancek [Tue, 12 Nov 2024 08:21:33 +0000 (09:21 +0100)]
tools: ynl: extend CFLAGS to keep options from environment
Package build environments like Fedora rpmbuild introduce hardening
options (e.g. -pie -Wl,-z,now) by passing a -spec option to CFLAGS
and LDFLAGS.
ynl Makefiles currently override CFLAGS but not LDFLAGS, which leads
to a mismatch and build failure:
CC sample devlink
/usr/bin/ld: devlink.o: relocation R_X86_64_32 against symbol `ynl_devlink_family' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: failed to set dynamic section sizes: bad value
collect2: error: ld returned 1 exit status
Extend CFLAGS to support hardening options set by build environment.
Jakub Kicinski [Thu, 14 Nov 2024 02:35:18 +0000 (18:35 -0800)]
Merge tag 'wireless-next-2024-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.13
Most likely the last -next pull request for v6.13. Most changes are in
Realtek and Qualcomm drivers, otherwise not really anything
noteworthy.
Major changes:
mac80211
* EHT 1024 aggregation size for transmissions
ath12k
* switch to using wiphy_lock() and remove ar->conf_mutex
* firmware coredump collection support
* add debugfs support for a multitude of statistics
ath11k
* dt: document WCN6855 hardware inputs
ath9k
* remove include/linux/ath9k_platform.h
ath5k
* Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
* tag 'wireless-next-2024-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (154 commits)
Revert "wifi: iwlegacy: do not skip frames with bad FCS"
wifi: mac80211: pass MBSSID config by reference
wifi: mac80211: Support EHT 1024 aggregation size in TX
net: rfkill: gpio: Add check for clk_enable()
wifi: brcmfmac: Fix oops due to NULL pointer dereference in brcmf_sdiod_sglist_rw()
wifi: Switch back to struct platform_driver::remove()
wifi: ipw2x00: libipw_rx_any(): fix bad alignment
wifi: brcmfmac: release 'root' node in all execution paths
wifi: iwlwifi: mvm: don't call power_update_mac in fast suspend
wifi: iwlwifi: s/IWL_MVM_INVALID_STA/IWL_INVALID_STA
wifi: iwlwifi: bump minimum API version in BZ/SC to 92
wifi: iwlwifi: move IWL_LMAC_*_INDEX to fw/api/context.h
wifi: iwlwifi: be less noisy if the NIC is dead in S3
wifi: iwlwifi: mvm: tell iwlmei when we finished suspending
wifi: iwlwifi: allow fast resume on ax200
wifi: iwlwifi: mvm: support new initiator and responder command version
wifi: iwlwifi: mvm: use wiphy locked debugfs for low-latency
wifi: iwlwifi: mvm: MLO scan upon channel condition degradation
wifi: iwlwifi: mvm: support new versions of the wowlan APIs
wifi: iwlwifi: mvm: allow always calling iwl_mvm_get_bss_vif()
...
====================
David S. Miller [Wed, 13 Nov 2024 13:06:04 +0000 (13:06 +0000)]
Merge branch 'phy-mediatek-reorg'
Sky Huang says:
====================
Re-organize MediaTek ethernet phy drivers and propose mtk-phy-lib
This patchset comes from patch 1/9, 3/9, 4/9, 5/9 and 7/9 of:
https://lore.kernel.org/netdev/20241004102413[email protected]/
This patchset changes MediaTek's ethernet phy's folder structure and
integrates helper functions, including LED & token ring manipulation,
into mtk-phy-lib.
---
Change in v2:
- Add correct Reviewed-by tag in each patch.
Change in v3:
[patch 4/5]
- Fix kernel test robot error by adding missing MTK_NET_PHYLIB.
====================
SkyLake.Huang [Fri, 8 Nov 2024 16:34:52 +0000 (00:34 +0800)]
net: phy: mediatek: Move LED helper functions into mtk phy lib
This patch creates mtk-phy-lib.c & mtk-phy.h and integrates mtk-ge-soc.c's
LED helper functions so that we can use those helper functions in other
MTK ethernet phy drivers.
Re-organize MediaTek ethernet phy driver files and get ready to integrate
some common functions and add a new 2.5G phy driver.
mtk-ge.c: MT7530 Gphy on MT7621 & MT7531 Gphy
mtk-ge-soc.c: Built-in Gphy on MT7981 & Built-in switch Gphy on MT7988
mtk-2p5ge.c: Planned for built-in 2.5G phy on MT7988
David S. Miller [Wed, 13 Nov 2024 11:57:12 +0000 (11:57 +0000)]
Merge branch 'octeontx2-rvu-rep'
Geetha sowjanya says:
====================
Introduce RVU representors
This series adds representor support for each rvu device.
When switchdev mode is enabled, a representor netdev is registered
for each rvu device. In the implementation of the representor model,
one NIX HW LF with multiple SQ and RQ is reserved, where each
RQ and SQ of the LF are mapped to a representor. A loopback channel
is reserved to support packet path between representors and VFs.
CN10K silicon supports 2 types of MACs, RPM and SDP. This
patch set adds representor support for both RPM and SDP MAC
interfaces.
- Patch 1: Implements basic representor driver.
- Patch 2: Add devlink support to create representor netdevs that
can be used to manage VFs.
- Patch 3: Implements basic netdev_ndo_ops.
- Patch 4: Installs tcam rules to route packets between representor and
VFs.
- Patch 5: Enables fetching VF stats via representor interface
- Patch 6: Adds support to sync link state between representors and VFs.
- Patch 7: Enables configuring VF MTU via representor netdevs.
- Patch 8: Adds representors for sdp MAC.
- Patch 9: Adds devlink port support.
- Patch 10: Implements offload stats.
- Patch 11: Implements tc offload support.
- patch 12: Adds documentation for rvu port representor.
Command to create PF/VF representor:
~# devlink dev eswitch set pci/0002:1c:00.0 mode switchdev
Rpf1vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether f6:43:83:ee:26:21 brd ff:ff:ff:ff:ff:ff
Rpf1vf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 12:b2:54:0e:24:54 brd ff:ff:ff:ff:ff:ff
Rpf1vf2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 4a:12:c4:4c:32:62 brd ff:ff:ff:ff:ff:ff
Rpf1vf3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether ca:cb:68:0e:e2:6e brd ff:ff:ff:ff:ff:ff
Rpf2vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 06:cc:ad:b4:f0:93 brd ff:ff:ff:ff:ff:ff
~# devlink port
pci/0002:1c:00.0/0: type eth netdev Rpf1vf0 flavour physical port 0 splittable false
pci/0002:1c:00.0/1: type eth netdev Rpf1vf1 flavour pcivf controller 0 pfnum 1 vfnum 1 external false splittable false
pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
pci/0002:1c:00.0/3: type eth netdev Rpf1vf3 flavour pcivf controller 0 pfnum 1 vfnum 3 external false splittable false
-----------
v11:
- Submitted refactoring changes as a separate patch set.
https://lore.kernel.org/netdev/20241023161843[email protected]/T/
- Moved documentation to a separate patch.
- patch 9: Added code changes to forward updated mac address to VF.
- Implemented TC offload support.
v10-v11:
- As suggested by "Jiri Pirko" adjusted the documentation.
- Added more commit description to patch1.
Geetha sowjanya [Thu, 7 Nov 2024 16:08:35 +0000 (21:38 +0530)]
octeontx2-pf: Add representors for sdp MAC
Hardware supports different types of MACs, e.g. RPM, SDP, LBK.
LBK is for internal Tx->Rx HW loopback path. RPM and SDP MACs support
ingress/egress pkt IO on interfaces with different set of capabilities
like interface modes. At the time of netdev driver registration PF will
seek MAC related information from Admin function driver
'drivers/net/ethernet/marvell/octeontx2/af' and sets up ingress/egress
queues etc. such that pkt IO on the channels of these different MACs is
possible. This patch adds representors for SDP MAC.
Geetha sowjanya [Thu, 7 Nov 2024 16:08:34 +0000 (21:38 +0530)]
octeontx2-pf: Configure VF mtu via representor
Adds support to manage the mtu configuration for a VF through its
representor. On update of the representor mtu, a mbox notification is
sent to the VF to update its mtu.
This feature is implemented based on the "Network Function Representors"
kernel documentation.
"
Setting an MTU on the representor should cause that same MTU
to be reported to the representee.
"
Geetha sowjanya [Thu, 7 Nov 2024 16:08:33 +0000 (21:38 +0530)]
octeontx2-pf: Add support to sync link state between representor and VFs
Implements the below requirement mentioned
in the representors documentation.
"
The representee's link state is controlled through the
representor. Setting the representor administratively UP
or DOWN should cause carrier ON or OFF at the representee.
"
This patch enables:
- Reflecting the link state of the representor based on the VF state, and
  the link state of the VF based on the representor.
- On VF interface up/down, a notification is sent via mbox to the
  representor to update the link state.
  eg: ip link set eth0 up/down will toggle carrier on/off
  of the corresponding representor (r0p1) interface.
- Representor interface up/down causes a link state update of the VF.
  eg: ip link set r0p1 up/down will toggle carrier on/off
  of the corresponding representee (eth0) interface.
Geetha sowjanya [Thu, 7 Nov 2024 16:08:31 +0000 (21:38 +0530)]
octeontx2-af: Add packet path between representor and VF
Current HW does not support an in-built switch which would forward pkts
between representee and representor. When a representor is put under
a bridge and pkts need to be sent to the representee, pkts from the
representor are sent on a HW internal loopback channel, which again
will be punted to the ingress pkt parser. The rules that this patch
installs are the MCAM filters/rules which will match against these
pkts and forward them to the representee.
The rules that this patch installs are for basic
representor <=> representee path similar to Tun/TAP between VM and
Host.
Geetha sowjanya [Thu, 7 Nov 2024 16:08:29 +0000 (21:38 +0530)]
octeontx2-pf: Create representor netdev
Adds initial devlink support to set/get the switchdev mode.
Representor netdevs are created for each rvu device when
the switch mode is set to 'switchdev'. These netdevs are
used to control and configure VFs.
Geetha sowjanya [Thu, 7 Nov 2024 16:08:28 +0000 (21:38 +0530)]
octeontx2-pf: RVU representor driver
Adds a basic driver for the RVU representor.
On probe, the driver does pci-specific initialization and
hw resource configuration. Introduces the RVU_ESWITCH
kernel config to enable/disable the driver. Representors
and the NIC share code, but representor netdevs support a
subset of the NIC functionality. Hence the "otx2_rep_dev" API
helps to skip the initialization of features that are not
supported by the representors.
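A sketch of how such a check can gate feature setup; the PCI device ID
macro and the call site below are assumptions:

  /* representors probe with their own PCI device ID (hypothetical macro) */
  static inline bool otx2_rep_dev(struct pci_dev *pdev)
  {
          return pdev->device == PCI_DEVID_RVU_REP;
  }

  /* in shared init code: skip what representors don't support */
  if (!otx2_rep_dev(pf->pdev))
          otx2_setup_feature(pf);	/* hypothetical feature hook */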
Jakub Kicinski [Sat, 9 Nov 2024 02:33:03 +0000 (18:33 -0800)]
net: page_pool: do not count normal frag allocation in stats
Commit 0f6deac3a079 ("net: page_pool: add page allocation stats for
two fast page allocate path") added increments for "fast path"
allocation to page frag alloc. It mentions performance degradation
analysis but the details are unclear. Could be that the author
was simply surprised by the alloc stats not matching packet count.
In my experience the key metric for page pool is the recycling rate.
Page return stats, however, count returned _pages_ not frags.
This makes it impossible to calculate recycling rate for drivers
using the frag API. Here is example output of the page-pool
YNL sample for a driver allocating 1200B frags (4k pages)
with nearly perfect recycling:
The recycling rate is reported as 33.3% because we give out
4096 // 1200 = 3 frags for every recycled page.
Effectively revert the aforementioned commit. This also aligns
with the stats we would see for drivers which do the fragmentation
themselves, although that's not a strong reason in itself.
On the (very unlikely) path where we can reuse the current page
let's bump the "cached" stat. The fact that we don't put the page
in the cache is just an optimization.
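With returns and allocations both counted in pages, a recycling rate can
be derived from the stats API; one plausible way to compute it (a sketch
using the in-kernel stats structures):

  #include <linux/math64.h>
  #include <net/page_pool/types.h>
  #include <net/page_pool/helpers.h>

  static u64 pp_recycle_rate_pct(struct page_pool *pool)
  {
          struct page_pool_stats stats = {};
          u64 alloc, recycled;

          if (!page_pool_get_stats(pool, &stats))
                  return 0;	/* CONFIG_PAGE_POOL_STATS disabled */

          /* both sides now count pages, not frags */
          alloc = stats.alloc_stats.fast + stats.alloc_stats.slow;
          recycled = stats.recycle_stats.cached + stats.recycle_stats.ring;

          return alloc ? div64_u64(recycled * 100, alloc) : 0;
  }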
Jakub Kicinski [Sat, 9 Nov 2024 03:51:19 +0000 (19:51 -0800)]
eth: bnxt: use page pool for head frags
Testing small size RPCs (300B-400B) on a large AMD system suggests
that page pool recycling is very useful even for just the head frags.
With this patch (and copy break disabled) I see a 30% performance
improvement (82Gbps -> 106Gbps).
Convert bnxt from normal page frags to page pool frags for head buffers.
On systems with small page size we can use the same pool as for TPA
pages. On systems with large pages the frag allocation logic of the
page pool is already used to split a large page into TPA chunks.
TPA chunks are much larger than heads (8k or 64k, AFAICT vs 1kB)
and we always allocate the same sized chunks. Mixing allocation
of TPA and head pages would lead to sub-optimal memory use.
Plus Taehee's work on zero-copy / devmem will need to differentiate
between TPA and non-TPA page pool, anyway. Conditionally allocate
a new page pool for heads.
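Conceptually the conditional allocation looks like this (a sketch, not
the driver code; the sizing and field choices are assumptions):

  struct page_pool_params pp = {
          .order     = 0,
          .pool_size = bp->rx_ring_size,	/* hypothetical sizing */
          .nid       = numa_node,
          .dev       = &bp->pdev->dev,
          .dma_dir   = DMA_FROM_DEVICE,
  };

  if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE)
          rxr->head_pool = rxr->page_pool;	 /* share the TPA pool */
  else
          rxr->head_pool = page_pool_create(&pp); /* dedicated head pool */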
Andrew Lunn [Sun, 10 Nov 2024 17:59:55 +0000 (18:59 +0100)]
dsa: qca8k: Use nested lock to avoid splat
qca8k_phy_eth_command() is used to probe the child MDIO bus while the
parent MDIO is locked. This causes lockdep splat, reporting a possible
deadlock. It is not an actual deadlock, because different locks are
used. By making use of mutex_lock_nested() we can avoid this false
positive.
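The pattern, schematically (mutex_lock_nested() and the
MDIO_MUTEX_NESTED subclass are the real kernel primitives; the
surrounding code is abridged):

  /* the parent MDIO bus lock is already held by the caller; annotate
   * the child bus lock with a different lockdep subclass so the
   * nesting is not reported as a possible self-deadlock
   */
  mutex_lock_nested(&child_bus->mdio_lock, MDIO_MUTEX_NESTED);
  ret = __mdiobus_read(child_bus, addr, regnum);
  mutex_unlock(&child_bus->mdio_lock);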
Alf reports that this commit causes the connection to eventually die on
iwl4965. The reason is that rx_status.flag is zeroed after
RX_FLAG_FAILED_FCS_CRC is set and mac80211 doesn't know the received frame is
corrupted.
MeiChia Chiu [Tue, 12 Nov 2024 08:38:46 +0000 (16:38 +0800)]
wifi: mac80211: Support EHT 1024 aggregation size in TX
The 1024 agg size for RX is supported but not for TX.
This patch adds this support and refactors the common parsing logic for
addbaext in both process_addba_resp and process_addba_req into a
function.
Jakub Kicinski [Fri, 8 Nov 2024 01:02:54 +0000 (17:02 -0800)]
net: sched: cls_api: improve the error message for ID allocation failure
We run into an exhaustion problem with the kernel-allocated filter IDs.
Our allocation problem can be fixed on the user space side,
but the error message in this case was quite misleading:
"Filter with specified priority/protocol not found" (EINVAL)
Specifically, when we can't allocate a _new_ ID because the filter with
the lowest ID already _exists_, saying "filter not found" is confusing.
The kernel allocates IDs in the range 0xc0000 -> 0x8000, giving out an ID
one lower than the lowest existing ID in that range. The error message
makes sense
when tcf_chain_tp_find() gets called for GET and DEL but for NEW we
need to provide more specific error messages for all three cases:
- user wants the ID to be auto-allocated but filter with ID 0x8000
already exists
- filter already exists and can be replaced, but user asked
for a protocol change
- filter doesn't exist
Caller of tcf_chain_tp_insert_unique() doesn't set extack today,
so don't bother plumbing it in.
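Schematically, the NEW path can now report each case distinctly (a
sketch; the conditions and message strings are illustrative, not the
exact ones):

  if (prio_allocate && lowest_id_taken)		/* hypothetical condition */
          NL_SET_ERR_MSG(extack,
                         "Could not allocate a new filter ID: lowest ID already in use");
  else if (tp_found && protocol_mismatch)	/* hypothetical condition */
          NL_SET_ERR_MSG(extack,
                         "Filter already exists with a different protocol");
  else
          NL_SET_ERR_MSG(extack,
                         "Filter with specified priority/protocol not found");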
Introduce a fault injection mechanism to force skb reallocation. The
primary goal is to catch bugs related to pointer invalidation after
potential skb reallocation.
The fault injection mechanism aims to identify scenarios where callers
retain pointers to various headers in the skb but fail to reload these
pointers after calling a function that may reallocate the data. This
type of bug can lead to memory corruption or crashes if the old,
now-invalid pointers are used.
By forcing reallocation through fault injection, we can stress-test code
paths and ensure proper pointer management after potential skb
reallocations.
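The class of bug being targeted looks like this (a deliberately buggy
sketch, with the fix shown):

  static int example_mangle(struct sk_buff *skb, __be32 new_saddr,
                            unsigned int needed_headroom)
  {
          struct iphdr *iph = ip_hdr(skb);

          /* may reallocate skb->head; the fault injection can force it */
          if (pskb_expand_head(skb, needed_headroom, 0, GFP_ATOMIC))
                  return -ENOMEM;

          /* BUG if iph is used here: it points into the old buffer.
           * Correct code reloads any cached header pointers first:
           */
          iph = ip_hdr(skb);
          csum_replace4(&iph->check, iph->saddr, new_saddr);
          iph->saddr = new_saddr;

          return 0;
  }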
Add a hook for fault injection in the following functions:
Changes since v4:
- in the 6th patch: remove the unneeded "else" in ip_expire()
- in the 8th patch: delete the unneeded comment in __mkroute_input()
- in the 9th patch: replace "return 0" with "return SKB_NOT_DROPPED_YET"
in ip_route_use_hint()
Changes since v3:
- don't refactor fib_validate_source/__fib_validate_source, and introduce
a wrapper for fib_validate_source() instead in the 1st patch.
- some small adjustment in the 4-7 patches
Changes since v2:
- refactor fib_validate_source and __fib_validate_source to make
fib_validate_source return drop reasons
- add the 9th and 10th patches to make this series cover the input route
code path
Changes since v1:
- make ip_route_input_noref/ip_route_input_rcu/ip_route_input_slow return
drop reasons, instead of passing a local variable to their function
arguments.
====================
Menglong Dong [Thu, 7 Nov 2024 12:56:01 +0000 (20:56 +0800)]
net: ip: make ip_route_use_hint() return drop reasons
In this commit, we make ip_route_use_hint() return drop reasons. The
drop reasons that we return are similar to what we do in
ip_route_input_slow(), and no drop reasons are added in this commit.
Menglong Dong [Thu, 7 Nov 2024 12:56:00 +0000 (20:56 +0800)]
net: ip: make ip_mkroute_input/__mkroute_input return drop reasons
In this commit, we make ip_mkroute_input() and __mkroute_input() return
drop reasons.
The drop reason "SKB_DROP_REASON_ARP_PVLAN_DISABLE" is introduced for
the case where a packet which is not IP is forwarded to the in_dev and
proxy_arp_pvlan is not enabled.
Menglong Dong [Thu, 7 Nov 2024 12:55:58 +0000 (20:55 +0800)]
net: ip: make ip_route_input_noref() return drop reasons
In this commit, we make ip_route_input_noref() return drop reasons, which
come from ip_route_input_rcu().
We need to adjust the callers of ip_route_input_noref() to make sure the
return value of ip_route_input_noref() is used properly.
In the original logic, the errno that ip_route_input_noref() returns
comes from ip_route_input and bpf_lwt_input_reroute, and we make them
return -EINVAL on error instead. In the following patch, we will make
ip_route_input() return drop reasons too.
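The caller-side pattern after the conversion looks roughly like this (a
sketch; the argument list is abridged):

  enum skb_drop_reason reason;

  reason = ip_route_input_noref(skb, daddr, saddr, tos, dev);
  if (reason) {
          kfree_skb_reason(skb, reason);	/* precise drop accounting */
          return NET_RX_DROP;
  }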
Menglong Dong [Thu, 7 Nov 2024 12:55:57 +0000 (20:55 +0800)]
net: ip: make ip_route_input_rcu() return drop reasons
In this commit, we make ip_route_input_rcu() return drop reasons, which
come from ip_route_input_mc() and ip_route_input_slow().
The only caller of ip_route_input_rcu() is ip_route_input_noref(). We
adjust it by making it return -EINVAL on error and ignore the reasons that
ip_route_input_rcu() returns. In the following patch, we will make
ip_route_input_noref() return the drop reasons.
Menglong Dong [Thu, 7 Nov 2024 12:55:53 +0000 (20:55 +0800)]
net: ip: make fib_validate_source() support drop reasons
In this commit, we make fib_validate_source() and __fib_validate_source()
return -reason instead of errno on error.
The return value of fib_validate_source can be -errno, 0, and 1. It's hard
to make fib_validate_source() return drop reasons directly.
fib_validate_source() will return 1 if the scope of the source (reverse)
route is HOST. And the __mkroute_input() will mark the skb with
IPSKB_DOREDIRECT in this case (combine with some other conditions). And
then, a REDIRECT ICMP will be sent in ip_forward() if this flag exists. We
can't pass this information to __mkroute_input if we make
fib_validate_source() return drop reasons.
Therefore, we introduce the wrapper fib_validate_source_reason() for
fib_validate_source(), which will return the drop reasons on error.
In the original logic, LINUX_MIB_IPRPFILTER will be counted if
fib_validate_source() returns -EXDEV. And now, we need to adjust it by
checking "reason == SKB_DROP_REASON_IP_RPFILTER". However, this will take
effect only after the patch "net: ip: make ip_route_input_noref() return
drop reasons", as we can't pass the drop reasons from
fib_validate_source() to ip_rcv_finish_core() in this patch.
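A sketch of the wrapper described above, relying on the -reason
convention this patch introduces (argument list as in
fib_validate_source()):

  static inline enum skb_drop_reason
  fib_validate_source_reason(struct sk_buff *skb, __be32 src, __be32 dst,
                             u8 tos, int oif, struct net_device *dev,
                             struct in_device *idev, u32 *itag)
  {
          int v = fib_validate_source(skb, src, dst, tos, oif, dev,
                                      idev, itag);

          if (v < 0)
                  return -v;	/* -reason turned back into a reason */
          return SKB_NOT_DROPPED_YET;	/* 0 or 1: not a drop */
  }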
Following new drop reasons are added in this patch:
====================
mlx5 esw qos refactor and SHAMPO cleanup
This patchset for the mlx5 core and Eth drivers consists of 3 parts.
First patch by Patrisious improves the E-switch mode change operation.
The following 6 patches by Carolina introduce further refactoring for
the QoS handling, to set the foundation for future extensions.
In the following 5 patches by Dragos, we enhance the SHAMPO datapath
flow by simplifying some logic, and cleaning up the implementation.
====================
Dragos Tatulea [Thu, 7 Nov 2024 19:43:57 +0000 (21:43 +0200)]
net/mlx5e: SHAMPO, Rework header allocation loop
The current loop code was based on the assumption
that there can be page leftovers from previous function calls.
This patch changes the allocation loop to make it clearer how
pages get allocated every MLX5E_SHAMPO_WQ_HEADER_PER_PAGE headers.
This change has no functional implications.
Dragos Tatulea [Thu, 7 Nov 2024 19:43:56 +0000 (21:43 +0200)]
net/mlx5e: SHAMPO, Drop info array
The info array is used to store a pointer to the
dma address of the header and to the frag page. However,
this array is not really required:
- The frag page can be calculated from the header index
frag page index = header index / headers per page.
- The dma address can be calculated through a formula:
dma page address + header offset.
This series gets rid of the info array and uses the above
formulas instead.
The current_page_index was used in conjunction with the info array to
store page fragment indices. This variable is dropped as well.
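The two formulas in code form (a sketch; the shampo field and variable
names are assumptions):

  /* frag page index = header index / headers per page */
  frag_page = &shampo->pages[header_index / MLX5E_SHAMPO_WQ_HEADER_PER_PAGE];

  /* dma address = dma page address + header offset */
  addr = shampo->pages_dma[header_index / MLX5E_SHAMPO_WQ_HEADER_PER_PAGE] +
         (header_index % MLX5E_SHAMPO_WQ_HEADER_PER_PAGE) * header_size;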
Dragos Tatulea [Thu, 7 Nov 2024 19:43:55 +0000 (21:43 +0200)]
net/mlx5e: SHAMPO, Change frag page setup order during allocation
Now that the UMR allocation has been simplified, it is no longer
possible to have a leftover page from a previous call to
mlx5e_build_shampo_hd_umr().
This patch simplifies the code by switching the order of operations:
first take the frag page and then increment the index. This is more
straightforward and it also paves the way for dropping the info
array.
When calculating the index for the next frag page slot, the divisor is
incorrect: it should be the number of pages per queue not the number of
headers per queue. This is currently harmless because frag pages are not
used directly, but they are intermediated through the info array. But it
needs to be fixed as an upcoming patch will get rid of the info array.
This patch introduces a new pages per queue variable and plugs it in the
formula.
Now that this variable exists, additional code can be simplified in the
SHAMPO initialization code.
Dragos Tatulea [Thu, 7 Nov 2024 19:43:53 +0000 (21:43 +0200)]
net/mlx5e: SHAMPO, Simplify UMR allocation for headers
Allocating page fragments for header data split is currently
more complicated than it should be. That's because the number
of KSM entries allocated is not aligned to the number of headers
per page. This leads to having leftovers in the next allocation
which require additional accounting and needlessly complicated
code.
This patch aligns (down) the number of KSM entries in the
UMR WQE to the number of headers per page by:
1) Aligning the max number of entries allocated per UMR WQE
(max_ksm_entries) to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE.
2) Aligning the total number of free headers to
MLX5E_SHAMPO_WQ_HEADER_PER_PAGE.
... and then it drops the extra accounting code from
mlx5e_build_shampo_hd_umr().
Although the number of entries allocated per UMR WQE is slightly
smaller due to aligning down, no performance impact was observed.
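In code, both steps are plain ALIGN_DOWN() operations (a sketch; names
other than MLX5E_SHAMPO_WQ_HEADER_PER_PAGE are assumptions):

  /* 1) cap the per-WQE maximum to whole pages' worth of headers */
  max_ksm_entries = ALIGN_DOWN(max_ksm_entries,
                               MLX5E_SHAMPO_WQ_HEADER_PER_PAGE);

  /* 2) round the number of free headers down the same way */
  ksm_entries = min_t(u32, free_headers, max_ksm_entries);
  ksm_entries = ALIGN_DOWN(ksm_entries, MLX5E_SHAMPO_WQ_HEADER_PER_PAGE);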
Carolina Jubran [Thu, 7 Nov 2024 19:43:52 +0000 (21:43 +0200)]
net/mlx5: Make vport QoS enablement more flexible for future extensions
Refactor esw_qos_vport_enable to support more generic configurations,
allowing it to be reused for new vport node types in future patches.
This refactor includes a new way to change the vport parent node by
disabling the current setup and re-enabling it with the new parent.
This change sets the foundation for adapting configuration based on the
parent type in future patches.
Carolina Jubran [Thu, 7 Nov 2024 19:43:51 +0000 (21:43 +0200)]
net/mlx5: Integrate esw_qos_vport_enable logic into rate operations
Fold the esw_qos_vport_enable function into operations for configuring
maximum and minimum rates, simplifying QoS logic. This change
consolidates enabling and updating the scheduling element
configuration, streamlining how vport QoS is initialized and adjusted.
Carolina Jubran [Thu, 7 Nov 2024 19:43:50 +0000 (21:43 +0200)]
net/mlx5: Generalize scheduling element operations
Introduce helper functions to create and destroy scheduling elements,
allowing flexible configuration for different scheduling element types.
The new helper functions streamline the process by centralizing error
handling and logging through esw_qos_sched_elem_op_warn, which now
accepts the operation type (create, destroy, or modify).
The changes also adjust the esw_qos_vport_enable and
mlx5_esw_qos_vport_disable functions to leverage the new generalized
create/destroy helpers.
The destroy functions now log errors with esw_warn without returning
them. This prevents unnecessary error handling since the node was
already destroyed and no further action is required from callers.
Carolina Jubran [Thu, 7 Nov 2024 19:43:49 +0000 (21:43 +0200)]
net/mlx5: Refactor scheduling element configuration bitmasks
Refactor esw_qos_sched_elem_config to set bitmasks only when max_rate
or bw_share values change, allowing the function to configure nodes
with only one of these parameters.
This enables more flexible usage for nodes where only one parameter
requires configuration.
Remove scattered assignments and checks to centralize them within this
function, removing the now redundant esw_qos_set_node_max_rate
entirely.
With this refactor, also remove the assignment of the vport scheduling
node max rate to the parent max rate for unlimited vports
(where max rate is set to zero), as firmware already handles this
behavior.
Carolina Jubran [Thu, 7 Nov 2024 19:43:47 +0000 (21:43 +0200)]
net/mlx5: Simplify QoS normalization by removing error handling
This change updates esw_qos_normalize_min_rate to not return errors,
significantly simplifying the code.
Normalization failures are software bugs, and it's unnecessary to
handle them with rollback mechanisms. Instead,
`esw_qos_update_sched_node_bw_share` and `esw_qos_normalize_min_rate`
now return void, with any errors logged as warnings to indicate
potential software issues.
This approach avoids compensating for hidden bugs and removes error
handling from all places that perform normalization, streamlining
future patches.
The E-switch mode was previously updated before removing and re-adding the
IB device, which could cause a temporary mismatch between the E-switch mode
and the IB device configuration.
To prevent this discrepancy, the IB device is now removed first, then
the E-switch mode is updated, and finally, the IB device is re-added.
This sequence ensures consistent alignment between the E-switch mode and
the IB device whenever the mode changes, regardless of the new mode value.
Vladimir Vdovin [Fri, 8 Nov 2024 09:34:24 +0000 (09:34 +0000)]
net: ipv4: Cache pmtu for all packet paths if multipath enabled
Check the number of paths with fib_info_num_path(),
and call update_or_create_fnhe() for every path.
The problem is that the pmtu is cached only for the oif
that received the icmp "fragmentation needed" message;
other oifs will still try to use the "default" iface mtu.
host1 has multipath enabled and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at the ens17f1 iface an icmp packet from ro3 saying that
ro3's mtu=1500, and host1 caches it in the nexthop exceptions cache.
The problem is that it is cached only for the iface that received the
icmp, and there is no way that ro3 will send an icmp msg to host1 via
another path.
Host1 now has these routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if the packet flow is lucky enough to be hashed with oif=ens17f1 it's ok,
but if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to
ens17f1, until the lucky day when ro3 sends them through another flow to
ens17f0.
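Schematically, the fix walks every nexthop instead of only the one the
icmp arrived on (a sketch; update_or_create_fnhe() is a static helper in
net/ipv4/route.c and its argument list is abridged here):

  struct fib_info *fi = res->fi;
  int i;

  for (i = 0; i < fib_info_num_path(fi); i++) {
          struct fib_nh_common *nhc = fib_info_nhc(fi, i);

          /* cache the learned pmtu on this path as well */
          update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
                                jiffies + net->ipv4.ip_rt_mtu_expires);
  }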
Breno Leitao [Fri, 8 Nov 2024 14:59:25 +0000 (06:59 -0800)]
net: netconsole: selftests: Check if netdevsim is available
The netconsole selftest relies on the availability of the netdevsim module.
To ensure the test can run correctly, we need to check if the netdevsim
module is either loaded or built-in before proceeding.
Update the netconsole selftest to check for the existence of
the /sys/bus/netdevsim/new_device file before running the test. If the
file is not found, the test is skipped with an explanation that the
CONFIG_NETDEVSIM kernel config option may not be enabled.
net: phylink: remove switch() statement in resolve handling
The switch() statement doesn't sit very well with the preceding if()
statements, so let's just convert everything to if()s. As a result of
the two preceding commits, there is now only one case in the switch()
statement. Remove the switch statement and reduce the code indentation.
Code reformatting will be in the following commit.
net: phylink: move MLO_AN_PHY resolve handling to if() statement
The switch() statement doesn't sit very well with the preceding if()
statements, and results in excessive indentation that spoils code
readability. Continue cleaning this up by converting the MLO_AN_PHY
case to use an if() statement.
net: phylink: move MLO_AN_FIXED resolve handling to if() statement
The switch() statement doesn't sit very well with the preceding if()
statements, and results in excessive indentation that spoils code
readability. Begin cleaning this up by converting the MLO_AN_FIXED case
to an if() statement.
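After the three patches, the resolve logic reads as a plain if/else
chain, roughly as follows (a sketch; the names approximate phylink
internals):

  if (pl->cur_link_an_mode == MLO_AN_FIXED) {
          phylink_get_fixed_state(pl, &link_state);
  } else if (pl->cur_link_an_mode == MLO_AN_PHY) {
          link_state = pl->phy_state;
  } else {
          /* MLO_AN_INBAND: read the link state from the PCS */
          phylink_mac_pcs_get_state(pl, &link_state);
  }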
====================
Suspend IRQs during application busy periods
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
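From the application side, the only new syscall-level step is the
pre-existing EPIOCSPARAMS ioctl; a minimal userspace sketch (the suspend
timeout itself is set out-of-band via netdev-genl):

  #include <sys/ioctl.h>
  #include <sys/epoll.h>
  #include <linux/eventpoll.h>	/* struct epoll_params, EPIOCSPARAMS */

  static int enable_prefer_busy_poll(int epfd)
  {
          struct epoll_params p = {
                  .busy_poll_usecs  = 0,	/* 0 still busy-polls once
                                           * prefer_busy_poll is set;
                                           * see the call out below */
                  .busy_poll_budget = 64,
                  .prefer_busy_poll = 1,
          };

          return ioctl(epfd, EPIOCSPARAMS, &p);
  }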
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_NAPI_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkward to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Joe Damato [Sat, 9 Nov 2024 05:02:35 +0000 (05:02 +0000)]
selftests: net: Add busy_poll_test
Add an epoll busy poll test using netdevsim.
This test consists of:
- busy_poller (via busy_poller.c)
- busy_poll_test.sh which loads netdevsim, sets up network namespaces,
and runs busy_poller to receive data and socat to send data.
The selftest tests two different scenarios:
- busy poll (the pre-existing version in the kernel)
- busy poll with suspend enabled (what this series adds)
The data transmit is a 1MiB temporary file generated from /dev/urandom
and the test is considered passing if the md5sum of the input file to
socat matches the md5sum of the output file from busy_poller.
netdevsim was chosen instead of veth due to netdevsim's support for
netdev-genl.
For now, this test uses the functionality that netdevsim provides. In the
future, perhaps netdevsim can be extended to emulate device IRQs to more
thoroughly test all pre-existing kernel options (like defer_hard_irqs)
and suspend.
Add a per-NAPI IRQ suspension parameter, which can be get/set with
netdev-genl.
This patch doesn't change any behavior but prepares the code for other
changes in the following commits which use irq_suspend_timeout as a
timeout for IRQ suspension.
Vadim Fedorenko [Thu, 7 Nov 2024 21:49:17 +0000 (13:49 -0800)]
bnxt_en: add unlocked version of bnxt_refclk_read
Serialization of PHC read with FW reset mechanism uses ptp_lock which
also protects timecounter updates. This means we cannot grab it when
called from bnxt_cc_read(). Let's move the locking into a different function.
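Schematically, the split looks like this (a sketch; the bnxt names and
signatures here are assumptions):

  /* unlocked: for callers that already hold ptp_lock, e.g. the
   * timecounter path via bnxt_cc_read()
   */
  static u64 __bnxt_refclk_read(struct bnxt *bp)
  {
          return bnxt_read_phc_regs(bp);	/* hypothetical register read */
  }

  static u64 bnxt_refclk_read(struct bnxt *bp)
  {
          struct bnxt_ptp_cfg *ptp = bp->ptp_cfg;
          unsigned long flags;
          u64 ns;

          spin_lock_irqsave(&ptp->ptp_lock, flags);
          ns = __bnxt_refclk_read(bp);
          spin_unlock_irqrestore(&ptp->ptp_lock, flags);

          return ns;
  }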
====================
rtnetlink: Convert rtnl_newlink() to per-netns RTNL.
Patch 1 - 3 removes __rtnl_link_unregister and protect link_ops by
its dedicated mutex to move synchronize_srcu() out of RTNL scope.
Patch 4 introduces struct rtnl_nets and helper functions to acquire
multiple per-netns RTNL in rtnl_newlink().
Patch 5 - 8 are to prefetch the peer device's netns in rtnl_newlink().
Patch 9 converts rtnl_newlink() to per-netns RTNL.
Patch 10 pushes RTNL down to rtnl_dellink() and rtnl_setlink(), but
the conversion will not be completed unless we support cases with
peer/upper/lower devices.
I confirmed v3 survived ./rtnetlink.sh; rmmod netdevsim.ko; without
lockdep splat.
rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP.
Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted
to per-netns RTNL due to a lack of handling peer/lower/upper devices in
different netns.
For example, when we change a device in rtnl_setlink() and need to
propagate that to its upper devices, we want to avoid acquiring all netns
locks, for which we do not know the upper limit.
The same situation happens when we remove a device.
rtnl_dellink() could be transformed to remove a single device in the
requested netns and delegate other devices to per-netns work, and
rtnl_setlink() might be transformed similarly.
Until we come up with a better idea, let's use a new flag
RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink().
This will unblock converting RTNL users where such devices are not related.
rtnetlink: Introduce struct rtnl_nets and helpers.
rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device
and 1 for its peer.
We will add rtnl_nets_lock() later, which performs the nested locking
based on struct rtnl_nets, which has an array of struct net pointers.
rtnl_nets_add() adds a net pointer to the array and sorts it so that
rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2].
Before calling rtnl_nets_add(), get_net() must be called for the net,
and rtnl_nets_destroy() will call put_net() for each.
Let's apply the helpers to rtnl_newlink().
When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call
rtnl_net_lock() thus do not care about the array order, so
rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add()
can be optimised to NOP.
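A sketch of the helpers' shape (details beyond what the text above
states are assumptions):

  struct rtnl_nets {
          struct net *net[3];	/* 2 for the new device + 1 for its peer */
          unsigned char len;
  };

  static void rtnl_nets_add(struct rtnl_nets *rn, struct net *net)
  {
          int i;

          /* caller already did get_net(net); rtnl_nets_destroy() puts it */
          for (i = 0; i < rn->len; i++)
                  if (rtnl_net_cmp_locks(rn->net[i], net) > 0)
                          break;

          memmove(&rn->net[i + 1], &rn->net[i],
                  (rn->len - i) * sizeof(struct net *));
          rn->net[i] = net;
          rn->len++;
  }

  static void rtnl_nets_lock(struct rtnl_nets *rn)
  {
          int i;

          for (i = 0; i < rn->len; i++)
                  rtnl_net_lock(rn->net[i]);	/* array[0] -> array[2] */
  }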